Best questions in January 2012

Java += operator

330 votes

until today I thought that for example:

i += j;

is just a shortcut for:

i = i + j;

But what if we try this:

int i = 5;
long j = 8;

Then i = i + j; will not compile but i += j; will compile fine.

Does it mean that in fact i += j; is a shortcut for something like this i = (type of i) (i + j)?

I've tried googling for it but couldn't find anything relevant.

As always with these questions, the JLS holds the answer. In this case §15.26.2 Compound Assignment Operators. An extract:

A compound assignment expression of the form E1 op= E2 is equivalent to E1 = (T)((E1) op (E2)), where T is the type of E1, except that E1 is evaluated only once.

And an example:

For example, the following code is correct:

short x = 3;
x += 4.6;

and results in x having the value 7 because it is equivalent to:

short x = 3;
x = (short)(x + 4.6);

In other words, you're right with your assumption

Is there a reason for C#'s reuse of the variable in a foreach?

290 votes

When using lambda expressions or anonymous methods in C#, we have to be wary of the access to modified closure pitfall. For example:

    foreach (var s in strings)
    {
        query = query.Where(i => i.Prop == s); // access to modified closure

Due to the modified closure, the above code will cause all of the Where clauses on the query to be based on the final value of s.

As explained here, this happens because the s variable declared in foreach loop above is translated like this in the compiler:

    string s;
    while (enumerator.MoveNext())
    {
        s = enumerator.Current;
        ...

... instead of like this:

    while (enumerator.MoveNext())
    {
        string s;
        s = enumerator.Current;

As pointed out here, there are no performance advantages to declaring a variable outside the loop, and under normal circumstances the only reason I can think of for doing this is if you plan to use the variable outside the scope of the loop:

    } // end while
    var finalString = s;

However, variables defined in a foreach loop cannot be used outside the loop:

foreach(string s in strings)
{
}
var finalString = s; // won't work: you're outside the scope.

So the compiler declares the variable in a way that makes it highly prone to an error that is often difficult to find and debug, while producing no perceivable benefits.

Is there something you can do with foreach loops this way that you couldn't if they were compiled with an inner-scoped variable, or is this just an arbitrary choice that was made before anonymous methods and lambda expressions were available or common, and which hasn't been revised since then?

The compiler declares the variable in a way that makes it highly prone to an error that is often difficult to find and debug, while producing no perceivable benefits.

Your criticism is entirely justified.

I discuss this problem in detail here:

http://blogs.msdn.com/b/ericlippert/archive/2009/11/12/closing-over-the-loop-variable-considered-harmful.aspx

Is there something you can do with foreach loops this way that you couldn't if they were compiled with an inner-scoped variable? or is this just an arbitrary choice that was made before anonymous methods and lambda expressions were available or common, and which hasn't been revised since then?

The latter. The C# 1.0 specification actually did not say whether the loop variable was inside or outside the loop body, as it make no observable difference. When closure semantics were introduced in C# 2.0, the choice was made to put the loop variable outside the loop, consistent with the "for" loop.

I think it is fair to say that all regret that decision. This is one of the worst "gotchas" in C#, and we are going to take the breaking change to fix it. In C# 5 the foreach loop variable will be logically inside the body of the loop, and therefore closures will get a fresh copy every time.

The for loop will not be changed, and the change will not be "back ported" to previous versions of C#. You should therefore continue to be careful when using this idiom.

Try-catch speeding up my code?

288 votes

I wrote some code for testing the impact of try-catch, but seeing some surprising results.

static void Main(string[] args)
{
    Thread.CurrentThread.Priority = ThreadPriority.Highest;
    Process.GetCurrentProcess().PriorityClass = ProcessPriorityClass.RealTime;

    long start = 0, stop = 0, elapsed = 0;
    double avg = 0.0;

    long temp = Fibo(1);

    for (int i = 1; i < 100000000; i++)
    {
        start = Stopwatch.GetTimestamp();
        temp = Fibo(100);
        stop = Stopwatch.GetTimestamp();

        elapsed = stop - start;
        avg = avg + ((double)elapsed - avg) / i;
    }

    Console.WriteLine("Elapsed: " + avg);
    Console.ReadKey();
}

static long Fibo(int n)
{
    long n1 = 1, n2 = 1, fibo = 1;
    n++;

    for (int i = 2; i < n; i++)
    {
        n1 = n2;
        n2 = fibo;
        fibo = n1 + n2;
    }

    return fibo;
}

On my computer, this consistently prints out a value around 0.96..

When I wrap the for loop inside Fibo() with a try-catch block like this:

static long Fibo(int n)
{
    long n1 = 1, n2 = 1, fibo = 1;
    n++;

    try
    {
        for (int i = 2; i < n; i++)
        {
            n1 = n2;
            n2 = fibo;
            fibo = n1 + n2;
        }
    }
    catch {}

    return fibo;
}

Now it consistently prints out 0.69... -- it actually runs faster! But why?

Note: I compiled this using the Release configuration and directly ran the exe (outside Visual Studio).

EDIT: Jon Skeet's excellent analysis shows that try-catch is somehow causing the x86 CLR to use the cpu registers in a more favorable way in this specific case (and I think we're yet to understand why). I confirmed Jon's finding that x64 CLR doesn't have this difference, and that it was faster than the x86 CLR. I also tested using int types inside the Fibo method instead of long types, and then the x86 CLR was as equally fast as the x64 CLR.

One of the Roslyn engineers who specializes in understanding optimization of stack usage took a look at this and reports to me that there seems to be a problem in the interaction between the way the C# compiler generates local variable stores and the way the jit compiler does register scheduling in the corresponding x86 code. The result is suboptimal codegen on the loads and stores of the locals.

For some reason unclear to all of us, the problematic codegen path is avoided when the jitter knows that the block is in a try-protected region.

This is pretty weird. We'll follow up with the jitter team and see if we can get a bug entered so that they can fix this up.

Also, we are working on improvements for Roslyn to the C# and VB compilers' algorithms for determining when locals can be made "ephemeral" -- that is, just pushed and popped on the stack, rather than allocated a specific location on the stack for the duration of the activation. We believe that the jitter will be able to do a better job of register allocation and whatnot if we give it better hints about when locals can be made "dead" earlier.

Thanks for bringing this to our attention, and apologies for the odd behaviour.

Why is char[] preferred over string for passwords?

285 votes

In Swing, the password field has a getPassword() (returns char[]) method instead of usual getText() (returns String) method. Similarly, I have come across a suggestion not to use Strings to handle passwords. Why does String pose a threat to security when it comes to passwords? It feels inconvenient to use char[]. Can anyone explain this to me?

Strings are immutable. That means once you've created the string, if another process can dump memory, there's no way (aside from reflection) you can get rid of the data before GC kicks in.

With an array, you can explicitly wipe the data after you're done with it: you can overwrite the array with anything you like, and the password won't be present anywhere in the system, even before garbage collection.

So yes, this is a security concern - but even using char[] only reduces the window of opportunity for an attacker, and it's only for this specific type of attack.

EDIT: As noted in comments, it's possible that arrays being moved by the garbage collector will leave stray copies of the data in memory. I believe this is implementation-specific - the GC may clear all memory as it goes, to avoid this sort of thing. Even if it does, there's still the time during which the char[] contains the actual characters as an attack window.

Can anyone explain these bizarre Javascript behaviours mentioned in the 'Wat' talk for CodeMash 2012?

235 votes

The talk is here: https://www.destroyallsoftware.com/talks/wat

It basically points out a few bizarre quirks with Ruby and Javascript.

I have made a JSFiddle of the results here: http://jsfiddle.net/fe479/1/

The behaviours specific to Javascript (As I don't know Ruby) are listed below.

I found in the JSFiddle that some of my results didn't correspond with those in the video, I am not sure why. I am however curious to know how Javascript is handling working behind the scenes in each case.

Empty Array + Empty Array
[] + []
result:
<Empty String>

I am quite curious about the + operator when used with arrays in Javascript. This matches the video's result.

Empty Array + Object
[] + {}
result:
[Object]

This matches the video's result. What's going on here? Why is this an object. What does the + operator do?

Object + Empty Array
{} + []
result
[Object]

This doesn't match the video. The video suggests that the result is 0, whereas I get [Object].

Object + Object
{} + {}
result:
[Object][Object]

This doesn't match the video either, nor can I understand how outputting a variable can result in two objects. Any ideas? Maybe my JSFiddle is wrong.

Array(16).join("wat" - 1)
result:
NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN

doing wat + 1 results in wat1wat1wat1wat1...

I suspect this is just straightforward behaviour that trying to subtract a number from a string results in NaN.

Any explanations?

Here's a list of explanations for the results you're seeing (and supposed to be seeing). The references I'm using are from the ECMA-262 standard.

  1. [] + []

    When using the addition operator, both the left and right operands are converted to primitives first (§11.6.1). As per §9.1, converting an object (in this case an array) to a primitive returns its default value, which for objects with a valid toString() method is the result of calling object.toString() (§8.12.8). For arrays this is the same as calling array.join() (§15.4.4.2). Joining an empty array results in an empty string, so step #7 of the addition operator returns the concatenation of two empty strings, which is the empty string.

  2. [] + {}

    Similar to [] + [], both operands are converted to primitives first. For "Object objects" (§15.2), this is again the result of calling object.toString(), which for non-null, non-undefined objects is "[object Object]" (§15.2.4.2).

  3. {} + []

    The {} here is not parsed as an object, but instead as an empty block (§12.1, at least as long as you're not forcing that statement to be an expression, but more about that later). The return value of empty blocks is empty, so the result of that statement is the same as +[]. The unary + operator (§11.4.6) returns ToNumber(ToPrimitive(operand)). As we already know, ToPrimitive([]) is the empty string, and according to §9.3.1, ToNumber("") is 0.

  4. {} + {}

    Similar to the previous case, the first {} is parsed as a block with empty return value. Again, +{} is the same as ToNumber(ToPrimitive({})), and ToPrimitive({}) is "[object Object]" (see [] + {}). So to get the result of +{}, we have to apply ToNumber on the string "[object Object]". When following the steps from §9.3.1, we get NaN as a result:

    If the grammar cannot interpret the String as an expansion of StringNumericLiteral, then the result of ToNumber is NaN.

  5. Array(16).join("wat" - 1)

    As per §15.4.1.1 and §15.4.2.2, Array(16) creates a new array with length 16. To get the value of the argument to join, §11.6.2 steps #5 and #6 show that we have to convert both operands to a number using ToNumber. ToNumber(1) is simply 1 (§9.3), whereas ToNumber("wat") again is NaN as per §9.3.1. Following step 7 of §11.6.2, §11.6.3 dictates that

    If either operand is NaN, the result is NaN.

    So the argument to Array(16).join is NaN. Following §15.4.4.5 (Array.prototype.join), we have to call ToString on the argument, which is "NaN" (§9.8.1):

    If m is NaN, return the String "NaN".

    Following step 10 of §15.4.4.5, we get 15 repetitions of the concatenation of "NaN" and the empty string, which equals the result you're seeing. When using "wat" + 1 instead of "wat" - 1 as argument, the addition operator converts 1 to a string instead of converting "wat" to a number, so it effectively calls Array(16).join("wat1").

As to why you're seeing different results for the {} + [] case: When using it as a function argument, you're forcing the statement to be an ExpressionStatement, which makes it impossible to parse {} as empty block, so it's instead parsed as an empty object literal.

Why can I initialize a List like an array in C#?

83 votes

Today I was surprised to find that in C# I can do:

List<int> a = new List<int> { 1, 2, 3 };

Why can I do this? What constructor is called? How can I do this with my own classes? I know that this is the way to initialize arrays but arrays are language items and Lists are simple objects ...

This is part of the collection initializer syntax in .NET. You can use this syntax on any collection you create as long as:

  • It implements IEnumerable (preferably IEnumerable<T>)

  • It has a method named Add(...)

What happens is the default constructor is called, and then Add(...) is called for each member of the initializer.

Thus, these two blocks are roughly identical:

List<int> a = new List<int> { 1, 2, 3 };

And

List<int> temp = new List<int>();
temp.Add(1);
temp.Add(2);
temp.Add(3);
List<int> a = temp;

You can call an alternate constructor if you want, for example to prevent over-sizing the List<T> during growing, etc:

// Notice, calls the List constructor that takes an int arg
// for initial capacity, then Add()'s three items.
List<int> a = new List<int>(3) { 1, 2, 3, }

Note that the Add() method need not take a single item, for example the Add() method for Dictionary<TKey, TValue> takes two items:

var grades = new Dictionary<string, int>
    {
        { "Suzy", 100 },
        { "David", 98 },
        { "Karen", 73 }
    };

Is roughly identical to:

var temp = new Dictionary<string, int>();
temp.Add("Suzy", 100);
temp.Add("David", 98);
temp.Add("Karen", 73);
var grades = temp;

So, to add this to your own class, all you need do, as mentioned, is implement IEnumerable (again, preferably IEnumerable<T>) and create one or more Add() methods:

public class SomeCollection<T> : IEnumerable<T>
{
    // implement Add() methods appropriate for your collection
    public void Add(T item)
    {
        // your add logic    
    }

    // implement your enumerators for IEnumerable<T> (and IEnumerable)
    public IEnumerator<T> GetEnumerator()
    {
        // your implementation
    }

    IEnumerator IEnumerable.GetEnumerator()
    {
        return GetEnumerator();
    }
}

Then you can use it just like the BCL collections do:

public class MyProgram
{
    private SomeCollection<int> _myCollection = new SomeCollection<int> { 13, 5, 7 };    

    // ...
}

(For more information, see the MSDN)

How to drive C#, C++ or Java compiler to compute 1+2+3+...+1000?

69 votes

In a recent interview, I was asked a really strange question. The interviewer asked me how can I compute 1+2+3+...+1000 just using compiler features. This means that I am not allowed to write a program and execute it, but I should just write a program that could drive the compiler to compute this sum while compilation and print the result when compilation completes. As a hint, he told me that I may use generics and pre-processor features of the compiler. It is possible to use C++, C# or Java compiler. Any ideas???

This question is not related to computing the sum without any loops asked here. In addition, It should be noted that the sum SHOULD be calculated during compilation. Printing just the result using C++ compiler directives is not acceptable.

Edit

Reading more about the posted answers, I found that solving problems during compilation using C++ templates is called metaprogramming. This is a technique that was discovered accidentally by Dr. Erwin Unruh, during the process of standardizing the C++ language. You may read more about this topic on wiki page of meta-programming. It seems that it is possible to write the program in Java using java annotations. Take look at maress's answer below.

Updated Now with improved recursion depth! Works on MSVC10 and GCC without increased depth. :)


Simple compile-time recursion + addition:

template<unsigned Cur, unsigned Goal>
struct adder{
  static unsigned const sub_goal = (Cur + Goal) / 2;
  static unsigned const tmp = adder<Cur, sub_goal>::value;
  static unsigned const value = tmp + adder<sub_goal+1, Goal>::value;
};

template<unsigned Goal>
struct adder<Goal, Goal>{
  static unsigned const value = Goal;
};

Testcode:

template<unsigned Start>
struct sum_from{
  template<unsigned Goal>
  struct to{
    template<unsigned N>
    struct equals;

    typedef equals<adder<Start, Goal>::value> result;
  };
};

int main(){
  sum_from<1>::to<1000>::result();
}

Output for GCC:

error: declaration of ‘struct sum_from<1u>::to<1000u>::equals<500500u>’

Live example on Ideone.

Output for MSVC10:

error C2514: 'sum_from<Start>::to<Goal>::equals<Result>' : class has no constructors
      with
      [
          Start=1,
          Goal=1000,
          Result=500500
      ]

How can I find the missing value more concisely?

68 votes

The following code checks if x and y are distinct values (the variables x, y, z can only have values a, b, or c) and if so, sets z to the third character:

if x == 'a' and y == 'b' or x == 'b' and y == 'a':
    z = 'c'
elif x == 'b' and y == 'c' or x == 'c' and y == 'b':
    z = 'a'
elif x == 'a' and y == 'c' or x == 'c' and y == 'a':
    z = 'b'

Is is possible to do this in a more, concise, readable and efficient way?

z = (set(("a", "b", "c")) - set((x, y))).pop()

I am assuming that one of the three cases in your code holds. If this is the case, the set set(("a", "b", "c")) - set((x, y)) will consist of a single element, which is returned by pop().

Edit: As suggested by Raymond Hettinger in the comments, you could also use tuple unpacking to extract the single element from the set:

z, = set(("a", "b", "c")) - set((x, y))

How is Math.Pow() implemented in .Net Framework?

68 votes

I was looking for an efficient approach for calculating ab (say a = 2 and b = 50). To start things up, I decided to take a look at the implementation of Math.Pow() function. But in .Net reflector, all I found was this:

[MethodImpl(MethodImplOptions.InternalCall), SecuritySafeCritical]
public static extern double Pow(double x, double y);

Can anyone tell me or point me to some of the resources wherein I can see as what's going on inside when I call Math.Pow() function.

MethodImplOptions.InternalCall

That means that the method is actually implemented in the CLR, written in C++. The just-in-time compiler consults a table with internally implemented methods and compiles the call to the C++ function directly.

Having a look at the code requires the source code for the CLR. You can get that from the SSCLI20 distribution. It was written around the .NET 2.0 time frame, I've found the low-level implementations, like Math.Pow() to be still largely accurate for later versions of the CLR.

The lookup table is located in clr/src/vm/ecall.cpp. The section that's relevant to Math.Pow looks like this:

FCFuncStart(gMathFuncs)
    FCIntrinsic("Sin", COMDouble::Sin, CORINFO_INTRINSIC_Sin)
    FCIntrinsic("Cos", COMDouble::Cos, CORINFO_INTRINSIC_Cos)
    FCIntrinsic("Sqrt", COMDouble::Sqrt, CORINFO_INTRINSIC_Sqrt)
    FCIntrinsic("Round", COMDouble::Round, CORINFO_INTRINSIC_Round)
    FCIntrinsicSig("Abs", &gsig_SM_Flt_RetFlt, COMDouble::AbsFlt, CORINFO_INTRINSIC_Abs)
    FCIntrinsicSig("Abs", &gsig_SM_Dbl_RetDbl, COMDouble::AbsDbl, CORINFO_INTRINSIC_Abs)
    FCFuncElement("Exp", COMDouble::Exp)
    FCFuncElement("Pow", COMDouble::Pow)
    // etc..
FCFuncEnd()

Searching for "COMDouble" takes you to clr/src/classlibnative/float/comfloat.cpp. I'll spare you the code, just have a look for yourself. It basically checks for corner cases, then calls the CRT's version of pow().

The only other implementation detail that's interesting is the FCIntrinsic macro in the table. That's a hint that the jitter may implement the function as an intrinsic. In other words, substitute the function call with a floating point machine code instruction. Which is not the case for Pow(), there is no FPU instruction for it. But certainly for the other simple operations. Notable is that this can make floating point math in C# substantially faster than the same code in C++, check this answer for the reason why.

Btw, the source code for the CRT is also available if you have the full version of VS. vc/crt/src directory. You'll hit the wall on pow() though, Microsoft purchased that code from Intel. Doing a better job than the Intel engineers is unlikely. Although my high-school book's identity was twice as fast when I tried it:

    public static double fasterPow(double x, double y) {
        return Math.Exp(y * Math.Log(x));
    }

Java getter chaining bad or good?

54 votes

To prevent monster constructors and monster interfaces with oversized delegating classes, I use alot of classes that hold other objects which again hold other objects. Therefore my code looks like this alot.

this.mainObject.getA().getAB().getABA().getABAC().doSomething();

The style checking packages don't compain about this and metrics on loose coupling are OK. I know there is this question on method chaining, but it is not quite the same as getter chaining, on which I can find little guidance. Is my way "correct", or is there a better design pattern? Or is the truth somewhere in between and one should use chained getters only if their length is smaller than e.g. 5?

I'm looking for a best pratice style or even some style standard if such exists.

Aside from what others have said about null references, you of course have the Law of Demeter. Other answers have done a pretty good job of describing the Law of Demeter, but I think its worth clarifying some exceptions as called out in Clean Code.

According to Martin, the law of demeter doesn't apply in purely data/structural (ie "struct like") relationships. IE Student.GetAddress().GetStreetName() is perfectly acceptable. Martin actually advocates using plain structs for these kinds of relationships, but alas this is Java. The Law of Demeter however, would apply if you involved mutators, or provided writable access to an internal class with mutators ie Student.GetLastTest().ChangeGradeToA() becomes more problematic because you are probably intending to perform an operation on Student that could easily have a better name. Moreover you are tightly coupling the client of Student to whatever GetLastTest returns.

As I advocate its always advisable to cleanly separate data/structural relationships from relationships where one class is attempting to act on another.

Why does Double.NaN==Double.NaN return false?

54 votes

I was just studying OCPJP questions and I found this strange code:

public static void main(String a[]) {
    System.out.println(Double.NaN==Double.NaN);
    System.out.println(Double.NaN!=Double.NaN);
}

When I ran the code, I got:

false
true

How is the output false when we're comparing two things that look the same as each other? What does NaN mean?

NaN means "Not a Number".

Java Language Specification (JLS) says:

Floating-point operators produce no exceptions (§11). An operation that overflows produces a signed infinity, an operation that underflows produces a denormalized value or a signed zero, and an operation that has no mathematically definite result produces NaN. All numeric operations with NaN as an operand produce NaN as a result. As has already been described, NaN is unordered, so a numeric comparison operation involving one or two NaNs returns false and any != comparison involving NaN returns true, including x!=x when x is NaN.

Whats a more memory efficient way of appending to TextBox.Text during a loop?

53 votes

Short Question

I have a loop that runs 180,000 times. At the end of each iteration it is supposed to append the results to a TextBox, which is updated real-time.

Using MyTextBox.Text += someValue is causing the application to eat huge amounts of memory, and it runs out of available memory after a few thousand records.

Is there a more efficient way of appending text to a TextBox.Text 180,000 times?

Edit I really don't care about the result of this specific case, however I want to know why this seems to be a memory hog and if there is a more efficient way to append text to a TextBox


Long (Original) Question

I have a small app which reads a list of ID numbers in a CSV file and generates a PDF report for each one. After each pdf file is generated, the ResultsTextBox.Text gets appended with the ID Number of the report that got processed and that it was successfully processed. The process runs on a background thread, so the ResultsTextBox gets updated real-time as items get processed

I am currently running the app against 180,000 ID numbers, however the memory the application is taking up is growing exponentially as time goes by. It starts by around 90K, but by about 3000 records it is taking up roughly 250MB and by 4000 records the application is taking up about 500 MB of memory.

If I comment out the update to the Results TextBox, the memory stays relatively stationary at roughly 90K, so I can assume that writing ResultsText.Text += someValue is what is causing it to eat memory.

My question is, why is this? What is a better way of appending data to a TextBox.Text that doesn't eat memory?

My code looks like this:

try
{
    report.SetParameterValue("Id", id);

    report.ExportToDisk(ExportFormatType.PortableDocFormat,
        string.Format(@"{0}\{1}.pdf", new object[] { outputLocation, id}));

    // ResultsText.Text += string.Format("Exported {0}\r\n", id);
}
catch (Exception ex)
{
    ErrorsText.Text += string.Format("Failed to export {0}: {1}\r\n", 
        new object[] { id, ex.Message });
}

It should also be worth mentioning that the app is a one-time thing and it doesn't matter that it is going to take a few hours (or days :)) to generate all the reports. My main concern is that if it hits the system memory limit, it will stop running.

I'm fine with leaving the line updating the Results TextBox commented out to run this thing, but I would like to know if there is a more memory efficient way of appending data to a TextBox.Text for future projects.

I suspect the reason the memory usage is so large is because textboxes maintain a stack so that the user can undo/redo text. That feature doesn't seem to be required in your case, so try setting IsUndoEnabled to false.

How bad is "if (!this)" in a C++ member function?

53 votes

If I come across old code that does if (!this) return; in an app, how severe a risk is this? Is it a dangerous ticking time bomb that requires an immediate app-wide search and destroy effort, or is it more like a code smell that can be quietly left in place?

I am not planning on writing code that does this, of course. Rather, I've recently discovered something in an old core library used by many pieces of our app.

Imagine a CLookupThingy class has a non-virtual CThingy *CLookupThingy::Lookup( name ) member function. Apparently one of the programmers back in those cowboy days encountered many crashes where NULL CLookupThingy *s were being passed from functions, and rather than fixing hundreds of call sites, he quietly fixed up Lookup():

CThingy *CLookupThingy::Lookup( name ) 
{
   if (!this)
   {
      return NULL;
   }
   // else do the lookup code...
}

// now the above can be used like
CLookupThingy *GetLookup() 
{
  if (notReady()) return NULL;
  // else etc...
}

CThingy *pFoo = GetLookup()->Lookup( "foo" ); // will set pFoo to NULL without crashing

I discovered this gem earlier this week, but now am conflicted as to whether I ought to fix it. This is in a core library used by all of our apps. Several of those apps have already been shipped to millions of customers, and it seems to be working fine; there are no crashes or other bugs from that code. Removing the if !this in the lookup function will mean fixing thousands of call sites that potentially pass NULL; inevitably some will be missed, introducing new bugs that will pop up randomly over the next year of development.

So I'm inclined to leave it alone, unless absolutely necessary.

Given that it is technically undefined behavior, how dangerous is if (!this) in practice? Is it worth man-weeks of labor to fix, or can MSVC and GCC be counted on to safely return?

Our app compiles on MSVC and GCC, and runs on Windows, Ubuntu, and MacOS. Portability to other platforms is irrelevant. The function in question is guaranteed to never be virtual.

Edit: The kind of objective answer I am looking for is something like

  • "Current versions of MSVC and GCC use an ABI where nonvirtual members are really statics with an implicit 'this' parameter; therefore they will safely branch into the function even if 'this' is NULL" or
  • "a forthcoming version of GCC will change the ABI so that even nonvirtual functions require loading a branch target from the class pointer" or
  • "the current GCC 4.5 has an inconsistent ABI where sometimes it compiles nonvirtual members as direct branches with an implicit parameter, and sometimes as class-offset function pointers."

The former means the code is stinky but unlikely to break; the second is something to test after a compiler upgrade; the latter requires immediate action even at high cost.

Clearly this is a latent bug waiting to happen, but right now I'm only concerned with mitigating risk on our specific compilers.

I would leave it alone. This might have been a deliberate choice as an old-fashioned version of the SafeNavigationOperator. As you say, in new code, I wouldn't recommend it, but for existing code, I'd leave it alone. If you do end up modifying it, I'd make sure that all calls to it are well-covered by tests.

Edit to add: you could choose to remove it only in debug versions of your code via:

CThingy *CLookupThingy::Lookup( name ) 
{
#if !defined(DEBUG)
   if (!this)
   {
      return NULL;
   }
#endif
   // else do the lookup code...
}

Thus, it wouldn't break anything on production code, while giving you a chance to test it in debug mode.

Strange behavior when casting a float to int in C#

50 votes

I have the following simple code :

int speed1 = (int)(6.2f * 10);
float tmp = 6.2f * 10;
int speed2 = (int)tmp;

speed1 and speed2 should have the same value, but in fact, I have :

speed1 = 61
speed2 = 62

I know I should probably use Math.Round instead of casting, but I'd like to understand why the values are different.

I looked at the generated bytecode, but except a store and a load, the opcodes are the same.

I also tried the same code in java, and I correctly obtain 62 and 62.

Can someone explain this ?

Edit : In the real code, it's not directly 6.2f * 10 but a function call * a constant. I have the following bytecode :

for speed 1 :

IL_01b3:  ldloc.s    V_8
IL_01b5:  callvirt   instance float32 myPackage.MyClass::getSpeed()
IL_01ba:  ldc.r4     10.
IL_01bf:  mul
IL_01c0:  conv.i4
IL_01c1:  stloc.s    V_9

for speed 2 :

IL_01c3:  ldloc.s    V_8
IL_01c5:  callvirt   instance float32 myPackage.MyClass::getSpeed()
IL_01ca:  ldc.r4     10.
IL_01cf:  mul
IL_01d0:  stloc.s    V_10
IL_01d2:  ldloc.s    V_10
IL_01d4:  conv.i4
IL_01d5:  stloc.s    V_11

we can see that operands are floats and that the only difference is the stloc/ldloc

As for the virtual machine, I tried with Mono/Win7, Mono/MacOS, and .NET/Windows, with the same results

First of all, I assume that you know that 6.2f * 10 is not exactly 62 due to floating point rounding (it's actually the value 61.99999809265137 when expressed as a double) and that your question is only about why two seemingly identical computations result in the wrong value.

The answer is that in the case of (int)(6.2f * 10), you are taking the double value 61.99999809265137 and truncating it to an integer, which yields 61.

In the case of float f = 6.2f * 10, you are taking the double value 61.99999809265137 and rounding to the nearest float, which is 62. You then truncate that float to an integer, and the result is 62.

Exercise: Explain the results of the following sequence of operations.

double d = 6.2f * 10;
int tmp2 = (int)d;
// evaluate tmp2

Update: As noted in the comments, the expression 6.2f * 10 is formally a float since the second parameter has an implicit conversion to float which is better than the implicit conversion to double.

The actual issue is that the compiler is permitted (but not required) to use an intermediate which is higher precision than the formal type. That's why you see different behavior on different systems: In the expression (int)(6.2f * 10), the compiler has the option of keeping the value 6.2f * 10 in a high precision intermediate form before converting to int. If it does, then the result is 61. If it does not, then the result is 62.

In the second example, the explicit assignment to float forces the rounding to take place before the conversion to integer.

C# optional parameters on overridden methods

44 votes

Seems like in .NET Framework there is an issue with optional parameters when you override the method. The output of the code below is: "bbb" "aaa" . But the output I'm expecting is: "bbb" "bbb" .Is there a solution for this. I know it can be solved with method overloading but wondering the reason for this. Also the code works fine in Mono.

class Program
{
    class AAA
    {
        public virtual void MyMethod(string s = "aaa")
        {
            Console.WriteLine(s);
        }

        public virtual void MyMethod2()
        {
            MyMethod();
        }
    }

    class BBB : AAA
    {
        public override void MyMethod(string s = "bbb")
        {
            base.MyMethod(s);
        }

        public override void MyMethod2()
        {
            MyMethod();
        }
    }

    static void Main(string[] args)
    {
        BBB asd = new BBB();
        asd.MyMethod();
        asd.MyMethod2();
    }
}

One thing worth noting here, is that the overridden version is called each time. Change the override to:

public override void MyMethod(string s = "bbb")
{
  Console.Write("derived: ");
  base.MyMethod(s);
}

And the output is:

derived: bbb
derived: aaa

A method in a class can do one or two of the following:

  1. It defines an interface for other code to call.
  2. It defines an implementation to execute when called.

It may not do both, as an abstract method does only the former.

Within BBB the call MyMethod() calls a method defined in AAA.

Because there is an override in BBB, calling that method results in an implementation in BBB being called.

Now, the definition in AAA informs calling code of two things (well, a few others too that don't matter here).

  1. The signature void MyMethod(string).
  2. (For those languages that support it) the default value for the single parameter is "aaa" and therefore when compiling code of the form MyMethod() if no method matching MyMethod() can be found, you may replace it with a call to `MyMethod("aaa").

So, that's what the call in BBB does: The compiler sees a call to MyMethod(), doesn't find a method MyMethod() but does find a method MyMethod(string). It also sees that at the place where it is defined there's a default value of "aaa", so at compile time it changes this to a call to MyMethod("aaa").

From within BBB, AAA is considered the place where AAA's methods are defined, even if overridden in BBB, so that they can be over-ridden.

At run-time, MyMethod(string) is called with the argument "aaa". Because there is a overridden form, that is the form called, but it is not called with "bbb" because that value has nothing to do with the run-time implementation but with the compile-time definition.

Adding this. changes which definition is examined, and so changes what argument is used in the call.

Edit: Why this seems more intuitive to me.

Personally, and since I'm talking of what is intuitive it can only be personal, I find this more intuitive for the following reason:

If I was coding BBB then whether calling or overriding MyMethod(string), I'd think of that as "doing AAA stuff" - it's BBBs take on "doing AAA stuff", but it's doing AAA stuff all the same. Hence whether calling or overriding, I'm going to be aware of the fact that it was AAA that defined MyMethod(string).

If I was calling code that used BBB, I'd think of "using BBB stuff". I might not be very aware of which was originally defined in AAA, and I'd perhaps think of this as merely an implementation detail (if I didn't also use the AAA interface nearby).

The compiler's behaviour matches my intuition, which is why when first reading the question it seemed to me that Mono had a bug. Upon consideration, I can't see how either fulfils the specified behaviour better than the other.

For that matter though, while remaining at a personal level, I'd never use optional parameters with abstract, virtual or overridden methods, and if overriding someone else's that did, I'd match theirs.

Why does 0.ToString("#.##") return an empty string instead of 0.00 or at least 0?

43 votes

Why does 0.ToString("#.##") return an empty string? Shouldn't it be 0.00 or at least 0?

# in the string format indicate that the value is optional. If you wish to get the output 0.00 you need the following:

0.ToString("0.00");

See here for the custom numeric formats that can be passed to this method.

Is the pImpl idiom really used in practice?

41 votes

I am reading the book "Exceptional C++" by Herb Sutter, and in that book I have learned about the pImpl idiom. Basically, the idea is to create an structure for the private objects of a class and dynamically allocate them to decrease the compilation time (and also hide the private implementations in a better manner).

For example:

class X
{
private:
  C c;
  D d;  
} ;

could be changed to:

class X
{
private:
  struct XImpl;
  XImpl* pImpl;       
};

and, in the CPP, the definition:

struct X::XImpl
{
  C c;
  D d;
};

This seems pretty interesting, but I have never seen this kind of approach before, neither in the companies I have worked, nor in open source projects that I've seen the source code. So, I am wondering it this technique is really used in practice?

Should I use it everywhere, or with caution? And is this technique recommended to be used in embedded systems (where the performance is very important)?

So, I am wondering it this technique is really used in practice? Should I use it everywhere, or with caution?

Of course it is used, and in my project, in almost every class, for two reasons you mentioned :

  • data hiding
  • recompilation time is really decreased, since only the source file needs to be rebuild, but not the header, and every file that includes it
  • binary compatibility. Since the class declaration doesn't change, it is safe to just update the library (assuming you are creating a library)

is this technique recommended to be used in embedded systems (where the performance is very important)?

That depends on how powerful your target is. However the only answer to this question is : measure and evaluate what you get and lose.

What is the effect of "list=list" in Python modules?

38 votes

I've seen following code in the python standard library /usr/lib/python2.7/multiprocessing/dummy/__init__.py:

list = list
dict = dict

What does this idiom mean? My best guess is: "let's check if dict and list exist". Is it just legacy code from the ancient times without list and dict in the __builtins__?

And I have another mad guess: optimization of lookup speed moving list from global scope to module scope. Is it sane assumption regarding the idiom? I see, that the assumption is wrong if I apply it to multiprocessing.

Exports. You then can do:

from multiprocessing.dummy import list

... which happens to be the regular list.

Without that line, there would be no list in the package multiprocessing.dummy.

This is sensible to have a uniform API across packages. Say all packages are supposed to offer a list class. Package a chooses to provide a custom implementation, package b however wants to use the list from __builtins__.

powerful/__init__.py:
from powerfulinternals import PowerfulList as list
from simple.simpleinternals import Something as whoo

simple/__init__.py:
list = list
from simpleinternals import Something as whoo

application.py:
try:
  import powerful as api
else:
  import simple as api

mylist = api.list()
woot = api.whoo()

There more reason to do such things. For example to make it explicit what you are using.

list = list

can also be seen as a statement "if you want to change the type of lists I'm using, change it here."

In this particular case, it is the former. The list and dict are exposed as:

manager = multiprocessing.dummy.Manager()
l = manager.list()
d = manager.dict()

And the definition of Manager is:

def Manager():
  return sys.modules[__name__]

i.e. Manager.list = list.

C++ vs Java? Why does the ICC generate slower code than VC?

36 votes

The following is a simple loop in C++. The timer is using QueryPerformanceCounter() and is quite accurate. I found Java to take 60% of the time C++ takes and this can't be?! What am I doing wrong here? Even strict aliasing (which is not included in the code here) doesn't help at all...

long long var = 0;
std::array<int, 1024> arr;
int* arrPtr = arr.data();
CHighPrecisionTimer timer;

for(int i = 0; i < 1024; i++) arrPtr[i] = i;

timer.Start();

for(int i = 0; i < 1024 * 1024 * 10; i++){
    for(int x = 0; x < 1024; x++){
        var += arrPtr[x];
    }
}

timer.Stop();

printf("Unrestricted: %lld us, Value = %lld\n", (Int64)timer.GetElapsed().GetMicros(), var);

This C++ runs through in about 9.5 seconds. I am using the Intel Compiler 12.1 with host processor optimization (specifically for mine) and everything maxed. So this is Intel Compiler at its best! Auto-Parallelization funnily consumes 70% CPU instead of 25% but doesn't get the job done any faster ;)...

Now I use the following Java code for comparison:

    long var = 0;
    int[] arr = new int[1024];

    for(int i = 0; i < 1024; i++) arr[i] = i;

    for(int i = 0; i < 1024 * 1024; i++){
        for(int x = 0; x < 1024; x++){
            var += arr[x];
        }
    }

    long nanos = System.nanoTime();

    for(int i = 0; i < 1024 * 1024 * 10; i++){
        for(int x = 0; x < 1024; x++){
            var += arr[x];
        }
    }

    nanos = (System.nanoTime() - nanos) / 1000;

    System.out.print("Value: " + var + ", Time: " + nanos);

The Java code is invoked with aggressive optimization and the server VM (no debug). It runs in about 7 seconds on my machine (only uses one thread).

Is this a failure of the Intel Compiler or am I just too dumb again?

[EDIT]: Ok now heres the thing... Seems more like a bug in the Intel compiler ^^. [Please note that I am running on the Intel Quadcore Q6600, which is rather old. And it might be that the Intel Compiler performs way better on recent CPUs, like Core i7]

Intel x86 (without vectorization): 3 seconds
MSVC x64: 5 seconds
Java x86/x64 (Oracle Java 7): 7 seconds
Intel x64 (with vectorization): 9.5 seconds
Intel x86 (with vectorization): 9.5 seconds
Intel x64 (without vectorization): 12 seconds
MSVC x86: 15 seconds (uhh)

[EDIT]: Another nice case ;). Consider the following trivial lambda expression

#include <stdio.h>
#include <tchar.h>
#include <Windows.h>
#include <vector>
#include <boost/function.hpp>
#include <boost/lambda/bind.hpp>
#include <boost/typeof/typeof.hpp>

template<class TValue>
struct ArrayList
{
private:
    std::vector<TValue> m_Entries;
public:

    template<class TCallback>
    void Foreach(TCallback inCallback)
    {
        for(int i = 0, size = m_Entries.size(); i < size; i++)
        {
            inCallback(i);
        }
    }

    void Add(TValue inValue)
    {
        m_Entries.push_back(inValue);
    }
};

int _tmain(int argc, _TCHAR* argv[])
{
    auto t = [&]() {};


    ArrayList<int> arr;
    int res = 0;

    for(int i = 0; i < 100; i++)
    {
        arr.Add(i);
    }

    long long freq, t1, t2;

    QueryPerformanceFrequency((LARGE_INTEGER*)&freq);
    QueryPerformanceCounter((LARGE_INTEGER*)&t1);

    for(int i = 0; i < 1000 * 1000 * 10; i++)
    {
        arr.Foreach([&](int v) {
            res += i;
        });
    }

    QueryPerformanceCounter((LARGE_INTEGER*)&t2);

    printf("Time: %lld\n", ((t2-t1) * 1000000) / freq);

    if(res == 4950)
        return -1;

    return 0;
}

Intel compiler shines again:

MSVC x86/x64: 12 milli seconds
Intel x86/x64: 1 second

Uhm?! Well, I guess 90 times slower is not a bad thing...

I am not really sure anymore that this applies: Okay and based on an answer to this thread: The intel compiler is known (and I knew that too but I just didn't think about that they could drop support for their processors) to have terrible performance on processors which are not "known" to the compiler, like AMD processors, and maybe even outdated Intel processors like mine... So if someone with a recent Intel processor could try this out it would be nice ;).

Here is the x64 output of the Intel Compiler:

    std::array<int, 1024> arr;
    int* arrPtr = arr.data();
    QueryPerformanceFrequency((LARGE_INTEGER*)&freq);
000000013F05101D  lea         rcx,[freq]  
000000013F051022  call        qword ptr [__imp_QueryPerformanceFrequency (13F052000h)]  

    for(int i = 0; i < 1024; i++) arrPtr[i] = i;
000000013F051028  mov         eax,4  
000000013F05102D  movd        xmm0,eax  
000000013F051031  xor         eax,eax  
000000013F051033  pshufd      xmm1,xmm0,0  
000000013F051038  movdqa      xmm0,xmmword ptr [__xi_z+28h (13F0521A0h)]  
000000013F051040  movdqa      xmmword ptr arr[rax*4],xmm0  
000000013F051046  paddd       xmm0,xmm1  
000000013F05104A  movdqa      xmmword ptr [rsp+rax*4+60h],xmm0  
000000013F051050  paddd       xmm0,xmm1  
000000013F051054  movdqa      xmmword ptr [rsp+rax*4+70h],xmm0  
000000013F05105A  paddd       xmm0,xmm1  
000000013F05105E  movdqa      xmmword ptr [rsp+rax*4+80h],xmm0  
000000013F051067  add         rax,10h  
000000013F05106B  paddd       xmm0,xmm1  
000000013F05106F  cmp         rax,400h  
000000013F051075  jb          wmain+40h (13F051040h)  

    QueryPerformanceCounter((LARGE_INTEGER*)&t1);
000000013F051077  lea         rcx,[t1]  
000000013F05107C  call        qword ptr [__imp_QueryPerformanceCounter (13F052008h)]  
            var += arrPtr[x];
000000013F051082  movdqa      xmm1,xmmword ptr [__xi_z+38h (13F0521B0h)]  

    for(int i = 0; i < 1024 * 1024 * 10; i++){
000000013F05108A  xor         eax,eax  
            var += arrPtr[x];
000000013F05108C  movdqa      xmm0,xmmword ptr [__xi_z+48h (13F0521C0h)]  
    long long var = 0, freq, t1, t2;
000000013F051094  pxor        xmm6,xmm6  
        for(int x = 0; x < 1024; x++){
000000013F051098  xor         r8d,r8d  
            var += arrPtr[x];
000000013F05109B  lea         rdx,[arr]  
000000013F0510A0  xor         ecx,ecx  
000000013F0510A2  movq        xmm2,mmword ptr arr[rcx]  
        for(int x = 0; x < 1024; x++){
000000013F0510A8  add         r8,8  
            var += arrPtr[x];
000000013F0510AC  punpckldq   xmm2,xmm2  
        for(int x = 0; x < 1024; x++){
000000013F0510B0  add         rcx,20h  
            var += arrPtr[x];
000000013F0510B4  movdqa      xmm3,xmm2  
000000013F0510B8  pand        xmm2,xmm0  
000000013F0510BC  movq        xmm4,mmword ptr [rdx+8]  
000000013F0510C1  psrad       xmm3,1Fh  
000000013F0510C6  punpckldq   xmm4,xmm4  
000000013F0510CA  pand        xmm3,xmm1  
000000013F0510CE  por         xmm3,xmm2  
000000013F0510D2  movdqa      xmm5,xmm4  
000000013F0510D6  movq        xmm2,mmword ptr [rdx+10h]  
000000013F0510DB  psrad       xmm5,1Fh  
000000013F0510E0  punpckldq   xmm2,xmm2  
000000013F0510E4  pand        xmm5,xmm1  
000000013F0510E8  paddq       xmm6,xmm3  
000000013F0510EC  pand        xmm4,xmm0  
000000013F0510F0  movdqa      xmm3,xmm2  
000000013F0510F4  por         xmm5,xmm4  
000000013F0510F8  psrad       xmm3,1Fh  
000000013F0510FD  movq        xmm4,mmword ptr [rdx+18h]  
000000013F051102  pand        xmm3,xmm1  
000000013F051106  punpckldq   xmm4,xmm4  
000000013F05110A  pand        xmm2,xmm0  
000000013F05110E  por         xmm3,xmm2  
000000013F051112  movdqa      xmm2,xmm4  
000000013F051116  paddq       xmm6,xmm5  
000000013F05111A  psrad       xmm2,1Fh  
000000013F05111F  pand        xmm4,xmm0  
000000013F051123  pand        xmm2,xmm1  
        for(int x = 0; x < 1024; x++){
000000013F051127  add         rdx,20h  
            var += arrPtr[x];
000000013F05112B  paddq       xmm6,xmm3  
000000013F05112F  por         xmm2,xmm4  
        for(int x = 0; x < 1024; x++){
000000013F051133  cmp         r8,400h  
            var += arrPtr[x];
000000013F05113A  paddq       xmm6,xmm2  
        for(int x = 0; x < 1024; x++){
000000013F05113E  jb          wmain+0A2h (13F0510A2h)  

    for(int i = 0; i < 1024 * 1024 * 10; i++){
000000013F051144  inc         eax  
000000013F051146  cmp         eax,0A00000h  
000000013F05114B  jb          wmain+98h (13F051098h)  
        }
    }

    QueryPerformanceCounter((LARGE_INTEGER*)&t2);
000000013F051151  lea         rcx,[t2]  
000000013F051156  call        qword ptr [__imp_QueryPerformanceCounter (13F052008h)]  

    printf("Unrestricted: %lld ms, Value = %lld\n", ((t2-t1)*1000/freq), var);
000000013F05115C  mov         r9,qword ptr [t2]  
    long long var = 0, freq, t1, t2;
000000013F051161  movdqa      xmm0,xmm6  

    printf("Unrestricted: %lld ms, Value = %lld\n", ((t2-t1)*1000/freq), var);
000000013F051165  sub         r9,qword ptr [t1]  
000000013F05116A  lea         rcx,[string "Unrestricted: %lld ms, Value = %"... (13F0521D0h)]  
000000013F051171  imul        rax,r9,3E8h  
000000013F051178  cqo  
000000013F05117A  mov         r10,qword ptr [freq]  
000000013F05117F  idiv        rax,r10  
    long long var = 0, freq, t1, t2;
000000013F051182  psrldq      xmm0,8  

    printf("Unrestricted: %lld ms, Value = %lld\n", ((t2-t1)*1000/freq), var);
000000013F051187  mov         rdx,rax  
    long long var = 0, freq, t1, t2;
000000013F05118A  paddq       xmm6,xmm0  
000000013F05118E  movd        r8,xmm6  

    printf("Unrestricted: %lld ms, Value = %lld\n", ((t2-t1)*1000/freq), var);
000000013F051193  call        qword ptr [__imp_printf (13F052108h)]  

And this one is the assembly of the MSVC x64 build:

int _tmain(int argc, _TCHAR* argv[])
{
000000013FF61000  push        rbx  
000000013FF61002  mov         eax,1050h  
000000013FF61007  call        __chkstk (13FF61950h)  
000000013FF6100C  sub         rsp,rax  
000000013FF6100F  mov         rax,qword ptr [__security_cookie (13FF63000h)]  
000000013FF61016  xor         rax,rsp  
000000013FF61019  mov         qword ptr [rsp+1040h],rax  
    long long var = 0, freq, t1, t2;
    std::array<int, 1024> arr;
    int* arrPtr = arr.data();
    QueryPerformanceFrequency((LARGE_INTEGER*)&freq);
000000013FF61021  lea         rcx,[rsp+28h]  
000000013FF61026  xor         ebx,ebx  
000000013FF61028  call        qword ptr [__imp_QueryPerformanceFrequency (13FF62000h)]  

    for(int i = 0; i < 1024; i++) arrPtr[i] = i;
000000013FF6102E  xor         r11d,r11d  
000000013FF61031  lea         rax,[rsp+40h]  
000000013FF61036  mov         dword ptr [rax],r11d  
000000013FF61039  inc         r11d  
000000013FF6103C  add         rax,4  
000000013FF61040  cmp         r11d,400h  
000000013FF61047  jl          wmain+36h (13FF61036h)  

    QueryPerformanceCounter((LARGE_INTEGER*)&t1);
000000013FF61049  lea         rcx,[rsp+20h]  
000000013FF6104E  call        qword ptr [__imp_QueryPerformanceCounter (13FF62008h)]  
000000013FF61054  mov         r11d,0A00000h  
000000013FF6105A  nop         word ptr [rax+rax]  

    for(int i = 0; i < 1024 * 1024 * 10; i++){
        for(int x = 0; x < 1024; x++){
000000013FF61060  xor         edx,edx  
000000013FF61062  xor         r8d,r8d  
000000013FF61065  lea         rcx,[rsp+48h]  
000000013FF6106A  xor         r9d,r9d  
000000013FF6106D  mov         r10d,100h  
000000013FF61073  nop         word ptr [rax+rax]  
            var += arrPtr[x];
000000013FF61080  movsxd      rax,dword ptr [rcx-8]  
000000013FF61084  add         rcx,10h  
000000013FF61088  add         rbx,rax  
000000013FF6108B  movsxd      rax,dword ptr [rcx-14h]  
000000013FF6108F  add         r9,rax  
000000013FF61092  movsxd      rax,dword ptr [rcx-10h]  
000000013FF61096  add         r8,rax  
000000013FF61099  movsxd      rax,dword ptr [rcx-0Ch]  
000000013FF6109D  add         rdx,rax  
000000013FF610A0  dec         r10  
000000013FF610A3  jne         wmain+80h (13FF61080h)  

    for(int i = 0; i < 1024 * 1024 * 10; i++){
        for(int x = 0; x < 1024; x++){
000000013FF610A5  lea         rax,[rdx+r8]  
000000013FF610A9  add         rax,r9  
000000013FF610AC  add         rbx,rax  
000000013FF610AF  dec         r11  
000000013FF610B2  jne         wmain+60h (13FF61060h)  
        }
    }

    QueryPerformanceCounter((LARGE_INTEGER*)&t2);
000000013FF610B4  lea         rcx,[rsp+30h]  
000000013FF610B9  call        qword ptr [__imp_QueryPerformanceCounter (13FF62008h)]  

    printf("Unrestricted: %lld ms, Value = %lld\n", ((t2-t1)*1000/freq), var);
000000013FF610BF  mov         rax,qword ptr [rsp+30h]  
000000013FF610C4  lea         rcx,[string "Unrestricted: %lld ms, Value = %"... (13FF621B0h)]  
000000013FF610CB  sub         rax,qword ptr [rsp+20h]  
000000013FF610D0  mov         r8,rbx  
000000013FF610D3  imul        rax,rax,3E8h  
000000013FF610DA  cqo  
000000013FF610DC  idiv        rax,qword ptr [rsp+28h]  
000000013FF610E1  mov         rdx,rax  
000000013FF610E4  call        qword ptr [__imp_printf (13FF62138h)]  

    return 0;
000000013FF610EA  xor         eax,eax  

Intel Compiler configured without Vectorization, 64-Bit, highest optimizations (this is surprisingly slow, 12 seconds):

000000013FC0102F  lea         rcx,[freq]  

    double var = 0; long long freq, t1, t2;
000000013FC01034  xorps       xmm6,xmm6  
    std::array<double, 1024> arr;
    double* arrPtr = arr.data();
    QueryPerformanceFrequency((LARGE_INTEGER*)&freq);
000000013FC01037  call        qword ptr [__imp_QueryPerformanceFrequency (13FC02000h)]  

    for(int i = 0; i < 1024; i++) arrPtr[i] = i;
000000013FC0103D  mov         eax,2  
000000013FC01042  mov         rdx,100000000h  
000000013FC0104C  movd        xmm0,eax  
000000013FC01050  xor         eax,eax  
000000013FC01052  pshufd      xmm1,xmm0,0  
000000013FC01057  movd        xmm0,rdx  
000000013FC0105C  nop         dword ptr [rax]  
000000013FC01060  cvtdq2pd    xmm2,xmm0  
000000013FC01064  paddd       xmm0,xmm1  
000000013FC01068  cvtdq2pd    xmm3,xmm0  
000000013FC0106C  paddd       xmm0,xmm1  
000000013FC01070  cvtdq2pd    xmm4,xmm0  
000000013FC01074  paddd       xmm0,xmm1  
000000013FC01078  cvtdq2pd    xmm5,xmm0  
000000013FC0107C  movaps      xmmword ptr arr[rax*8],xmm2  
000000013FC01081  paddd       xmm0,xmm1  
000000013FC01085  movaps      xmmword ptr [rsp+rax*8+60h],xmm3  
000000013FC0108A  movaps      xmmword ptr [rsp+rax*8+70h],xmm4  
000000013FC0108F  movaps      xmmword ptr [rsp+rax*8+80h],xmm5  
000000013FC01097  add         rax,8  
000000013FC0109B  cmp         rax,400h  
000000013FC010A1  jb          wmain+60h (13FC01060h)  

    QueryPerformanceCounter((LARGE_INTEGER*)&t1);
000000013FC010A3  lea         rcx,[t1]  
000000013FC010A8  call        qword ptr [__imp_QueryPerformanceCounter (13FC02008h)]  

    for(int i = 0; i < 1024 * 1024 * 10; i++){
000000013FC010AE  xor         eax,eax  
        for(int x = 0; x < 1024; x++){
000000013FC010B0  xor         edx,edx  
            var += arrPtr[x];
000000013FC010B2  lea         ecx,[rdx+rdx]  
        for(int x = 0; x < 1024; x++){
000000013FC010B5  inc         edx  
        for(int x = 0; x < 1024; x++){
000000013FC010B7  cmp         edx,200h  
            var += arrPtr[x];
000000013FC010BD  addsd       xmm6,mmword ptr arr[rcx*8]  
000000013FC010C3  addsd       xmm6,mmword ptr [rsp+rcx*8+58h]  
        for(int x = 0; x < 1024; x++){
000000013FC010C9  jb          wmain+0B2h (13FC010B2h)  

    for(int i = 0; i < 1024 * 1024 * 10; i++){
000000013FC010CB  inc         eax  
000000013FC010CD  cmp         eax,0A00000h  
000000013FC010D2  jb          wmain+0B0h (13FC010B0h)  
        }
    }

    QueryPerformanceCounter((LARGE_INTEGER*)&t2);
000000013FC010D4  lea         rcx,[t2]  
000000013FC010D9  call        qword ptr [__imp_QueryPerformanceCounter (13FC02008h)]  

Intel Compiler without vectorization, 32-Bit and highest optimization (this one clearly is the winner now, runs in about 3 seconds and the assembly looks much better):

00B81088  lea         eax,[t1]  
00B8108C  push        eax  
00B8108D  call        dword ptr [__imp__QueryPerformanceCounter@4 (0B82004h)]  
00B81093  xor         eax,eax  
00B81095  pxor        xmm0,xmm0  
00B81099  movaps      xmm1,xmm0  
        for(int x = 0; x < 1024; x++){
00B8109C  xor         edx,edx  
            var += arrPtr[x];
00B8109E  addpd       xmm0,xmmword ptr arr[edx*8]  
00B810A4  addpd       xmm1,xmmword ptr [esp+edx*8+40h]  
00B810AA  addpd       xmm0,xmmword ptr [esp+edx*8+50h]  
00B810B0  addpd       xmm1,xmmword ptr [esp+edx*8+60h]  
        for(int x = 0; x < 1024; x++){
00B810B6  add         edx,8  
00B810B9  cmp         edx,400h  
00B810BF  jb          wmain+9Eh (0B8109Eh)  

    for(int i = 0; i < 1024 * 1024 * 10; i++){
00B810C1  inc         eax  
00B810C2  cmp         eax,0A00000h  
00B810C7  jb          wmain+9Ch (0B8109Ch)  

    double var = 0; long long freq, t1, t2;
00B810C9  addpd       xmm0,xmm1  
        }
    }

    QueryPerformanceCounter((LARGE_INTEGER*)&t2);
00B810CD  lea         eax,[t2]  
00B810D1  push        eax  
00B810D2  movaps      xmmword ptr [esp+4],xmm0  
00B810D7  call        dword ptr [__imp__QueryPerformanceCounter@4 (0B82004h)]  
00B810DD  movaps      xmm0,xmmword ptr [esp]

tl;dr: What you're seeing here seems to be ICC's failed attempt at vectorizing the loop.

Let's start with MSVC x64:

Here's the critical loop:

$LL3@main:
movsxd  rax, DWORD PTR [rdx-4]
movsxd  rcx, DWORD PTR [rdx-8]
add rdx, 16
add r10, rax
movsxd  rax, DWORD PTR [rdx-16]
add rbx, rcx
add r9, rax
movsxd  rax, DWORD PTR [rdx-12]
add r8, rax
dec r11
jne SHORT $LL3@main

What you see here is the standard loop unrolling by the compiler. MSVC is unrolling to 4 iterations, and splitting the var variable across four registers: r10, rbx, r9, and r8. Then at the end of the loop, these 4 registers are summed up back together.

Here's where the 4 sums are recombined:

lea rax, QWORD PTR [r8+r9]
add rax, r10
add rbx, rax
dec rdi
jne SHORT $LL6@main

Note that MSVC currently does not do automatic vectorization.


Now let's look at part of your ICC output:

000000013F0510A2  movq        xmm2,mmword ptr arr[rcx]  
000000013F0510A8  add         r8,8  
000000013F0510AC  punpckldq   xmm2,xmm2  
000000013F0510B0  add         rcx,20h  
000000013F0510B4  movdqa      xmm3,xmm2  
000000013F0510B8  pand        xmm2,xmm0  
000000013F0510BC  movq        xmm4,mmword ptr [rdx+8]  
000000013F0510C1  psrad       xmm3,1Fh  
000000013F0510C6  punpckldq   xmm4,xmm4  
000000013F0510CA  pand        xmm3,xmm1  
000000013F0510CE  por         xmm3,xmm2  
000000013F0510D2  movdqa      xmm5,xmm4  
000000013F0510D6  movq        xmm2,mmword ptr [rdx+10h]  
000000013F0510DB  psrad       xmm5,1Fh  
000000013F0510E0  punpckldq   xmm2,xmm2  
000000013F0510E4  pand        xmm5,xmm1  
000000013F0510E8  paddq       xmm6,xmm3  

...

What you're seeing here is an attempt by ICC to vectorize this loop. This is done in a similar manner as what MSVC did (splitting into multiple sums), but using SSE registers instead and with two sums per register.

But it turns out that the overhead of vectorization happens to outweigh the benefits of vectorizing.

If we walk these instructions down one-by-one, we can see how ICC tries to vectorize it:

//  Load two ints using a 64-bit load.  {x, y, 0, 0}
movq        xmm2,mmword ptr arr[rcx]  

//  Shuffle the data into this form.
punpckldq   xmm2,xmm2           xmm2 = {x, x, y, y}
movdqa      xmm3,xmm2           xmm3 = {x, x, y, y}

//  Mask out index 1 and 3.
pand        xmm2,xmm0           xmm2 = {x, 0, y, 0}

//  Arithmetic right-shift to copy sign-bit across the word.
psrad       xmm3,1Fh            xmm3 = {sign(x), sign(x), sign(y), sign(y)}

//  Mask out index 0 and 2.
pand        xmm3,xmm1           xmm3 = {0, sign(x), 0, sign(y)}

//  Combine to get sign-extended values.
por         xmm3,xmm2           xmm3 = {x, sign(x), y, sign(y)}
                                xmm3 = {x, y}

//  Add to accumulator...
paddq       xmm6,xmm3

So it's doing some very messy unpacking just to vectorize. The mess comes from needing to sign-extend the 32-bit integers to 64-bit using only SSE instructions.

SSE4.1 actually provides the PMOVSXDQ instruction for this purpose. But either the target machine doesn't support SSE4.1, or ICC isn't smart enough to use it in this case.

But the point is:

The Intel compiler is trying to vectorize the loop. But the overhead added seems to outweigh the benefit of vectorizing it in the first place. Hence why it's slower.


EDIT : Update with OP's results on:

  • ICC x64 no vectorization
  • ICC x86 with vectorization

You changed the data-type to double. So now it's floating-point. There's no more of that ugly sign-fill shifts that were plaguing the integer version.

But since you disabled vectorization for the x64 version, it obviously becomes slower.

ICC x86 with vectorization:

00B8109E  addpd       xmm0,xmmword ptr arr[edx*8]  
00B810A4  addpd       xmm1,xmmword ptr [esp+edx*8+40h]  
00B810AA  addpd       xmm0,xmmword ptr [esp+edx*8+50h]  
00B810B0  addpd       xmm1,xmmword ptr [esp+edx*8+60h]  
00B810B6  add         edx,8  
00B810B9  cmp         edx,400h  
00B810BF  jb          wmain+9Eh (0B8109Eh)  

Not much here - standard vectorization + 4x loop-unrolling.

ICC x64 with no vectorization:

000000013FC010B2  lea         ecx,[rdx+rdx]  
000000013FC010B5  inc         edx  
000000013FC010B7  cmp         edx,200h  
000000013FC010BD  addsd       xmm6,mmword ptr arr[rcx*8]  
000000013FC010C3  addsd       xmm6,mmword ptr [rsp+rcx*8+58h]  
000000013FC010C9  jb          wmain+0B2h (13FC010B2h)  

No vectorization + only 2x loop-unrolling.

All things equal, disabling vectorization will hurt performance in this floating-point case.

Memory leak C++

35 votes

I just wrote a code in C++ which does some string manipulation, but when I ran valgrind over, it shows some possible memory leaks. Debugging the code to granular level I wrote a simple C++ program looking like:

#include<iostream>
using namespace std;
int main()
{
        std::string myname("Is there any leaks");
        exit(0);
}

and running valgrind over it I got:

==20943== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 26 from 1)
==20943== malloc/free: in use at exit: 360,645 bytes in 12,854 blocks.
==20943== malloc/free: 65,451 allocs, 52,597 frees, 2,186,968 bytes allocated.
==20943== For counts of detected errors, rerun with: -v
==20943== searching for pointers to 12,854 not-freed blocks.
==20943== checked 424,628 bytes.
==20943== 
==20943== LEAK SUMMARY:
==20943==    definitely lost: 0 bytes in 0 blocks.
==20943==      possibly lost: 917 bytes in 6 blocks.
==20943==    still reachable: 359,728 bytes in 12,848 blocks.
==20943==         suppressed: 0 bytes in 0 blocks.
==20943== Reachable blocks (those to which a pointer was found) are not shown.
==20943== To see them, rerun with: --show-reachable=yes

Then it struck me that we have forcefully exited (which i performed in my original C++ code as well). Now the problem is that I want to exit from the program as my previous old code waits for the exit status of the new code. For e.g binary a.out waits for the exit status of b.out. Is there any way to avoid the memory leaks, or should i really worry about the memory leaks as the program is already exiting at that point.

This also raise another question for me, is such a code harmful?

#include<stdio.h>
int main()
{
        char *p=(char *)malloc(sizeof(char)*1000);
        exit(0);
}

EDIT : I cannot simply use return, because I wait for my code exit status, and the exit status is not just 0 and is important in my application.

If you insist on using exit():

#include<iostream>
int main(){
    {
        std::string myname("Are there any leaks?");
    }
    exit(0);
}

Also, when you return from main returned value becomes application exit code. So if you want to pass exit code, use return exitCode; in main() instead of exit.

Regarding that part:

This also raise another question for me, is such a code harmful?

Yes, because it is a BAD programming habit.
OS will clean up memory you failed to release, so as long as you haven't managed to eat all system memory and page file, you shouldn't damage OS. However, writing sloppy/leaky code might turn into habit, so relying on OS for cleaning up your mess is a bad idea.