Best c++ questions in January 2012

How to drive C#, C++ or Java compiler to compute 1+2+3+...+1000?

69 votes

In a recent interview, I was asked a really strange question. The interviewer asked me how can I compute 1+2+3+...+1000 just using compiler features. This means that I am not allowed to write a program and execute it, but I should just write a program that could drive the compiler to compute this sum while compilation and print the result when compilation completes. As a hint, he told me that I may use generics and pre-processor features of the compiler. It is possible to use C++, C# or Java compiler. Any ideas???

This question is not related to computing the sum without any loops asked here. In addition, It should be noted that the sum SHOULD be calculated during compilation. Printing just the result using C++ compiler directives is not acceptable.

Edit

Reading more about the posted answers, I found that solving problems during compilation using C++ templates is called metaprogramming. This is a technique that was discovered accidentally by Dr. Erwin Unruh, during the process of standardizing the C++ language. You may read more about this topic on wiki page of meta-programming. It seems that it is possible to write the program in Java using java annotations. Take look at maress's answer below.

Updated Now with improved recursion depth! Works on MSVC10 and GCC without increased depth. :)


Simple compile-time recursion + addition:

template<unsigned Cur, unsigned Goal>
struct adder{
  static unsigned const sub_goal = (Cur + Goal) / 2;
  static unsigned const tmp = adder<Cur, sub_goal>::value;
  static unsigned const value = tmp + adder<sub_goal+1, Goal>::value;
};

template<unsigned Goal>
struct adder<Goal, Goal>{
  static unsigned const value = Goal;
};

Testcode:

template<unsigned Start>
struct sum_from{
  template<unsigned Goal>
  struct to{
    template<unsigned N>
    struct equals;

    typedef equals<adder<Start, Goal>::value> result;
  };
};

int main(){
  sum_from<1>::to<1000>::result();
}

Output for GCC:

error: declaration of ‘struct sum_from<1u>::to<1000u>::equals<500500u>’

Live example on Ideone.

Output for MSVC10:

error C2514: 'sum_from<Start>::to<Goal>::equals<Result>' : class has no constructors
      with
      [
          Start=1,
          Goal=1000,
          Result=500500
      ]

How bad is "if (!this)" in a C++ member function?

53 votes

If I come across old code that does if (!this) return; in an app, how severe a risk is this? Is it a dangerous ticking time bomb that requires an immediate app-wide search and destroy effort, or is it more like a code smell that can be quietly left in place?

I am not planning on writing code that does this, of course. Rather, I've recently discovered something in an old core library used by many pieces of our app.

Imagine a CLookupThingy class has a non-virtual CThingy *CLookupThingy::Lookup( name ) member function. Apparently one of the programmers back in those cowboy days encountered many crashes where NULL CLookupThingy *s were being passed from functions, and rather than fixing hundreds of call sites, he quietly fixed up Lookup():

CThingy *CLookupThingy::Lookup( name ) 
{
   if (!this)
   {
      return NULL;
   }
   // else do the lookup code...
}

// now the above can be used like
CLookupThingy *GetLookup() 
{
  if (notReady()) return NULL;
  // else etc...
}

CThingy *pFoo = GetLookup()->Lookup( "foo" ); // will set pFoo to NULL without crashing

I discovered this gem earlier this week, but now am conflicted as to whether I ought to fix it. This is in a core library used by all of our apps. Several of those apps have already been shipped to millions of customers, and it seems to be working fine; there are no crashes or other bugs from that code. Removing the if !this in the lookup function will mean fixing thousands of call sites that potentially pass NULL; inevitably some will be missed, introducing new bugs that will pop up randomly over the next year of development.

So I'm inclined to leave it alone, unless absolutely necessary.

Given that it is technically undefined behavior, how dangerous is if (!this) in practice? Is it worth man-weeks of labor to fix, or can MSVC and GCC be counted on to safely return?

Our app compiles on MSVC and GCC, and runs on Windows, Ubuntu, and MacOS. Portability to other platforms is irrelevant. The function in question is guaranteed to never be virtual.

Edit: The kind of objective answer I am looking for is something like

  • "Current versions of MSVC and GCC use an ABI where nonvirtual members are really statics with an implicit 'this' parameter; therefore they will safely branch into the function even if 'this' is NULL" or
  • "a forthcoming version of GCC will change the ABI so that even nonvirtual functions require loading a branch target from the class pointer" or
  • "the current GCC 4.5 has an inconsistent ABI where sometimes it compiles nonvirtual members as direct branches with an implicit parameter, and sometimes as class-offset function pointers."

The former means the code is stinky but unlikely to break; the second is something to test after a compiler upgrade; the latter requires immediate action even at high cost.

Clearly this is a latent bug waiting to happen, but right now I'm only concerned with mitigating risk on our specific compilers.

I would leave it alone. This might have been a deliberate choice as an old-fashioned version of the SafeNavigationOperator. As you say, in new code, I wouldn't recommend it, but for existing code, I'd leave it alone. If you do end up modifying it, I'd make sure that all calls to it are well-covered by tests.

Edit to add: you could choose to remove it only in debug versions of your code via:

CThingy *CLookupThingy::Lookup( name ) 
{
#if !defined(DEBUG)
   if (!this)
   {
      return NULL;
   }
#endif
   // else do the lookup code...
}

Thus, it wouldn't break anything on production code, while giving you a chance to test it in debug mode.

Is the pImpl idiom really used in practice?

41 votes

I am reading the book "Exceptional C++" by Herb Sutter, and in that book I have learned about the pImpl idiom. Basically, the idea is to create an structure for the private objects of a class and dynamically allocate them to decrease the compilation time (and also hide the private implementations in a better manner).

For example:

class X
{
private:
  C c;
  D d;  
} ;

could be changed to:

class X
{
private:
  struct XImpl;
  XImpl* pImpl;       
};

and, in the CPP, the definition:

struct X::XImpl
{
  C c;
  D d;
};

This seems pretty interesting, but I have never seen this kind of approach before, neither in the companies I have worked, nor in open source projects that I've seen the source code. So, I am wondering it this technique is really used in practice?

Should I use it everywhere, or with caution? And is this technique recommended to be used in embedded systems (where the performance is very important)?

So, I am wondering it this technique is really used in practice? Should I use it everywhere, or with caution?

Of course it is used, and in my project, in almost every class, for two reasons you mentioned :

  • data hiding
  • recompilation time is really decreased, since only the source file needs to be rebuild, but not the header, and every file that includes it
  • binary compatibility. Since the class declaration doesn't change, it is safe to just update the library (assuming you are creating a library)

is this technique recommended to be used in embedded systems (where the performance is very important)?

That depends on how powerful your target is. However the only answer to this question is : measure and evaluate what you get and lose.

C++ vs Java? Why does the ICC generate slower code than VC?

36 votes

The following is a simple loop in C++. The timer is using QueryPerformanceCounter() and is quite accurate. I found Java to take 60% of the time C++ takes and this can't be?! What am I doing wrong here? Even strict aliasing (which is not included in the code here) doesn't help at all...

long long var = 0;
std::array<int, 1024> arr;
int* arrPtr = arr.data();
CHighPrecisionTimer timer;

for(int i = 0; i < 1024; i++) arrPtr[i] = i;

timer.Start();

for(int i = 0; i < 1024 * 1024 * 10; i++){
    for(int x = 0; x < 1024; x++){
        var += arrPtr[x];
    }
}

timer.Stop();

printf("Unrestricted: %lld us, Value = %lld\n", (Int64)timer.GetElapsed().GetMicros(), var);

This C++ runs through in about 9.5 seconds. I am using the Intel Compiler 12.1 with host processor optimization (specifically for mine) and everything maxed. So this is Intel Compiler at its best! Auto-Parallelization funnily consumes 70% CPU instead of 25% but doesn't get the job done any faster ;)...

Now I use the following Java code for comparison:

    long var = 0;
    int[] arr = new int[1024];

    for(int i = 0; i < 1024; i++) arr[i] = i;

    for(int i = 0; i < 1024 * 1024; i++){
        for(int x = 0; x < 1024; x++){
            var += arr[x];
        }
    }

    long nanos = System.nanoTime();

    for(int i = 0; i < 1024 * 1024 * 10; i++){
        for(int x = 0; x < 1024; x++){
            var += arr[x];
        }
    }

    nanos = (System.nanoTime() - nanos) / 1000;

    System.out.print("Value: " + var + ", Time: " + nanos);

The Java code is invoked with aggressive optimization and the server VM (no debug). It runs in about 7 seconds on my machine (only uses one thread).

Is this a failure of the Intel Compiler or am I just too dumb again?

[EDIT]: Ok now heres the thing... Seems more like a bug in the Intel compiler ^^. [Please note that I am running on the Intel Quadcore Q6600, which is rather old. And it might be that the Intel Compiler performs way better on recent CPUs, like Core i7]

Intel x86 (without vectorization): 3 seconds
MSVC x64: 5 seconds
Java x86/x64 (Oracle Java 7): 7 seconds
Intel x64 (with vectorization): 9.5 seconds
Intel x86 (with vectorization): 9.5 seconds
Intel x64 (without vectorization): 12 seconds
MSVC x86: 15 seconds (uhh)

[EDIT]: Another nice case ;). Consider the following trivial lambda expression

#include <stdio.h>
#include <tchar.h>
#include <Windows.h>
#include <vector>
#include <boost/function.hpp>
#include <boost/lambda/bind.hpp>
#include <boost/typeof/typeof.hpp>

template<class TValue>
struct ArrayList
{
private:
    std::vector<TValue> m_Entries;
public:

    template<class TCallback>
    void Foreach(TCallback inCallback)
    {
        for(int i = 0, size = m_Entries.size(); i < size; i++)
        {
            inCallback(i);
        }
    }

    void Add(TValue inValue)
    {
        m_Entries.push_back(inValue);
    }
};

int _tmain(int argc, _TCHAR* argv[])
{
    auto t = [&]() {};


    ArrayList<int> arr;
    int res = 0;

    for(int i = 0; i < 100; i++)
    {
        arr.Add(i);
    }

    long long freq, t1, t2;

    QueryPerformanceFrequency((LARGE_INTEGER*)&freq);
    QueryPerformanceCounter((LARGE_INTEGER*)&t1);

    for(int i = 0; i < 1000 * 1000 * 10; i++)
    {
        arr.Foreach([&](int v) {
            res += i;
        });
    }

    QueryPerformanceCounter((LARGE_INTEGER*)&t2);

    printf("Time: %lld\n", ((t2-t1) * 1000000) / freq);

    if(res == 4950)
        return -1;

    return 0;
}

Intel compiler shines again:

MSVC x86/x64: 12 milli seconds
Intel x86/x64: 1 second

Uhm?! Well, I guess 90 times slower is not a bad thing...

I am not really sure anymore that this applies: Okay and based on an answer to this thread: The intel compiler is known (and I knew that too but I just didn't think about that they could drop support for their processors) to have terrible performance on processors which are not "known" to the compiler, like AMD processors, and maybe even outdated Intel processors like mine... So if someone with a recent Intel processor could try this out it would be nice ;).

Here is the x64 output of the Intel Compiler:

    std::array<int, 1024> arr;
    int* arrPtr = arr.data();
    QueryPerformanceFrequency((LARGE_INTEGER*)&freq);
000000013F05101D  lea         rcx,[freq]  
000000013F051022  call        qword ptr [__imp_QueryPerformanceFrequency (13F052000h)]  

    for(int i = 0; i < 1024; i++) arrPtr[i] = i;
000000013F051028  mov         eax,4  
000000013F05102D  movd        xmm0,eax  
000000013F051031  xor         eax,eax  
000000013F051033  pshufd      xmm1,xmm0,0  
000000013F051038  movdqa      xmm0,xmmword ptr [__xi_z+28h (13F0521A0h)]  
000000013F051040  movdqa      xmmword ptr arr[rax*4],xmm0  
000000013F051046  paddd       xmm0,xmm1  
000000013F05104A  movdqa      xmmword ptr [rsp+rax*4+60h],xmm0  
000000013F051050  paddd       xmm0,xmm1  
000000013F051054  movdqa      xmmword ptr [rsp+rax*4+70h],xmm0  
000000013F05105A  paddd       xmm0,xmm1  
000000013F05105E  movdqa      xmmword ptr [rsp+rax*4+80h],xmm0  
000000013F051067  add         rax,10h  
000000013F05106B  paddd       xmm0,xmm1  
000000013F05106F  cmp         rax,400h  
000000013F051075  jb          wmain+40h (13F051040h)  

    QueryPerformanceCounter((LARGE_INTEGER*)&t1);
000000013F051077  lea         rcx,[t1]  
000000013F05107C  call        qword ptr [__imp_QueryPerformanceCounter (13F052008h)]  
            var += arrPtr[x];
000000013F051082  movdqa      xmm1,xmmword ptr [__xi_z+38h (13F0521B0h)]  

    for(int i = 0; i < 1024 * 1024 * 10; i++){
000000013F05108A  xor         eax,eax  
            var += arrPtr[x];
000000013F05108C  movdqa      xmm0,xmmword ptr [__xi_z+48h (13F0521C0h)]  
    long long var = 0, freq, t1, t2;
000000013F051094  pxor        xmm6,xmm6  
        for(int x = 0; x < 1024; x++){
000000013F051098  xor         r8d,r8d  
            var += arrPtr[x];
000000013F05109B  lea         rdx,[arr]  
000000013F0510A0  xor         ecx,ecx  
000000013F0510A2  movq        xmm2,mmword ptr arr[rcx]  
        for(int x = 0; x < 1024; x++){
000000013F0510A8  add         r8,8  
            var += arrPtr[x];
000000013F0510AC  punpckldq   xmm2,xmm2  
        for(int x = 0; x < 1024; x++){
000000013F0510B0  add         rcx,20h  
            var += arrPtr[x];
000000013F0510B4  movdqa      xmm3,xmm2  
000000013F0510B8  pand        xmm2,xmm0  
000000013F0510BC  movq        xmm4,mmword ptr [rdx+8]  
000000013F0510C1  psrad       xmm3,1Fh  
000000013F0510C6  punpckldq   xmm4,xmm4  
000000013F0510CA  pand        xmm3,xmm1  
000000013F0510CE  por         xmm3,xmm2  
000000013F0510D2  movdqa      xmm5,xmm4  
000000013F0510D6  movq        xmm2,mmword ptr [rdx+10h]  
000000013F0510DB  psrad       xmm5,1Fh  
000000013F0510E0  punpckldq   xmm2,xmm2  
000000013F0510E4  pand        xmm5,xmm1  
000000013F0510E8  paddq       xmm6,xmm3  
000000013F0510EC  pand        xmm4,xmm0  
000000013F0510F0  movdqa      xmm3,xmm2  
000000013F0510F4  por         xmm5,xmm4  
000000013F0510F8  psrad       xmm3,1Fh  
000000013F0510FD  movq        xmm4,mmword ptr [rdx+18h]  
000000013F051102  pand        xmm3,xmm1  
000000013F051106  punpckldq   xmm4,xmm4  
000000013F05110A  pand        xmm2,xmm0  
000000013F05110E  por         xmm3,xmm2  
000000013F051112  movdqa      xmm2,xmm4  
000000013F051116  paddq       xmm6,xmm5  
000000013F05111A  psrad       xmm2,1Fh  
000000013F05111F  pand        xmm4,xmm0  
000000013F051123  pand        xmm2,xmm1  
        for(int x = 0; x < 1024; x++){
000000013F051127  add         rdx,20h  
            var += arrPtr[x];
000000013F05112B  paddq       xmm6,xmm3  
000000013F05112F  por         xmm2,xmm4  
        for(int x = 0; x < 1024; x++){
000000013F051133  cmp         r8,400h  
            var += arrPtr[x];
000000013F05113A  paddq       xmm6,xmm2  
        for(int x = 0; x < 1024; x++){
000000013F05113E  jb          wmain+0A2h (13F0510A2h)  

    for(int i = 0; i < 1024 * 1024 * 10; i++){
000000013F051144  inc         eax  
000000013F051146  cmp         eax,0A00000h  
000000013F05114B  jb          wmain+98h (13F051098h)  
        }
    }

    QueryPerformanceCounter((LARGE_INTEGER*)&t2);
000000013F051151  lea         rcx,[t2]  
000000013F051156  call        qword ptr [__imp_QueryPerformanceCounter (13F052008h)]  

    printf("Unrestricted: %lld ms, Value = %lld\n", ((t2-t1)*1000/freq), var);
000000013F05115C  mov         r9,qword ptr [t2]  
    long long var = 0, freq, t1, t2;
000000013F051161  movdqa      xmm0,xmm6  

    printf("Unrestricted: %lld ms, Value = %lld\n", ((t2-t1)*1000/freq), var);
000000013F051165  sub         r9,qword ptr [t1]  
000000013F05116A  lea         rcx,[string "Unrestricted: %lld ms, Value = %"... (13F0521D0h)]  
000000013F051171  imul        rax,r9,3E8h  
000000013F051178  cqo  
000000013F05117A  mov         r10,qword ptr [freq]  
000000013F05117F  idiv        rax,r10  
    long long var = 0, freq, t1, t2;
000000013F051182  psrldq      xmm0,8  

    printf("Unrestricted: %lld ms, Value = %lld\n", ((t2-t1)*1000/freq), var);
000000013F051187  mov         rdx,rax  
    long long var = 0, freq, t1, t2;
000000013F05118A  paddq       xmm6,xmm0  
000000013F05118E  movd        r8,xmm6  

    printf("Unrestricted: %lld ms, Value = %lld\n", ((t2-t1)*1000/freq), var);
000000013F051193  call        qword ptr [__imp_printf (13F052108h)]  

And this one is the assembly of the MSVC x64 build:

int _tmain(int argc, _TCHAR* argv[])
{
000000013FF61000  push        rbx  
000000013FF61002  mov         eax,1050h  
000000013FF61007  call        __chkstk (13FF61950h)  
000000013FF6100C  sub         rsp,rax  
000000013FF6100F  mov         rax,qword ptr [__security_cookie (13FF63000h)]  
000000013FF61016  xor         rax,rsp  
000000013FF61019  mov         qword ptr [rsp+1040h],rax  
    long long var = 0, freq, t1, t2;
    std::array<int, 1024> arr;
    int* arrPtr = arr.data();
    QueryPerformanceFrequency((LARGE_INTEGER*)&freq);
000000013FF61021  lea         rcx,[rsp+28h]  
000000013FF61026  xor         ebx,ebx  
000000013FF61028  call        qword ptr [__imp_QueryPerformanceFrequency (13FF62000h)]  

    for(int i = 0; i < 1024; i++) arrPtr[i] = i;
000000013FF6102E  xor         r11d,r11d  
000000013FF61031  lea         rax,[rsp+40h]  
000000013FF61036  mov         dword ptr [rax],r11d  
000000013FF61039  inc         r11d  
000000013FF6103C  add         rax,4  
000000013FF61040  cmp         r11d,400h  
000000013FF61047  jl          wmain+36h (13FF61036h)  

    QueryPerformanceCounter((LARGE_INTEGER*)&t1);
000000013FF61049  lea         rcx,[rsp+20h]  
000000013FF6104E  call        qword ptr [__imp_QueryPerformanceCounter (13FF62008h)]  
000000013FF61054  mov         r11d,0A00000h  
000000013FF6105A  nop         word ptr [rax+rax]  

    for(int i = 0; i < 1024 * 1024 * 10; i++){
        for(int x = 0; x < 1024; x++){
000000013FF61060  xor         edx,edx  
000000013FF61062  xor         r8d,r8d  
000000013FF61065  lea         rcx,[rsp+48h]  
000000013FF6106A  xor         r9d,r9d  
000000013FF6106D  mov         r10d,100h  
000000013FF61073  nop         word ptr [rax+rax]  
            var += arrPtr[x];
000000013FF61080  movsxd      rax,dword ptr [rcx-8]  
000000013FF61084  add         rcx,10h  
000000013FF61088  add         rbx,rax  
000000013FF6108B  movsxd      rax,dword ptr [rcx-14h]  
000000013FF6108F  add         r9,rax  
000000013FF61092  movsxd      rax,dword ptr [rcx-10h]  
000000013FF61096  add         r8,rax  
000000013FF61099  movsxd      rax,dword ptr [rcx-0Ch]  
000000013FF6109D  add         rdx,rax  
000000013FF610A0  dec         r10  
000000013FF610A3  jne         wmain+80h (13FF61080h)  

    for(int i = 0; i < 1024 * 1024 * 10; i++){
        for(int x = 0; x < 1024; x++){
000000013FF610A5  lea         rax,[rdx+r8]  
000000013FF610A9  add         rax,r9  
000000013FF610AC  add         rbx,rax  
000000013FF610AF  dec         r11  
000000013FF610B2  jne         wmain+60h (13FF61060h)  
        }
    }

    QueryPerformanceCounter((LARGE_INTEGER*)&t2);
000000013FF610B4  lea         rcx,[rsp+30h]  
000000013FF610B9  call        qword ptr [__imp_QueryPerformanceCounter (13FF62008h)]  

    printf("Unrestricted: %lld ms, Value = %lld\n", ((t2-t1)*1000/freq), var);
000000013FF610BF  mov         rax,qword ptr [rsp+30h]  
000000013FF610C4  lea         rcx,[string "Unrestricted: %lld ms, Value = %"... (13FF621B0h)]  
000000013FF610CB  sub         rax,qword ptr [rsp+20h]  
000000013FF610D0  mov         r8,rbx  
000000013FF610D3  imul        rax,rax,3E8h  
000000013FF610DA  cqo  
000000013FF610DC  idiv        rax,qword ptr [rsp+28h]  
000000013FF610E1  mov         rdx,rax  
000000013FF610E4  call        qword ptr [__imp_printf (13FF62138h)]  

    return 0;
000000013FF610EA  xor         eax,eax  

Intel Compiler configured without Vectorization, 64-Bit, highest optimizations (this is surprisingly slow, 12 seconds):

000000013FC0102F  lea         rcx,[freq]  

    double var = 0; long long freq, t1, t2;
000000013FC01034  xorps       xmm6,xmm6  
    std::array<double, 1024> arr;
    double* arrPtr = arr.data();
    QueryPerformanceFrequency((LARGE_INTEGER*)&freq);
000000013FC01037  call        qword ptr [__imp_QueryPerformanceFrequency (13FC02000h)]  

    for(int i = 0; i < 1024; i++) arrPtr[i] = i;
000000013FC0103D  mov         eax,2  
000000013FC01042  mov         rdx,100000000h  
000000013FC0104C  movd        xmm0,eax  
000000013FC01050  xor         eax,eax  
000000013FC01052  pshufd      xmm1,xmm0,0  
000000013FC01057  movd        xmm0,rdx  
000000013FC0105C  nop         dword ptr [rax]  
000000013FC01060  cvtdq2pd    xmm2,xmm0  
000000013FC01064  paddd       xmm0,xmm1  
000000013FC01068  cvtdq2pd    xmm3,xmm0  
000000013FC0106C  paddd       xmm0,xmm1  
000000013FC01070  cvtdq2pd    xmm4,xmm0  
000000013FC01074  paddd       xmm0,xmm1  
000000013FC01078  cvtdq2pd    xmm5,xmm0  
000000013FC0107C  movaps      xmmword ptr arr[rax*8],xmm2  
000000013FC01081  paddd       xmm0,xmm1  
000000013FC01085  movaps      xmmword ptr [rsp+rax*8+60h],xmm3  
000000013FC0108A  movaps      xmmword ptr [rsp+rax*8+70h],xmm4  
000000013FC0108F  movaps      xmmword ptr [rsp+rax*8+80h],xmm5  
000000013FC01097  add         rax,8  
000000013FC0109B  cmp         rax,400h  
000000013FC010A1  jb          wmain+60h (13FC01060h)  

    QueryPerformanceCounter((LARGE_INTEGER*)&t1);
000000013FC010A3  lea         rcx,[t1]  
000000013FC010A8  call        qword ptr [__imp_QueryPerformanceCounter (13FC02008h)]  

    for(int i = 0; i < 1024 * 1024 * 10; i++){
000000013FC010AE  xor         eax,eax  
        for(int x = 0; x < 1024; x++){
000000013FC010B0  xor         edx,edx  
            var += arrPtr[x];
000000013FC010B2  lea         ecx,[rdx+rdx]  
        for(int x = 0; x < 1024; x++){
000000013FC010B5  inc         edx  
        for(int x = 0; x < 1024; x++){
000000013FC010B7  cmp         edx,200h  
            var += arrPtr[x];
000000013FC010BD  addsd       xmm6,mmword ptr arr[rcx*8]  
000000013FC010C3  addsd       xmm6,mmword ptr [rsp+rcx*8+58h]  
        for(int x = 0; x < 1024; x++){
000000013FC010C9  jb          wmain+0B2h (13FC010B2h)  

    for(int i = 0; i < 1024 * 1024 * 10; i++){
000000013FC010CB  inc         eax  
000000013FC010CD  cmp         eax,0A00000h  
000000013FC010D2  jb          wmain+0B0h (13FC010B0h)  
        }
    }

    QueryPerformanceCounter((LARGE_INTEGER*)&t2);
000000013FC010D4  lea         rcx,[t2]  
000000013FC010D9  call        qword ptr [__imp_QueryPerformanceCounter (13FC02008h)]  

Intel Compiler without vectorization, 32-Bit and highest optimization (this one clearly is the winner now, runs in about 3 seconds and the assembly looks much better):

00B81088  lea         eax,[t1]  
00B8108C  push        eax  
00B8108D  call        dword ptr [__imp__QueryPerformanceCounter@4 (0B82004h)]  
00B81093  xor         eax,eax  
00B81095  pxor        xmm0,xmm0  
00B81099  movaps      xmm1,xmm0  
        for(int x = 0; x < 1024; x++){
00B8109C  xor         edx,edx  
            var += arrPtr[x];
00B8109E  addpd       xmm0,xmmword ptr arr[edx*8]  
00B810A4  addpd       xmm1,xmmword ptr [esp+edx*8+40h]  
00B810AA  addpd       xmm0,xmmword ptr [esp+edx*8+50h]  
00B810B0  addpd       xmm1,xmmword ptr [esp+edx*8+60h]  
        for(int x = 0; x < 1024; x++){
00B810B6  add         edx,8  
00B810B9  cmp         edx,400h  
00B810BF  jb          wmain+9Eh (0B8109Eh)  

    for(int i = 0; i < 1024 * 1024 * 10; i++){
00B810C1  inc         eax  
00B810C2  cmp         eax,0A00000h  
00B810C7  jb          wmain+9Ch (0B8109Ch)  

    double var = 0; long long freq, t1, t2;
00B810C9  addpd       xmm0,xmm1  
        }
    }

    QueryPerformanceCounter((LARGE_INTEGER*)&t2);
00B810CD  lea         eax,[t2]  
00B810D1  push        eax  
00B810D2  movaps      xmmword ptr [esp+4],xmm0  
00B810D7  call        dword ptr [__imp__QueryPerformanceCounter@4 (0B82004h)]  
00B810DD  movaps      xmm0,xmmword ptr [esp]

tl;dr: What you're seeing here seems to be ICC's failed attempt at vectorizing the loop.

Let's start with MSVC x64:

Here's the critical loop:

$LL3@main:
movsxd  rax, DWORD PTR [rdx-4]
movsxd  rcx, DWORD PTR [rdx-8]
add rdx, 16
add r10, rax
movsxd  rax, DWORD PTR [rdx-16]
add rbx, rcx
add r9, rax
movsxd  rax, DWORD PTR [rdx-12]
add r8, rax
dec r11
jne SHORT $LL3@main

What you see here is the standard loop unrolling by the compiler. MSVC is unrolling to 4 iterations, and splitting the var variable across four registers: r10, rbx, r9, and r8. Then at the end of the loop, these 4 registers are summed up back together.

Here's where the 4 sums are recombined:

lea rax, QWORD PTR [r8+r9]
add rax, r10
add rbx, rax
dec rdi
jne SHORT $LL6@main

Note that MSVC currently does not do automatic vectorization.


Now let's look at part of your ICC output:

000000013F0510A2  movq        xmm2,mmword ptr arr[rcx]  
000000013F0510A8  add         r8,8  
000000013F0510AC  punpckldq   xmm2,xmm2  
000000013F0510B0  add         rcx,20h  
000000013F0510B4  movdqa      xmm3,xmm2  
000000013F0510B8  pand        xmm2,xmm0  
000000013F0510BC  movq        xmm4,mmword ptr [rdx+8]  
000000013F0510C1  psrad       xmm3,1Fh  
000000013F0510C6  punpckldq   xmm4,xmm4  
000000013F0510CA  pand        xmm3,xmm1  
000000013F0510CE  por         xmm3,xmm2  
000000013F0510D2  movdqa      xmm5,xmm4  
000000013F0510D6  movq        xmm2,mmword ptr [rdx+10h]  
000000013F0510DB  psrad       xmm5,1Fh  
000000013F0510E0  punpckldq   xmm2,xmm2  
000000013F0510E4  pand        xmm5,xmm1  
000000013F0510E8  paddq       xmm6,xmm3  

...

What you're seeing here is an attempt by ICC to vectorize this loop. This is done in a similar manner as what MSVC did (splitting into multiple sums), but using SSE registers instead and with two sums per register.

But it turns out that the overhead of vectorization happens to outweigh the benefits of vectorizing.

If we walk these instructions down one-by-one, we can see how ICC tries to vectorize it:

//  Load two ints using a 64-bit load.  {x, y, 0, 0}
movq        xmm2,mmword ptr arr[rcx]  

//  Shuffle the data into this form.
punpckldq   xmm2,xmm2           xmm2 = {x, x, y, y}
movdqa      xmm3,xmm2           xmm3 = {x, x, y, y}

//  Mask out index 1 and 3.
pand        xmm2,xmm0           xmm2 = {x, 0, y, 0}

//  Arithmetic right-shift to copy sign-bit across the word.
psrad       xmm3,1Fh            xmm3 = {sign(x), sign(x), sign(y), sign(y)}

//  Mask out index 0 and 2.
pand        xmm3,xmm1           xmm3 = {0, sign(x), 0, sign(y)}

//  Combine to get sign-extended values.
por         xmm3,xmm2           xmm3 = {x, sign(x), y, sign(y)}
                                xmm3 = {x, y}

//  Add to accumulator...
paddq       xmm6,xmm3

So it's doing some very messy unpacking just to vectorize. The mess comes from needing to sign-extend the 32-bit integers to 64-bit using only SSE instructions.

SSE4.1 actually provides the PMOVSXDQ instruction for this purpose. But either the target machine doesn't support SSE4.1, or ICC isn't smart enough to use it in this case.

But the point is:

The Intel compiler is trying to vectorize the loop. But the overhead added seems to outweigh the benefit of vectorizing it in the first place. Hence why it's slower.


EDIT : Update with OP's results on:

  • ICC x64 no vectorization
  • ICC x86 with vectorization

You changed the data-type to double. So now it's floating-point. There's no more of that ugly sign-fill shifts that were plaguing the integer version.

But since you disabled vectorization for the x64 version, it obviously becomes slower.

ICC x86 with vectorization:

00B8109E  addpd       xmm0,xmmword ptr arr[edx*8]  
00B810A4  addpd       xmm1,xmmword ptr [esp+edx*8+40h]  
00B810AA  addpd       xmm0,xmmword ptr [esp+edx*8+50h]  
00B810B0  addpd       xmm1,xmmword ptr [esp+edx*8+60h]  
00B810B6  add         edx,8  
00B810B9  cmp         edx,400h  
00B810BF  jb          wmain+9Eh (0B8109Eh)  

Not much here - standard vectorization + 4x loop-unrolling.

ICC x64 with no vectorization:

000000013FC010B2  lea         ecx,[rdx+rdx]  
000000013FC010B5  inc         edx  
000000013FC010B7  cmp         edx,200h  
000000013FC010BD  addsd       xmm6,mmword ptr arr[rcx*8]  
000000013FC010C3  addsd       xmm6,mmword ptr [rsp+rcx*8+58h]  
000000013FC010C9  jb          wmain+0B2h (13FC010B2h)  

No vectorization + only 2x loop-unrolling.

All things equal, disabling vectorization will hurt performance in this floating-point case.

Memory leak C++

35 votes

I just wrote a code in C++ which does some string manipulation, but when I ran valgrind over, it shows some possible memory leaks. Debugging the code to granular level I wrote a simple C++ program looking like:

#include<iostream>
using namespace std;
int main()
{
        std::string myname("Is there any leaks");
        exit(0);
}

and running valgrind over it I got:

==20943== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 26 from 1)
==20943== malloc/free: in use at exit: 360,645 bytes in 12,854 blocks.
==20943== malloc/free: 65,451 allocs, 52,597 frees, 2,186,968 bytes allocated.
==20943== For counts of detected errors, rerun with: -v
==20943== searching for pointers to 12,854 not-freed blocks.
==20943== checked 424,628 bytes.
==20943== 
==20943== LEAK SUMMARY:
==20943==    definitely lost: 0 bytes in 0 blocks.
==20943==      possibly lost: 917 bytes in 6 blocks.
==20943==    still reachable: 359,728 bytes in 12,848 blocks.
==20943==         suppressed: 0 bytes in 0 blocks.
==20943== Reachable blocks (those to which a pointer was found) are not shown.
==20943== To see them, rerun with: --show-reachable=yes

Then it struck me that we have forcefully exited (which i performed in my original C++ code as well). Now the problem is that I want to exit from the program as my previous old code waits for the exit status of the new code. For e.g binary a.out waits for the exit status of b.out. Is there any way to avoid the memory leaks, or should i really worry about the memory leaks as the program is already exiting at that point.

This also raise another question for me, is such a code harmful?

#include<stdio.h>
int main()
{
        char *p=(char *)malloc(sizeof(char)*1000);
        exit(0);
}

EDIT : I cannot simply use return, because I wait for my code exit status, and the exit status is not just 0 and is important in my application.

If you insist on using exit():

#include<iostream>
int main(){
    {
        std::string myname("Are there any leaks?");
    }
    exit(0);
}

Also, when you return from main returned value becomes application exit code. So if you want to pass exit code, use return exitCode; in main() instead of exit.

Regarding that part:

This also raise another question for me, is such a code harmful?

Yes, because it is a BAD programming habit.
OS will clean up memory you failed to release, so as long as you haven't managed to eat all system memory and page file, you shouldn't damage OS. However, writing sloppy/leaky code might turn into habit, so relying on OS for cleaning up your mess is a bad idea.

Why is it a memory leak? What could I catch if I shall use such things in C++?

32 votes

I wonder, if I allocate memory for the pointer of some class/struct, why shall I get the memory leak?

For example:

class A { ... };
struct B { ... };

A *object1 = new A();
B object2 = *(new B());

So, we have some class or struct in C++ program. I learned C# firstly, C++ is my next step, as I understand, operator "new" in C++ isn't similar to C# ones.

Can you describe me the reason of the memory leak in the upper sample?

Best Regards,

What is happening

When you write T t; you're creating an object of type T with automatic storage duration. It will get cleaned up automatically when it goes out of scope.

When you write new T() you're creating an object of type T with dynamic storage duration. It won't get cleaned up automatically.

new without cleanup

You need to pass a pointer to it to delete in order to clean it up:

newing with delete

However, your second example is worse: you're dereferencing the pointer, and making a copy of the object. This way you lose the pointer to the object created with new, so you can never delete it even if you wanted!

newing with deref

What you should do

You should prefer automatic storage duration. Need a new object, just write:

A a; // a new object of type A
B b; // a new object of type B

If you do need dynamic storage duration, store the pointer to the allocated object in an automatic storage duration object that deletes it automatically.

template <typename T>
class automatic_pointer {
public:
    automatic_pointer(T* pointer) : pointer(pointer) {}

    // destructor: gets called upon cleanup
    // in this case, we want to use delete
    ~automatic_pointer() { delete pointer; }

    // emulate pointers!
    // with this we can write *p
    T& operator*() const { return *pointer; }
    // and with this we can write p->f()
    T* operator->() const { return pointer; }

private:
    T* pointer;

    // for this example, I'll just forbid copies
    // a smarter class could deal with this some other way
    automatic_pointer(automatic_pointer const&);
    automatic_pointer& operator=(automatic_pointer const&);
};

automatic_pointer<A> a(new A()); // acts like a pointer, but deletes automatically
automatic_pointer<B> b(new B()); // acts like a pointer, but deletes automatically

newing with automatic_pointer

This is a common idiom that goes by the not-very-descriptive name RAII (Resource Acquisition Is Initialization). When you acquire a resource that needs cleanup, you stick it in an object of automatic storage duration so you don't need to worry about cleaning it up. This applies to any resource, be it memory, open files, network connections, or whatever you fancy.

This automatic_pointer thing already exists in various forms, I've just provided it to give an example. A very similar class exists in the standard library called std::unique_ptr.

There's also an old one (pre-C++11) named auto_ptr but it's now deprecated because it has a strange copying behaviour.

And then there are some even smarter examples, like std::shared_ptr, that allows multiple pointers to the same object and only cleans it up when the last pointer is destroyed.

malloc & placement new vs. new

32 votes

I've been looking into this for the past few days, and so far I haven't really found anything convincing other than dogmatic arguments or appeals to tradition (i.e. "it's the C++ way!").

If I'm creating an array of objects, what is the compelling reason (other than ease) for using:

#define MY_ARRAY_SIZE 10

//  ...

my_object * my_array=new my_object [MY_ARRAY_SIZE];

for (int i=0;i<MY_ARRAY_SIZE;++i) my_array[i]=my_object(i);

over

#define MEMORY_ERROR -1
#define MY_ARRAY_SIZE 10

//  ...

my_object * my_array=(my_object *)malloc(sizeof(my_object)*MY_ARRAY_SIZE);
if (my_object==NULL) throw MEMORY_ERROR;

for (int i=0;i<MY_ARRAY_SIZE;++i) new (my_array+i) my_object (i);

As far as I can tell the latter is much more efficient than the former (since you don't initialize memory to some non-random value/call default constructors unnecessarily), and the only difference really is the fact that one you clean up with:

delete [] my_array;

and the other you clean up with:

for (int i=0;i<MY_ARRAY_SIZE;++i) my_array[i].~T();

free(my_array);

I'm out for a compelling reason. Appeals to the fact that it's C++ (not C) and therefore malloc and free shouldn't be used isn't -- as far as I can tell -- compelling as much as it is dogmatic. Is there something I'm missing that makes new [] superior to malloc?

I mean, as best I can tell, you can't even use new [] -- at all -- to make an array of things that don't have a default, parameterless constructor, whereas the malloc method can thusly be used.

If you do not want to get your memory initialized by implicit constructor calls, and just need an assured memory allocation for placement new then it is perfectly fine to use malloc and free instead of new[] and delete[].

The compelling reasons of using new over malloc is that new provides implicit initialization through constructor calls, saving you additional memset or related function calls post an malloc And that for new you do not need to check for NULL after every allocation, just enclosing exception handlers will do the job saving you redundant error checking unlike malloc.
These both compelling reasons do not apply to your usage.

which one is performance efficient can only be determined by profiling, there is nothing wrong in the approach you have now. On a side note I don't see a compelling reason as to why use malloc over new[] either.

Does there exist a static_warning?

29 votes

I'm aware of this question which mentions Boost's "STATIC WARNING", but I'd like to ask again, specifically, how I could implement a static_warning which operates similarly to static_assert but only emits a warning at compile time rather than an aborting compilation error.

I'd like something similar to Alexandrescu's proposal for a static assert in pre-C++11 days which somehow managed to print some useful contextual information as part of the error.

It would be acceptable to require that the user enable certain standard compiler warnings in order for this construction to work (perhaps "invalid pointer conversion" or "breaks strict aliasing rules") -- any warning that should be part of a normal compilation anyway can be used.

In short, I want static_warning(false, "Hello world"); to create a compiler warning that should somehow include the string "hello world" in the warning message. Is this possible, say in GCC and MSVC, and how?

I'd happily give out a small reward bounty for any particularly clever solution.


As a bit of explanation: I got the idea when thinking about this question: A static warning would be a useful way to trace through the compile-time process of complex template specializations, which are otherwise fairly hard to debug. A static warning could be used as a simple beacon for the compiler to emit "I'm now compiling this part of the code."


Update. Ideally, the warning would be triggered in the following setup:

template <typename T> struct Foo
{
    static_warning(std::is_pointer<T>::value, "Attempting to use pointer type.");
    // ...
};

int main() { Foo<int> a; Foo<int*> b; }

Playing off of Michael E's comment:

#if defined(__GNUC__)
#define DEPRECATE(foo, msg) foo __attribute__((deprecated(msg)))
#elif defined(_MSC_VER)
#define DEPRECATE(foo, msg) __declspec(deprecated(msg)) foo
#else
#error This compiler is not supported
#endif

#define PP_CAT(x,y) PP_CAT1(x,y)
#define PP_CAT1(x,y) x##y

namespace detail
{
    struct true_type {};
    struct false_type {};
    template <int test> struct converter : public true_type {};
    template <> struct converter<0> : public false_type {};
}

#define STATIC_WARNING(cond, msg) \
struct PP_CAT(static_warning,__LINE__) { \
  DEPRECATE(void _(::detail::false_type const& ),msg) {}; \
  void _(::detail::true_type const& ) {}; \
  PP_CAT(static_warning,__LINE__)() {_(::detail::converter<(cond)>());} \
}

// Note: using STATIC_WARNING_TEMPLATE changes the meaning of a program in a small way.
// It introduces a member/variable declaration.  This means at least one byte of space
// in each structure/class instantiation.  STATIC_WARNING should be preferred in any 
// non-template situation.
//  'token' must be a program-wide unique identifier.
#define STATIC_WARNING_TEMPLATE(token, cond, msg) \
    STATIC_WARNING(cond, msg) PP_CAT(PP_CAT(_localvar_, token),__LINE__)

The macro can be invoked at namespace, structure, and function scope. Given the input:

#line 1
STATIC_WARNING(1==2, "Failed with 1 and 2");
STATIC_WARNING(1<2, "Succeeded with 1 and 2");

struct Foo
{
  STATIC_WARNING(2==3, "2 and 3: oops");
  STATIC_WARNING(2<3, "2 and 3 worked");
};

void func()
{
  STATIC_WARNING(3==4, "Not so good on 3 and 4");
  STATIC_WARNING(3<4, "3 and 4, check");
}

template <typename T> struct wrap
{
  typedef T type;
  STATIC_WARNING(4==5, "Bad with 4 and 5");
  STATIC_WARNING(4<5, "Good on 4 and 5");
  STATIC_WARNING_TEMPLATE(WRAP_WARNING1, 4==5, "A template warning");
};

template struct wrap<int>;

GCC 4.6 (at default warning level) produces:

static_warning.cpp: In constructor ‘static_warning1::static_warning1()’:
static_warning.cpp:1:1: warning: ‘void static_warning1::_(const detail::false_type&)’ 
    is deprecated (declared at static_warning.cpp:1): Failed with 1 and 2 [-Wdeprecated-declarations]
static_warning.cpp: In constructor ‘Foo::static_warning6::static_warning6()’:
static_warning.cpp:6:3: warning: ‘void Foo::static_warning6::_(const detail::false_type&)’
    is deprecated (declared at static_warning.cpp:6): 2 and 3: oops [-Wdeprecated-declarations]
static_warning.cpp: In constructor ‘func()::static_warning12::static_warning12()’:
static_warning.cpp:12:3: warning: ‘void func()::static_warning12::_(const detail::false_type&)’ 
    is deprecated (declared at static_warning.cpp:12): Not so good on 3 and 4 [-Wdeprecated-declarations]
static_warning.cpp: In constructor ‘wrap<T>::static_warning19::static_warning19() [with T = int]’:
static_warning.cpp:24:17:   instantiated from here
static_warning.cpp:19:3: warning: ‘void wrap<T>::static_warning19::_(const detail::false_type&) [with T = int]’ 
    is deprecated (declared at static_warning.cpp:19): Bad with 4 and 5 [-Wdeprecated-declarations]

While Visual C++ 2010 (at /W3 or above) says:

warnproj.cpp(1): warning C4996: 'static_warning1::_': Failed with 1 and 2
warnproj.cpp(1) : see declaration of 'static_warning1::_'
warnproj.cpp(6): warning C4996: 'Foo::static_warning6::_': 2 and 3: oops
warnproj.cpp(6) : see declaration of 'Foo::static_warning6::_'
warnproj.cpp(12): warning C4996: 'func::static_warning12::_': Not so good on 3 and 4
warnproj.cpp(12) : see declaration of 'func::static_warning12::_'
warnproj.cpp(19): warning C4996: 'wrap<T>::static_warning19::_': Bad with 4 and 5
    with
    [
        T=int
    ]
warnproj.cpp(19) : see declaration of 'wrap<T>::static_warning19::_'
    with
    [
        T=int
    ]
warnproj.cpp(19) : while compiling class template member function 'wrap<T>::static_warning19::static_warning19(void)'
    with
    [
        T=int
    ]
warnproj.cpp(24) : see reference to class template instantiation 'wrap<T>::static_warning19' being compiled
    with
    [
        T=int
    ]

Clang++ 3.1 on Linux produces the arguably nicer output (color not shown):

tst3.cpp:1:1: warning: '_' is deprecated: Failed with 1 and 2
      [-Wdeprecated-declarations]
STATIC_WARNING(1==2, "Failed with 1 and 2");
^
tst3.cpp:24:38: note: expanded from macro 'STATIC_WARNING'
  PP_CAT(static_warning,__LINE__)() {_(::detail::converter<(cond)>());} \
                                     ^
tst3.cpp:6:3: warning: '_' is deprecated: 2 and 3: oops
      [-Wdeprecated-declarations]
  STATIC_WARNING(2==3, "2 and 3: oops");
  ^
tst3.cpp:24:38: note: expanded from macro 'STATIC_WARNING'
  PP_CAT(static_warning,__LINE__)() {_(::detail::converter<(cond)>());} \
                                     ^
tst3.cpp:12:3: warning: '_' is deprecated: Not so good on 3 and 4
      [-Wdeprecated-declarations]
  STATIC_WARNING(3==4, "Not so good on 3 and 4");
  ^
tst3.cpp:24:38: note: expanded from macro 'STATIC_WARNING'
  PP_CAT(static_warning,__LINE__)() {_(::detail::converter<(cond)>());} \
                                     ^
tst3.cpp:19:3: warning: '_' is deprecated: Bad with 4 and 5
      [-Wdeprecated-declarations]
  STATIC_WARNING(4==5, "Bad with 4 and 5");
  ^
tst3.cpp:24:38: note: expanded from macro 'STATIC_WARNING'
  PP_CAT(static_warning,__LINE__)() {_(::detail::converter<(cond)>());} \
                                     ^
tst3.cpp:23:17: note: in instantiation of member function
      'wrap<int>::static_warning19::static_warning19' requested here
template struct wrap<int>
                ^
4 warnings generated.

Why is 'X x; x();' allowed, when 'X' defines a conversion to function pointer, but not, when it defines a conversion to a functor?

26 votes
void f(int){}
typedef void (*f_ptr)(int);

struct Functor{
  void operator()(int){}
};

struct X{
  operator f_ptr(){ return f; }
};

struct Y{
  operator Functor(){ return Functor(); }
};

int main(){
  X x; Y y;
  x(5); // works ?!
  y(5); // doesn't ?!
}

Live example on Ideone. Output:

error: no match for call to '(Y) (int)'

Q1: Why is the call to x(5) allowed, even though X only defines a conversion to function pointer, and not operator()?

Q2: Conversely, why is the same thing not allowed, if we define a conversion to another functor?

x(5); // works ?!

This implicitly casts x to an f_ptr and calls that. C++11 standard:

§ 13.3.1.1.2 Call to object of class type [over.call.object]

2) In addition, for each non-explicit conversion function declared in T of the form

operator conversion-type-id ( ) attribute-specifier-seqopt cv-qualifier ;

[…where conversion-type-id denotes the type “pointer to function of (P1,...,Pn) returning R”…]


y(5); // doesn't ?!

The standard doesn't mention anything about implicit conversion to class types that overload operator() (aka functors), which implies that the compiler doesn't allow that.

You must cast it explicitly:

static_cast<Functor>(y)(5);

Isn't "const" redundant when passing by value?

24 votes

I was reading my C++ book (Deitel) when I came across a function to calculate the volume of a cube. The code is the following:

double cube (const double side){
    return side * side * side;
}

The explanation for using the "const" qualifier was this one: "The const qualified should be used to enforce the principle of least privilege, telling the compiler that the function does not modify variable side".

My question: isn't the use of "const" redundant/unnecessary here since the variable is being passed by value, so the function can't modify it anyway?

The const also prevents code inside the function from modifying the parameter itself. When a function is larger than trivial size, such an assurance helps you to quickly read and understand a function. If you know that the value of side won't change, then you don't have to worry about keeping track of its value over time as you read. Under some circumstances, this might even help the compiler generate better code.

A non-trivial number of people do this as a matter of course, considering it generally good style.

Why is this C code faster than this C++ code ? getting biggest line in file

23 votes

I have two versions of a program that does basically the same thing, getting the biggest length of a line in a file, I have a file with about 8 thousand lines, my code in C is a little bit more primitive (of course!) than the code I have in C++. The C programm takes about 2 seconds to run, while the program in C++ takes 10 seconds to run (same file I am testing with for both cases). But why? I was expecting it to take the same amount of time or a little bit more but not 8 seconds slower!

my code in C:

#include <stdio.h>
#include <stdlib.h> 
#include <string.h>

#if _DEBUG
    #define DEBUG_PATH "../Debug/"
#else
    #define DEBUG_PATH ""
#endif

const char FILE_NAME[] = DEBUG_PATH "data.noun";

int main()
{   
    int sPos = 0;
    int maxCount = 0;
    int cPos = 0;
    int ch;
    FILE *in_file;              

    in_file = fopen(FILE_NAME, "r");
    if (in_file == NULL) 
    {
        printf("Cannot open %s\n", FILE_NAME);
        exit(8);
    }       

    while (1) 
    {
        ch = fgetc(in_file);
        if(ch == 0x0A || ch == EOF) // \n or \r or \r\n or end of file
        {           
            if ((cPos - sPos) > maxCount)
                maxCount = (cPos - sPos);

            if(ch == EOF)
                break;

            sPos = cPos;
        }
        else
            cPos++;
    }

    fclose(in_file);

    printf("Max line length: %i\n",  maxCount); 

    getch();
    return (0);
}

my code in C++:

#include <iostream>
#include <fstream>
#include <stdio.h>
#include <string>

using namespace std;

#ifdef _DEBUG
    #define FILE_PATH "../Debug/data.noun"
#else
    #define FILE_PATH "data.noun"
#endif

int main()
{
    string fileName = FILE_PATH;
    string s = "";
    ifstream file;
    int size = 0;

    file.open(fileName.c_str());
    if(!file)
    {
        printf("could not open file!");
        return 0;
    }

    while(getline(file, s) )
            size = (s.length() > size) ? s.length() : size;
    file.close();

    printf("biggest line in file: %i", size);   

    getchar();
    return 0;
}

The C++ version constantly allocates and deallocates instances of std::string. Memory allocation is a costly operation. In addition to that the constructors/destructors are executed.

The C version however uses constant memory, and just does was necessary: Reading in single characters, setting the line-length counter to the new value if higher, for each newline and that's it.

Array placement-new requires unspecified overhead in the buffer?

22 votes

5.3.4 [expr.new] of the C++11 Feb draft gives the example:

new(2,f) T[5] results in a call of operator new[](sizeof(T)*5+y,2,f).

Here, x and y are non-negative unspecified values representing array allocation overhead; the result of the new-expression will be offset by this amount from the value returned by operator new[]. This overhead may be applied in all array new-expressions, including those referencing the library function operator new[](std::size_t, void*) and other placement allocation functions. The amount of overhead may vary from one invocation of new to another. —end example ]

Now take the following example code:

void* buffer = malloc(sizeof(std::string) * 10);
std::string* p = ::new (buffer) std::string[10];

According to the above quote, the second line new (buffer) std::string[10] will internally call operator new[](sizeof(std::string) * 10 + y, buffer) (before constructing the individual std::string objects). The problem is that if y > 0, the pre-allocated buffer will be too small!

So how do I know how much memory to pre-allocate when using array placement-new?

void* buffer = malloc(sizeof(std::string) * 10 + how_much_additional_space);
std::string* p = ::new (buffer) std::string[10];

Or does the standard somewhere guarantee that y == 0 in this case? Again, the quote says:

This overhead may be applied in all array new-expressions, including those referencing the library function operator new[](std::size_t, void*) and other placement allocation functions.

Don't use operator new[](std::size_t, void* p) unless you know a-priori the answer to this question. The answer is an implementation detail and can change with compiler/platform. Though it is typically stable for any given platform. E.g. this is something specified by the Itanium ABI.

If you don't know the answer to this question, write your own placement array new that can check this at run time:

inline
void*
operator new[](std::size_t n, void* p, std::size_t limit)
{
    if (n <= limit)
        std::cout << "life is good\n";
    else
        throw std::bad_alloc();
    return p;
}

int main()
{
    alignas(std::string) char buffer[100];
    std::string* p = new(buffer, sizeof(buffer)) std::string[3];
}

By varying the array size and inspecting n in the example above, you can infer y for your platform. For my platform y is 1 word. The sizeof(word) varies depending on whether I'm compiling for a 32 bit or 64 bit architecture.

Why is it ambiguous to call overloarded ambig(long) and ambig(unsigned long) with an integer literal?

21 votes

When compiling

void ambig(  signed long) { }
void ambig(unsigned long) { }

int main(void) { ambig(-1); return 0; }

I get

error C2668: 'ambig' : ambiguous call to overloaded function
    could be 'void ambig(unsigned long)'
    or 'void ambig(long)'
while trying to match the argument list '(int)'

I know I can 'fix' it by saying -1L instead of -1, but why/how exactly is this considered ambiguous in the first place?

You're passing an int to this overloaded function.

Although human intuition says that ambig(signed long) ought to be preferred because your input is a negative integer (which cannot be represented as such by an unsigned long), the two conversions are in fact equivalent in "precedence" in C++.

That is, the conversion intunsigned long is considered to be just as valid as intsigned long, and neither is preferred to the other.

On the other hand, if your parameter were already a long rather than an int, then there is an exact match to signed long, with no conversion necessary. This avoids the ambiguity.

void ambig(  signed long) { }
void ambig(unsigned long) { }

int main(void) { ambig(static_cast<long>(-1)); return 0; }

"Just one of those things".


[C++11: 4.13/1]: ("Integer conversion rank")

Every integer type has an integer conversion rank defined as follows:

  • [..]
  • The rank of a signed integer type shall be greater than the rank of any signed integer type with a smaller size.
  • The rank of long long int shall be greater than the rank of long int, which shall be greater than the rank of int, which shall be greater than the rank of short int, which shall be greater than the rank of signed char.
  • The rank of any unsigned integer type shall equal the rank of the corresponding signed integer type.
  • [..]

[ Note: The integer conversion rank is used in the definition of the integral promotions (4.5) and the usual arithmetic conversions (Clause 5). —end note ]

Overload resolution is complex, and is defined in [C++11: 13.3]; I shan't bore you by quoting a majority of it here.

Here's a highlight, though:

[C++11: 13.3.3.1/8]: If no conversions are required to match an argument to a parameter type, the implicit conversion sequence is the standard conversion sequence consisting of the identity conversion (13.3.3.1.1).

[C++11: 13.3.3.1/9]: If no sequence of conversions can be found to convert an argument to a parameter type or the conversion is otherwise ill-formed, an implicit conversion sequence cannot be formed.

[C++11: 13.3.3.1/10]: If several different sequences of conversions exist that each convert the argument to the parameter type, the implicit conversion sequence associated with the parameter is defined to be the unique conversion sequence designated the ambiguous conversion sequence. For the purpose of ranking implicit conversion sequences as described in 13.3.3.2, the ambiguous conversion sequence is treated as a user-defined sequence that is indistinguishable from any other user-defined conversion sequence134. If a function that uses the ambiguous conversion sequence is selected as the best viable function, the call will be ill-formed because the conversion of one of the arguments in the call is ambiguous.

  • /10 is the case you're experiencing; /8 is the case you use with a long argument.

How do smart pointers choose between delete and delete[]?

21 votes

Consider:

delete new std :: string [2];
delete [] new std :: string;

Everyone knows the first is an error. If the second wasn't an error, we wouldn't need two distinct operators.

Now consider:

std :: unique_ptr <int> x (new int [2]);
std :: unique_ptr <int> y (new int);

Does x know to use delete[] as opposed to delete?


Background: this question floated through my head when I thought array type qualification of pointers would be a handy language feature.

int *[] foo = new int [2]; // OK
int *   bar = new int;     // OK
delete [] foo;             // OK
delete bar;                // OK
foo = new int;             // Compile error
bar = new int[2];          // Compile error
delete foo;                // Compile error
delete [] bar;             // Compile error

Unfortunately, they don't know what delete to use therefore they use delete. That's why for each smart pointer we have a smart array counterpart.

std::shared_ptr uses delete
std::shared_array uses delete[]

So, your line

std :: unique_ptr <int> x (new int [2]);

actually causes undefined behavior.

Incidentally, if you write

std :: unique_ptr<int[]> p(new int[2]);
                     ^^

then delete[] will be used since you've explicitly requested that. However, the following line will still be UB.

std :: unique_ptr<int[]> p(new int);

The reason that they can't choose between delete and delete[] is that new int and new int[2] are exactly of the same type - int*.

Here's a related question of using correct deleters in case of smart_ptr<void> and smart_ptr<Base> when Base has no virtual destructor.

Tools to generate higher-quality error messages for template-based code?

20 votes

Concepts, that would render these tools unnecessary, are not part of C++11.

  • STLFilt would have been one option but it is no longer maintained.

  • Clang claims to give expressive diagnostics although important C++11 features are not available yet.

  • colorgcc seems to be abandoned since 1999.

What production quality tools are available to decipher error messages stemming from template-based code? Eclipse-CDT support would be nice too. :)

If I give up on C++11, what options do I have for C++98?


Related questions:

Let's have a stab at an answer (I marked this community wiki so we get a good response together)...

I'm working since a long time with templates and error messages have generally improved in some way or another:

  • Writing a stack of errors creates a lot more text but also typically includes the level the user is looking at and this generally includes a hint at what the actual problem is. Given that the compiler only sees a translation unit tossed at it, there isn't a lot which can be done determining which error in the stack is the one most suitable for the user.
  • Using concept checkers, i.e. classes or functions which exercise all the required members of template arguments and possibly generating errors messages using static_assert() give the template author a way to tell users about assumptions which apparently don't hold.
  • Telling the user about types he writes rather than expanding all typedefs as the compiler like to see at the lowest level also helps. clang is rather good at this and actually gives you error messages e.g. talking about std::string rather than expanding things type to whatever it ends up to be.

A combination of the technique actually causes e.g. clang to create quite decent error message (even if it doesn't implement C++2011, yet; however, no compiler does and as far as I can tell gcc and clang are leading the pack). I know other compiler developers actively work on improving the template error messages as lots of programmers have discovered that templates actually are a huge leap forward even though the error messages are something which takes a bit of getting used to.

One problem tools like stlfilt face is that C++ compilers and libraries are under active development. This results in error messages shifting all the time, causing the tool to receive different outputs. While it is good that compiler writers work on improving error messages, it certainly makes life harder for people who try to work from the error messages they got. There is another side to this as well: once a certain error pattern is detected to be common and is picked up e.g. by stlfilt (well, it isn't actively maintained as far as I know) compiler writers are probably keen to report the errors following these patterns directly, possibly also providing additional information available to the compiler but not emitted before. Put differently, I would expect that compiler writers are quite receptive to reports from users describing common error situations and how they are best reported. The compiler writers may not encounter the errors themselves because the code they are working on is actually C (e.g. gcc is implemented in C) or because they are so used to certain template techniques that they avoid certain errors (e.g. omission of typename for dependent types).

Finally, to address the question about concrete tools: the main "tool" I'm using when I get kind of stuck with a compiler complaining about some template instantiation is to use different compilers! Although it isn't always the case but often one compiler reports an entirely incomprehensible error messages which only makes sense after seeing the fairly concise report from another compiler (in case you are interested, I regularly use recent version of gcc, clang, and EDG for this). I'm not aware of a readily packaged too like stlfilt, however.

C++03. Test for rvalue-vs-lvalue at compile-time, not just at runtime

20 votes

In C++03, Boost's Foreach, using this interesting technique, can detect at run-time whether an expression is an lvalue or an rvalue. (I found that via this StackOverflow question: Rvalues in C++03 )

Here's a demo of this working at run-time

(This is a more basic question that arose while I was thinking about this other recent question of mine. An answer to this might help us answer that other question.)

Now that I've spelled out the question, testing rvalue-ness in C++03 at compile-time, I'll talk a little about the things I've been trying so far.

I want to be able to do this check at compile-time. It's easy in C++11, but I'm curious about C++03.

I'm trying to build upon their idea, but would be open to different approaches also. The basic idea of their technique is to put this code into a macro:

true ? rvalue_probe() : EXPRESSION;

It is 'true' on the left of the ?, and therefore we can be sure that EXPRESSION will never be evaluated. But the interesting thing is that the ?: operator behaves differently depending on whether its parameters are lvalues or rvalues (click that link above for details). In particular, it will convert our rvalue_probe object in one of two ways, depending on whether EXPRESSION is an lvalue or not:

struct rvalue_probe
{
    template< class R > operator       R () { throw "rvalue"; }
    template< class L > operator       L & () const { throw "lvalue"; }
    template< class L > operator const L & () const { throw "const lvalue"; }
};

That works at runtime because the thrown text can be caught and used to analyze whether the EXPRESSION was an lvalue or an rvalue. But I want some way to identify, at compile-time, which conversion is being used.

Now, this is potentially useful because it means that, instead of asking

Is EXPRESSION an rvalue?

we can ask:

When the compiler is compiling true ? rvalue_probe() : EXPRESSION, which of the two overloaded operators, operator X or operator X&, is selected?

( Ordinarily, you could detect which method was called by changing the return types and getting the sizeof it. But we can't do that with these conversion operators, especially when they're buried inside the ?:. )

I thought I might be able to use something like

is_reference< typeof (true ? rvalue_probe() : EXPRESSION) > :: type

If the EXPRESSION is an lvalue, then the operator& is selected and I hoped that the whole expression would then be a & type. But it doesn't seem to work. ref types and non-ref types are pretty hard (impossible?) to distinguish, especially now that I'm trying to dig inside a ?: expression to see which conversion was selected.

Here's the demo code pasted here:

#include <iostream>
using namespace std;
struct X {
        X(){}
};

X x;
X & xr = x;
const X xc;

      X   foo()  { return x; }
const X   fooc() { return x; }
      X & foor()  { return x; }
const X & foorc() { return x; }

struct rvalue_probe
{
        template< class R > operator       R () { throw "rvalue"; }
        // template< class R > operator R const () { throw "const rvalue"; } // doesn't work, don't know why
        template< class L > operator       L & () const { throw "lvalue"; }
        template< class L > operator const L & () const { throw "const lvalue"; }
};

typedef int lvalue_flag[1];
typedef int rvalue_flag[2];
template <typename T> struct isref     { static const int value = 0; typedef lvalue_flag type; };
template <typename T> struct isref<T&> { static const int value = 1; typedef rvalue_flag type; };

int main() {
        try{ true ? rvalue_probe() : x;       } catch (const char * result) { cout << result << endl; } // Y lvalue
        try{ true ? rvalue_probe() : xc;      } catch (const char * result) { cout << result << endl; } // Y const lvalue
        try{ true ? rvalue_probe() : xr;      } catch (const char * result) { cout << result << endl; } // Y       lvalue
        try{ true ? rvalue_probe() : foo();   } catch (const char * result) { cout << result << endl; } // Y rvalue
        try{ true ? rvalue_probe() : fooc();  } catch (const char * result) { cout << result << endl; } // Y rvalue
        try{ true ? rvalue_probe() : foor();  } catch (const char * result) { cout << result << endl; } // Y lvalue
        try{ true ? rvalue_probe() : foorc(); } catch (const char * result) { cout << result << endl; } // Y const lvalue

}

(I had some other code here at the end, but it's just confusing things. You don't really want to see my failed attempts at an answer! The above code demonstrates how it can test lvalue-versus-rvalue at runtime.)

It took some effort, but here's a tested and working is_lvalue macro that correctly handles const struct S function return types. It relies on const struct S rvalues not binding to const volatile struct S&, while const struct S lvalues do.

#include <cassert>

template <typename T>
struct nondeducible
{
  typedef T type;
};

char (& is_lvalue_helper(...))[1];

template <typename T>
char (& is_lvalue_helper(T&, typename nondeducible<const volatile T&>::type))[2];

#define is_lvalue(x) (sizeof(is_lvalue_helper((x),(x))) == 2)

struct S
{
  int i;
};

template <typename T>
void test_()
{
  T a = {0};
  T& b = a;
  T (* c)() = 0;
  T& (* d)() = 0;
  assert (is_lvalue(a));
  assert (is_lvalue(b));
  assert (!is_lvalue(c()));
  assert (is_lvalue(d()));
}

template <typename T>
void test()
{
  test_<T>();
  test_<const T>();
  test_<volatile T>();
  test_<const volatile T>();
}

int main()
{
  test<int>();
  test<S>();
}

Edit: unnecessary extra parameter removed, thanks Xeo.

Edit again: As per the comments, this works with GCC but relies on unspecified behaviour in C++03 (it's valid C++11) and fails some other compilers. Extra parameter restored, which makes it work in more cases. const class rvalues give a hard error on some compilers, and give the correct result (false) on others.

Most accurate way to do a combined multiply-and-divide operation in 64-bit?

18 votes

What is the most accurate way I can do a multiply-and-divide operation for 64-bit integers that works in both 32-bit and 64-bit programs (in Visual C++)? (In case of overflow, I need the result mod 264.)

(I'm looking for something like MulDiv64, except that this one uses inline assembly, which only works in 32-bit programs.)

Obviously, casting to double and back is possible, but I'm wondering if there's a more accurate way that isn't too complicated. (i.e. I'm not looking for arbitrary-precision arithmetic libraries here!)

Since this is tagged Visual C++ I'll give a solution that abuses MSVC-specific intrinsics.

This example is fairly complicated. It's a highly simplified version of the same algorithm that is used by GMP and java.math.BigInteger for large division.

Although I have a simpler algorithm in mind, it's probably about 30x slower.

This solution has the following constraints/behavior:

  • It requires x64. It will not compile on x86.
  • The quotient is not zero.
  • The quotient saturates if it overflows 64-bits.

Note that this is for the unsigned integer case. It's trivial to build a wrapper around this to make it work for signed cases as well. This example should also produce correctly truncated results.

This code is not fully tested. However, it has passed all the tests cases that I've thrown at it.
(Even cases that I've intentionally constructed to try to break the algorithm.)

#include <intrin.h>

uint64_t muldiv2(uint64_t a, uint64_t b, uint64_t c){
    //  Normalize divisor
    unsigned long shift;
    _BitScanReverse64(&shift,c);
    shift = 63 - shift;

    c <<= shift;

    //  Multiply
    a = _umul128(a,b,&b);
    if (((b << shift) >> shift) != b){
        cout << "Overflow" << endl;
        return 0xffffffffffffffff;
    }
    b = __shiftleft128(a,b,shift);
    a <<= shift;


    uint32_t div;
    uint32_t q0,q1;
    uint64_t t0,t1;

    //  1st Reduction
    div = (uint32_t)(c >> 32);
    t0 = b / div;
    if (t0 > 0xffffffff)
        t0 = 0xffffffff;
    q1 = (uint32_t)t0;
    while (1){
        t0 = _umul128(c,(uint64_t)q1 << 32,&t1);
        if (t1 < b || (t1 == b && t0 <= a))
            break;
        q1--;
//        cout << "correction 0" << endl;
    }
    b -= t1;
    if (t0 > a) b--;
    a -= t0;

    if (b > 0xffffffff){
        cout << "Overflow" << endl;
        return 0xffffffffffffffff;
    }

    //  2nd reduction
    t0 = ((b << 32) | (a >> 32)) / div;
    if (t0 > 0xffffffff)
        t0 = 0xffffffff;
    q0 = (uint32_t)t0;

    while (1){
        t0 = _umul128(c,q0,&t1);
        if (t1 < b || (t1 == b && t0 <= a))
            break;
        q0--;
//        cout << "correction 1" << endl;
    }

//    //  (a - t0) gives the modulus.
//    a -= t0;

    return ((uint64_t)q1 << 32) | q0;
}

Note that if you don't need a perfectly truncated result, you can remove the last loop completely. If you do this, the answer will be no more than 2 larger than the correct quotient.

Test Cases:

cout << muldiv2(4984198405165151231,6132198419878046132,9156498145135109843) << endl;
cout << muldiv2(11540173641653250113, 10150593219136339683, 13592284235543989460) << endl;
cout << muldiv2(449033535071450778, 3155170653582908051, 4945421831474875872) << endl;
cout << muldiv2(303601908757, 829267376026, 659820219978) << endl;
cout << muldiv2(449033535071450778, 829267376026, 659820219978) << endl;
cout << muldiv2(1234568, 829267376026, 1) << endl;
cout << muldiv2(6991754535226557229, 7798003721120799096, 4923601287520449332) << endl;
cout << muldiv2(9223372036854775808, 2147483648, 18446744073709551615) << endl;
cout << muldiv2(9223372032559808512, 9223372036854775807, 9223372036854775807) << endl;
cout << muldiv2(9223372032559808512, 9223372036854775807, 12) << endl;
cout << muldiv2(18446744073709551615, 18446744073709551615, 9223372036854775808) << endl;

Output:

3337967539561099935
8618095846487663363
286482625873293138
381569328444
564348969767547451
1023786965885666768
11073546515850664288
1073741824
9223372032559808512
Overflow
18446744073709551615
Overflow
18446744073709551615

How does an application launcher update itself?

16 votes

Launchers are most common in games. Think of League of Legends, Starcraft II, or almost any MMO out there. Before starting the actual game, you have a small launcher app that takes care of updates and patching.

I want to move this direction with a particular non-game application I am developing. The concept of the launcher makes perfect sense: it checks for updates, replaces the appropriate binaries/libraries, maybe runs an integrity check, and launches the application. However, how do launchers update themselves? This tends to be a rare event, but how is it done? Does the launcher literally just write over the very binary it is currently running from? Or is there some kind of swap step after the download? I need to be able to push out (rare) updates to the launcher (especially if I discover some bug in my launcher).

My particular project will be in C#, but I am interested in conceptually similar C++ and/or Java solutions as well for future reference.

I've never tried, but this is what I would guess (assuming you can't overwrite a file being executed. If you can, this is all simpler)

Updater A checks if its the newest version
If launcher isnt the newest version
    Download the differences (to save bandwidth) to file B
    Apply the delta to own code into file C
    Launch file C.
    Close
If file C exists (update happened recently)
    Try to delete C  (update was previous launch, delete temporary file)
    If delete fails  (We are C, means A is out of date)
        Copy C over A  (update launcher)
If game isnt newest version
    Download the differences (to save bandwidth) to file B
    Apply the delta to game into file D
    Rename game->E (no-fail-swap-trick)
    Rename D->game
    delete file E
Run game

André Caron has shown me that the swap trick is done better with transactional file IO.

Can adding 'const' to a pointer help the optimization?

15 votes

I have a pointer int* p, and do some operations in a loop. I do not modify the memory, just read. If I add const to the pointer (both cases, const int* p, and int* const p), can it help a compiler to optimize the code?

I know other merits of const, like safety or self-documentation, I ask about this particular case. Rephrasing the question: can const give the compiler any useful (for optimization) information, ever?

While this is obviously specific to the implementation, it is hard to see how changing a pointer from int* to int const* could ever provide any additional information that the compiler would not otherwise have known.

In both cases, the value pointed to can change during the execution of the loop.

Therefore it probably will not help the compiler optimize the code.

gcc rounding difference between versions

14 votes

I'm looking into why a test case is failing

The problematic test can be reduced to doing (4.0/9.0) ** (1.0/2.6), rounding this to 6 digits and checking against a known value (as a string):

#include<stdio.h>
#include<math.h>
int main(){
    printf("%.06f\n", powf(4.0/9.0, (1.0/2.6)));
}

If I compile and run this in gcc 4.1.2 on Linux, I get:

0.732057

Python agrees, as does Wolfram|Alpha:

$ python2.7 -c 'print "%.06f" % (4.0/9.0)**(1/2.6)'
0.732057

However I get the following result on gcc 4.4.0 on Linux, and 4.2.1 on OS X:

0.732058

A double acts identically (although I didn't test this extensively)

I'm not sure how to narrow this down any further.. Is this a gcc regression? A change in rounding algorithm? Me doing something silly?

Edit: Printing the result to 12 digits, the digit at the 7th place is 4 vs 5, which explains the rounding difference, but not the value difference:

gcc 4.1.2:

0.732057452202

gcc 4.4.0:

0.732057511806

Here's the gcc -S output from both versions: https://gist.github.com/1588729

Recent gcc version are able to use mfpr to do compile time floating point computation. My guess is that your recent gcc does that and use an higher precision for the compile time version. This is allowed by the at least the C99 standard (I've not looked in other one if it was modified)

6.3.1.8/2 in C99

The values of floating operands and of the results of floating expressions may be represented in greater precision and range than that required by the type; the types are not changed thereby.

Edit: your gcc -S results confirm that. I haven't checked the computations, but the old one has (after substituting memory for its constant content)

movss 1053092943, %xmm1
movss 1055100473, %xmm0
call powf

calling powf with the precomputed values for 4/9.0 and 1/2.6 and then printing the result after promotion to double, while the new one just print the float 0x3f3b681f promoted to double.