Best questions in December 2011

Why is one loop so much slower than two loops?

381 votes

Suppose a1, b1, c1, and d1 point to heap memory and my numerical code has the following core loop.

const int n=100000

for(int j=0;j<n;j++){
    a1[j] += b1[j];
    c1[j] += d1[j];
}

This loop is executed 10,000 times via another outer for loop. To speed it up, I changed the code to:

for(int j=0;j<n;j++){
    a1[j] += b1[j];
}
for(int j=0;j<n;j++){
    c1[j] += d1[j];
}

Compiled on MS Visual C++ 10.0 with full optimization and SSE2 enabled for 32-bit on a Intel Core 2 Duo (x64), the first example takes 5.5 seconds and the double-loop example takes only 1.9 seconds. My question is: (Please refer to the my rephrased question at the bottom)

PS: I am not sure, if this helps:

Disassembly for the first loop basically looks like this (this block is repeated about five times in the full program):

movsd       xmm0,mmword ptr [edx+18h]
addsd       xmm0,mmword ptr [ecx+20h]
movsd       mmword ptr [ecx+20h],xmm0
movsd       xmm0,mmword ptr [esi+10h]
addsd       xmm0,mmword ptr [eax+30h]
movsd       mmword ptr [eax+30h],xmm0
movsd       xmm0,mmword ptr [edx+20h]
addsd       xmm0,mmword ptr [ecx+28h]
movsd       mmword ptr [ecx+28h],xmm0
movsd       xmm0,mmword ptr [esi+18h]
addsd       xmm0,mmword ptr [eax+38h]

Each loop of the double loop example produces this code (the following block is repeated about three times):

addsd       xmm0,mmword ptr [eax+28h]
movsd       mmword ptr [eax+28h],xmm0
movsd       xmm0,mmword ptr [ecx+20h]
addsd       xmm0,mmword ptr [eax+30h]
movsd       mmword ptr [eax+30h],xmm0
movsd       xmm0,mmword ptr [ecx+28h]
addsd       xmm0,mmword ptr [eax+38h]
movsd       mmword ptr [eax+38h],xmm0
movsd       xmm0,mmword ptr [ecx+30h]
addsd       xmm0,mmword ptr [eax+40h]
movsd       mmword ptr [eax+40h],xmm0

EDIT: The question turned out to be of no relevance, as the behavior severely depends on the sizes of the arrays (n) and the CPU cache. So if there is further interest, I rephrase the question:

Could you provide some solid insight into the details that lead to the different cache behaviors as illustrated by the five regions on the following graph?

It might also be interesting to point out the differences between CPU/cache architectures, by providing a similar graph for these CPUs.

PPS: The full code is at http://pastebin.com/ivzkuTzG. It uses TBB Tick_Count for higher resolution timing, which can be disabled by not defining the TBB_TIMING Macro.

(It shows FLOP/s for different values of n.)

enter image description here

Upon further analysis of this, I believe this is (at least partially) caused by data alignment of the four pointers. This will cause some level of cache bank/way conflicts.

If I've guessed correctly on how you are allocating your arrays, they are likely to be aligned to the page line.

This means that all your accesses in each loop will fall on the same cache way. However, Intel processors have had 8-way L1 cache associativity for a while. But in reality, the performance isn't completely uniform. Accessing 4-ways is still slower than say 2-ways.

EDIT : It does in fact look like you are allocating all the arrays separately. Usually when such large allocations are requested, the allocator will request fresh pages from the OS. Therefore, there is a high chance that large allocations will appear at the same offset from a page-boundary.

Here's the test code:

int main(){
    const int n = 100000;

#ifdef ALLOCATE_SEPERATE
    double *a1 = (double*)malloc(n * sizeof(double));
    double *b1 = (double*)malloc(n * sizeof(double));
    double *c1 = (double*)malloc(n * sizeof(double));
    double *d1 = (double*)malloc(n * sizeof(double));
#else
    double *a1 = (double*)malloc(n * sizeof(double) * 4);
    double *b1 = a1 + n;
    double *c1 = b1 + n;
    double *d1 = c1 + n;
#endif

    //  Zero the data to prevent any chance of denormals.
    memset(a1,0,n * sizeof(double));
    memset(b1,0,n * sizeof(double));
    memset(c1,0,n * sizeof(double));
    memset(d1,0,n * sizeof(double));

    //  Print the addresses
    cout << a1 << endl;
    cout << b1 << endl;
    cout << c1 << endl;
    cout << d1 << endl;

    clock_t start = clock();

    int c = 0;
    while (c++ < 10000){

#if ONE_LOOP
        for(int j=0;j<n;j++){
            a1[j] += b1[j];
            c1[j] += d1[j];
        }
#else
        for(int j=0;j<n;j++){
            a1[j] += b1[j];
        }
        for(int j=0;j<n;j++){
            c1[j] += d1[j];
        }
#endif

    }

    clock_t end = clock();
    cout << "seconds = " << (double)(end - start) / CLOCKS_PER_SEC << endl;

    system("pause");
    return 0;
}

Benchmark Results:

EDIT: Results on an actual Core 2 architecture machine:

2 x Intel Xeon X5482 Harpertown @ 3.2 GHz:

#define ALLOCATE_SEPERATE
#define ONE_LOOP
00600020
006D0020
007A0020
00870020
seconds = 6.206

#define ALLOCATE_SEPERATE
//#define ONE_LOOP
005E0020
006B0020
00780020
00850020
seconds = 2.116

//#define ALLOCATE_SEPERATE
#define ONE_LOOP
00570020
00633520
006F6A20
007B9F20
seconds = 1.894

//#define ALLOCATE_SEPERATE
//#define ONE_LOOP
008C0020
00983520
00A46A20
00B09F20
seconds = 1.993

Observations:

  • 6.206 seconds with one loop and 2.116 seconds with two loops. This reproduces the OP's results exactly.

  • In the first two tests, the arrays are allocated separately. You'll notice that they all have the same alignment relative to the page.

  • In the second two tests, the arrays are packed together to break that alignment. Here you'll notice both loops are faster. Furthermore, the second (double) loop is now the slower one as you would normally expect.

As @Stephen Cannon points out in the comments, there is very likely possibility that this alignment causes false aliasing in the load/store units or the cache. I Googled around for this and found that Intel actually has a hardware counter for partial address aliasing stalls:

http://software.intel.com/sites/products/documentation/hpc/amplifierxe/en-us/lin/ug_docs/reference/pmw_dp/events/partial_address_alias.html


5 Regions - Explanations

Region 1:

This one is easy. The dataset is so small that the performance is dominated by overhead like looping and branching.

Region 2:

Here, as the data sizes increases, the amount of relative overhead goes down and the performance "saturates". Here two loops is slower because it has twice as much loop and branching overhead.

I'm not sure exactly what's going on here... Alignment could still play an effect as Agner Fog mentions cache bank conflicts. (That link is about Sandy Bridge, but the idea should still be applicable to Core 2.)

Region 3:

At this point, the data no longer fits in L1 cache. So performance is capped by the L1 <-> L2 cache bandwidth.

Region 4:

The performance drop in the single-loop is what we are observing. And as mentioned, this is due to the alignment which (most likely) causes false aliasing stalls in the processor load/store units.

However, in order for false aliasing to occur, there must be a large enough stride between the datasets. This is why you don't see this in region 3.

Region 5:

At this point, nothing fits in cache. So you're bound by memory bandwidth.


2 x Intel X5482 Harpertown @ 3.2 GHz Intel Core i7 870 @ 2.8 GHz Intel Core i7 2600K @ 4.4 GHz

Return type of '?:' (ternary conditional operator)

112 votes

Why does the first return a reference?

int x = 1;
int y = 2;
(x > y ? x : y) = 100;

While the second does not?

int x = 1;
long y = 2;
(x > y ? x : y) = 100;

Actually, the second did not compile at all - "not lvalue left of assignment".

Expressions don't have return types, they have a type and - as it's known in the latest C++ standard - a value category.

A conditional expression can be an lvalue or an rvalue. This is its value category. (This is somewhat of a simplification, in C++11 we have lvalues, xvalues and prvalues.)

In very broad and simple terms, an lvalue refers to an object in memory and an rvalue is just a value that may not necessarily be attached to an object in memory.

An assignment expression assigns a value to an object so the thing being assigned to must be an lvalue.

For a conditional expression (?:) to be an lvalue (again, in broad and simple terms), the second and third operands must be lvalues of the same type. This is because the type and value category of a conditional expression is determined at compile time and must be appropriate whether or not the condition is true. If one of the operands must be converted to a different type to match the other then the conditional expression cannot be an lvalue as the result of this conversion would not be an lvalue.

ISO/IEC 14882:2011 references:

3.10 [basic.lval] Lvalues and rvalues (about value categories)

5.15 [expr.cond] Conditional operator (rules for what type and value category a conditional expression has)

5.17 [expr.ass] Assignment and compound assignment operators (requirement that the l.h.s. of an assignment must be a modifiable lvalue)

What is the easiest way to make a C++ program crash?

95 votes

I'm trying to make a Python program that interfaces with a different crashy process (that's out of my hands). Unfortunately the program I'm interfacing with doesn't even crash reliably! So I want to make a quick C++ program that crashes on purpose but I don't actually know the best and shortest way to do that, does anyone know what to put between my:

int main() {
    crashyCodeGoesHere();
}

to make my C++ program crash reliably

The abort() function is probably your best bet. It's part of the C standard library, and is defined as "causing abnormal program termination" (e.g, a fatal error or crash).

how to achieve 4 flops per cycle

85 votes

How can the theoretical peak performance of 4 floating point operations (double precision) per cycle be achieved on a modern x86-64 Intel cpu?

As far as I understand does it take 3 cycles for an sse add and 5 cycles for a mul to complete on most of the modern Intel cpu's (see e.g. Agner Fog's http://agner.org/optimize/instruction_tables.pdf). Due to pipelining one can get a throughput of 1 add per cycle if the algorithm has at least 3 independent summations. Since that is true for packed addpd as well as the scalar addsd versions and sse registers can contain 2 double's the throughput can be as much as 2 flops per cycle. Furthermore it seems (although I've not seen any proper doc on this) add's and mul's can be executed in parallel giving a theoretical max throughput of 4 flops per cycle.

However, I've not been able to replicate that performance with a simple c/c++ programme. My best attempt results in about 2.7 flops/cycle. If anyone can contribute a simple c/c++ or assembler programme which demonstrates peak performance that'd be greatly appreciated.

My attempt:

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <sys/time.h>

double stoptime(void) {
   struct timeval t;
   gettimeofday(&t,NULL);
   return (double) t.tv_sec + t.tv_usec/1000000.0;
}

double addmul(double add, double mul, int ops){
   // need to initialise differently otherwise compiler might optimise away
   double sum1=0.1, sum2=-0.1, sum3=0.2, sum4=-0.2, sum5=0.0;
   double mul1=1.0, mul2= 1.1, mul3=1.2, mul4= 1.3, mul5=1.4;
   int loops=ops/10;          // we have 10 floating point ops inside the loop
   double expected = 5.0*add*loops + (sum1+sum2+sum3+sum4+sum5)
               + pow(mul,loops)*(mul1+mul2+mul3+mul4+mul5);

   for(int i=0; i<loops; i++) {
      mul1*=mul; mul2*=mul; mul3*=mul; mul4*=mul; mul5*=mul;
      sum1+=add; sum2+=add; sum3+=add; sum4+=add; sum5+=add;
   }
   return  sum1+sum2+sum3+sum4+sum5+mul1+mul2+mul3+mul4+mul5 - expected;
}

int main(int argc, char** argv) {
   if(argc!=2) {
      printf("usage: %s <num>\n", argv[0]);
      printf("number of operations: <num> millions\n");
      exit(EXIT_FAILURE);
   }
   int n=atoi(argv[1])*1000000;
   if(n<=0) n=1000;

   double x=M_PI;
   double y=1.0+1e-8;
   double t=stoptime();
   x=addmul(x,y,n);
   t=stoptime()-t;
   printf("addmul:\t %.3f s, %.3f Gflops, res=%f\n",t,(double)n/t/1e9,x);

   return EXIT_SUCCESS;
}

Compiled with

g++ -O2 -march=native addmul.cpp ; ./a.out 1000

gives on an Intel Core i5-750, 2.66 GHz

addmul:  0.270 s, 3.707 Gflops, res=1.326463

i.e. just about 1.4 flops per cycle. Looking at the assembler code with g++ -S -O2 -march=native -masm=intel addmul.cpp the main loop seems kind of optimal to me:

.L4:
inc eax
mulsd   xmm8, xmm3
mulsd   xmm7, xmm3
mulsd   xmm6, xmm3
mulsd   xmm5, xmm3
mulsd   xmm1, xmm3
addsd   xmm13, xmm2
addsd   xmm12, xmm2
addsd   xmm11, xmm2
addsd   xmm10, xmm2
addsd   xmm9, xmm2
cmp eax, ebx
jne .L4

Changing the scalar versions with packed versions (addpd and mulpd) would double the flop count without changing the execution time and so I'd get just short of 2.8 flops per cycle. Any simple example which achieves 4 flops per cycle?

Edit:

Nice little programme by Mysticial, here are my results (run just for a few seconds though):

  • gcc -O2 -march=nocona: 5.6 Gflops out of 10.66 Gflops (2.1 flops/cycle)
  • cl /O2, openmp removed: 10.1 Gflops out of 10.66 Gflops (3.8 flops/cycle)

It all seems a bit complex but my conclusions so far:

  • gcc -O2 changes the order of independent floating point operations with the aim of alternating addpd and mulpd's if possible. Same applies to gcc-4.6.2 -O2 -march=core2.

  • gcc -O2 -march=nocona seems to keep the order of fp operations as defined in the C++ source.

  • cl /O2, the 64-bit compiler from the SDK for Windows 7 does loop-unrolling automatically and seems to try and arrange operations so that groups of 3 addpd's alternate with 3 mulpd's (well at least on my system and for my simple programme).

  • My Core i5 750 (Nahelem architecture) doesn't like alternating add's and mul's and seems unable to run both ops in parallel. However, if grouped in 3's it suddenly works like magic.

  • Other architectures (possibly Sandy Bridge and others) appear to be able to execute add/mul in parallel without problems if they alternate in the assembly code.

  • Although difficult to admit, but on my system cl /O2 does a much better job at low level optimising operations for my system and achieves close to peak performance for the little c++ example above. I measured between 1.85-2.01 flops/cycle (have used clock() in Windows which is not that precise I guess, need to use a better timer - thanks Mackie Messer).

  • The best I managed with gcc was to manually loop unroll and arrange additions and multiplications in groups of three. With g++ -O2 -march=nocona addmul_unroll.cpp I get at best 0.207s, 4.825 Gflops which corresponds to 1.8 flops/cycle which I'm quite happy with now.

In the c++ code I've replaced the for loop with

   for(int i=0; i<loops/3; i++) {
      mul1*=mul; mul2*=mul; mul3*=mul;
      sum1+=add; sum2+=add; sum3+=add;
      mul4*=mul; mul5*=mul; mul1*=mul;
      sum4+=add; sum5+=add; sum1+=add;

      mul2*=mul; mul3*=mul; mul4*=mul;
      sum2+=add; sum3+=add; sum4+=add;
      mul5*=mul; mul1*=mul; mul2*=mul;
      sum5+=add; sum1+=add; sum2+=add;

      mul3*=mul; mul4*=mul; mul5*=mul;
      sum3+=add; sum4+=add; sum5+=add;
   }

and the assembly now looks like

.L4:
mulsd   xmm8, xmm3
mulsd   xmm7, xmm3
mulsd   xmm6, xmm3
addsd   xmm13, xmm2
addsd   xmm12, xmm2
addsd   xmm11, xmm2
mulsd   xmm5, xmm3
mulsd   xmm1, xmm3
mulsd   xmm8, xmm3
addsd   xmm10, xmm2
addsd   xmm9, xmm2
addsd   xmm13, xmm2
...

I've done this exact task before. But it was mainly to measure power consumption and CPU temperatures. The following code (which is fairly long) achieves close to optimal on my Core i7 2600K.

The key thing to note here is the massive amount of manual loop-unrolling as well as interleaving of multiplies and adds...

Note that this is optimized for x64. x86 doesn't have enough registers for this to compile well.

Warning:

If you decide to compile and run this, pay attention to your CPU temperatures!!! Make sure you don't overheat it. And make sure CPU-throttling doesn't affect your results!

double test_dp_mac_SSE(double x,double y,size_t iterations){
    register __m128d r0,r1,r2,r3,r4,r5,r6,r7,r8,r9,rA,rB,rC,rD,rE,rF;

    //  Generate starting data.
    r0 = _mm_set1_pd(x);
    r1 = _mm_set1_pd(y);

    r8 = _mm_set1_pd(-0.0);

    r2 = _mm_xor_pd(r0,r8);
    r3 = _mm_or_pd(r0,r8);
    r4 = _mm_andnot_pd(r8,r0);
    r5 = _mm_mul_pd(r1,_mm_set1_pd(0.37796447300922722721));
    r6 = _mm_mul_pd(r1,_mm_set1_pd(0.24253562503633297352));
    r7 = _mm_mul_pd(r1,_mm_set1_pd(4.1231056256176605498));
    r8 = _mm_add_pd(r0,_mm_set1_pd(0.37796447300922722721));
    r9 = _mm_add_pd(r1,_mm_set1_pd(0.24253562503633297352));
    rA = _mm_sub_pd(r0,_mm_set1_pd(4.1231056256176605498));
    rB = _mm_sub_pd(r1,_mm_set1_pd(4.1231056256176605498));

    rC = _mm_set1_pd(1.4142135623730950488);
    rD = _mm_set1_pd(1.7320508075688772935);
    rE = _mm_set1_pd(0.57735026918962576451);
    rF = _mm_set1_pd(0.70710678118654752440);

    uint64 iMASK = 0x800fffffffffffffull;
    __m128d MASK = _mm_set1_pd(*(double*)&iMASK);
    __m128d vONE = _mm_set1_pd(1.0);

    size_t c = 0;
    while (c < iterations){
        size_t i = 0;
        while (i < 1000){
            //  Here's the meat - the part that really matters.

            r0 = _mm_mul_pd(r0,rC);
            r1 = _mm_add_pd(r1,rD);
            r2 = _mm_mul_pd(r2,rE);
            r3 = _mm_sub_pd(r3,rF);
            r4 = _mm_mul_pd(r4,rC);
            r5 = _mm_add_pd(r5,rD);
            r6 = _mm_mul_pd(r6,rE);
            r7 = _mm_sub_pd(r7,rF);
            r8 = _mm_mul_pd(r8,rC);
            r9 = _mm_add_pd(r9,rD);
            rA = _mm_mul_pd(rA,rE);
            rB = _mm_sub_pd(rB,rF);

            r0 = _mm_add_pd(r0,rF);
            r1 = _mm_mul_pd(r1,rE);
            r2 = _mm_sub_pd(r2,rD);
            r3 = _mm_mul_pd(r3,rC);
            r4 = _mm_add_pd(r4,rF);
            r5 = _mm_mul_pd(r5,rE);
            r6 = _mm_sub_pd(r6,rD);
            r7 = _mm_mul_pd(r7,rC);
            r8 = _mm_add_pd(r8,rF);
            r9 = _mm_mul_pd(r9,rE);
            rA = _mm_sub_pd(rA,rD);
            rB = _mm_mul_pd(rB,rC);

            r0 = _mm_mul_pd(r0,rC);
            r1 = _mm_add_pd(r1,rD);
            r2 = _mm_mul_pd(r2,rE);
            r3 = _mm_sub_pd(r3,rF);
            r4 = _mm_mul_pd(r4,rC);
            r5 = _mm_add_pd(r5,rD);
            r6 = _mm_mul_pd(r6,rE);
            r7 = _mm_sub_pd(r7,rF);
            r8 = _mm_mul_pd(r8,rC);
            r9 = _mm_add_pd(r9,rD);
            rA = _mm_mul_pd(rA,rE);
            rB = _mm_sub_pd(rB,rF);

            r0 = _mm_add_pd(r0,rF);
            r1 = _mm_mul_pd(r1,rE);
            r2 = _mm_sub_pd(r2,rD);
            r3 = _mm_mul_pd(r3,rC);
            r4 = _mm_add_pd(r4,rF);
            r5 = _mm_mul_pd(r5,rE);
            r6 = _mm_sub_pd(r6,rD);
            r7 = _mm_mul_pd(r7,rC);
            r8 = _mm_add_pd(r8,rF);
            r9 = _mm_mul_pd(r9,rE);
            rA = _mm_sub_pd(rA,rD);
            rB = _mm_mul_pd(rB,rC);

            i++;
        }

        //  Need to renormalize to prevent denormal/overflow.
        r0 = _mm_and_pd(r0,MASK);
        r1 = _mm_and_pd(r1,MASK);
        r2 = _mm_and_pd(r2,MASK);
        r3 = _mm_and_pd(r3,MASK);
        r4 = _mm_and_pd(r4,MASK);
        r5 = _mm_and_pd(r5,MASK);
        r6 = _mm_and_pd(r6,MASK);
        r7 = _mm_and_pd(r7,MASK);
        r8 = _mm_and_pd(r8,MASK);
        r9 = _mm_and_pd(r9,MASK);
        rA = _mm_and_pd(rA,MASK);
        rB = _mm_and_pd(rB,MASK);
        r0 = _mm_or_pd(r0,vONE);
        r1 = _mm_or_pd(r1,vONE);
        r2 = _mm_or_pd(r2,vONE);
        r3 = _mm_or_pd(r3,vONE);
        r4 = _mm_or_pd(r4,vONE);
        r5 = _mm_or_pd(r5,vONE);
        r6 = _mm_or_pd(r6,vONE);
        r7 = _mm_or_pd(r7,vONE);
        r8 = _mm_or_pd(r8,vONE);
        r9 = _mm_or_pd(r9,vONE);
        rA = _mm_or_pd(rA,vONE);
        rB = _mm_or_pd(rB,vONE);

        c++;
    }

    r0 = _mm_add_pd(r0,r1);
    r2 = _mm_add_pd(r2,r3);
    r4 = _mm_add_pd(r4,r5);
    r6 = _mm_add_pd(r6,r7);
    r8 = _mm_add_pd(r8,r9);
    rA = _mm_add_pd(rA,rB);

    r0 = _mm_add_pd(r0,r2);
    r4 = _mm_add_pd(r4,r6);
    r8 = _mm_add_pd(r8,rA);

    r0 = _mm_add_pd(r0,r4);
    r0 = _mm_add_pd(r0,r8);

    //  Prevent Dead Code Elimination
    double out = 0;
    out += ((double*)&r0)[0];
    out += ((double*)&r0)[1];

    return out;
}

void test_dp_mac_SSE(int tds,size_t iterations){

    double *sum = (double*)malloc(tds * sizeof(double));
    wclk start = wclk_now();

#pragma omp parallel num_threads(tds)
    {
        double ret = test_dp_mac_SSE(1.1,2.1,iterations);
        sum[omp_get_thread_num()] = ret;
    }

    double secs = wclk_secs_since(start);
    uint64 ops = 48 * 1000 * iterations * tds * 2;
    cout << "Seconds = " << secs << endl;
    cout << "FP Ops  = " << ops << endl;
    cout << "FLOPs   = " << ops / secs << endl;

    double out = 0;
    int c = 0;
    while (c < tds){
        out += sum[c++];
    }

    cout << "sum = " << out << endl;
    cout << endl;

    free(sum);
}

Output (1 thread, 10000000 iterations) - Compiled with Visual Studio 2010 SP1 - x64 Release:

Seconds = 55.5104
FP Ops  = 960000000000
FLOPs   = 1.7294e+010
sum = 2.22652

The machine is a Core i7 2600K @ 4.4 GHz. Theoretical SSE peak is 4 flops * 4.4 GHz = 17.6 GFlops. This code achieves 17.3 GFlops - not bad.

Output (8 threads, 10000000 iterations) - Compiled with Visual Studio 2010 SP1 - x64 Release:

Seconds = 117.202
FP Ops  = 7680000000000
FLOPs   = 6.55279e+010
sum = 17.8122

Theoretical SSE peak is 4 flops * 4 cores * 4.4 GHz = 70.4 GFlops. Actual is 65.5 GFlops.


Let's take this one step further. AVX...

double test_dp_mac_AVX(double x,double y,size_t iterations){
    register __m256d r0,r1,r2,r3,r4,r5,r6,r7,r8,r9,rA,rB,rC,rD,rE,rF;

    //  Generate starting data.
    r0 = _mm256_set1_pd(x);
    r1 = _mm256_set1_pd(y);

    r8 = _mm256_set1_pd(-0.0);

    r2 = _mm256_xor_pd(r0,r8);
    r3 = _mm256_or_pd(r0,r8);
    r4 = _mm256_andnot_pd(r8,r0);
    r5 = _mm256_mul_pd(r1,_mm256_set1_pd(0.37796447300922722721));
    r6 = _mm256_mul_pd(r1,_mm256_set1_pd(0.24253562503633297352));
    r7 = _mm256_mul_pd(r1,_mm256_set1_pd(4.1231056256176605498));
    r8 = _mm256_add_pd(r0,_mm256_set1_pd(0.37796447300922722721));
    r9 = _mm256_add_pd(r1,_mm256_set1_pd(0.24253562503633297352));
    rA = _mm256_sub_pd(r0,_mm256_set1_pd(4.1231056256176605498));
    rB = _mm256_sub_pd(r1,_mm256_set1_pd(4.1231056256176605498));

    rC = _mm256_set1_pd(1.4142135623730950488);
    rD = _mm256_set1_pd(1.7320508075688772935);
    rE = _mm256_set1_pd(0.57735026918962576451);
    rF = _mm256_set1_pd(0.70710678118654752440);

    uint64 iMASK = 0x800fffffffffffffull;
    __m256d MASK = _mm256_set1_pd(*(double*)&iMASK);
    __m256d vONE = _mm256_set1_pd(1.0);

    size_t c = 0;
    while (c < iterations){
        size_t i = 0;
        while (i < 1000){
            //  Here's the meat - the part that really matters.

            r0 = _mm256_mul_pd(r0,rC);
            r1 = _mm256_add_pd(r1,rD);
            r2 = _mm256_mul_pd(r2,rE);
            r3 = _mm256_sub_pd(r3,rF);
            r4 = _mm256_mul_pd(r4,rC);
            r5 = _mm256_add_pd(r5,rD);
            r6 = _mm256_mul_pd(r6,rE);
            r7 = _mm256_sub_pd(r7,rF);
            r8 = _mm256_mul_pd(r8,rC);
            r9 = _mm256_add_pd(r9,rD);
            rA = _mm256_mul_pd(rA,rE);
            rB = _mm256_sub_pd(rB,rF);

            r0 = _mm256_add_pd(r0,rF);
            r1 = _mm256_mul_pd(r1,rE);
            r2 = _mm256_sub_pd(r2,rD);
            r3 = _mm256_mul_pd(r3,rC);
            r4 = _mm256_add_pd(r4,rF);
            r5 = _mm256_mul_pd(r5,rE);
            r6 = _mm256_sub_pd(r6,rD);
            r7 = _mm256_mul_pd(r7,rC);
            r8 = _mm256_add_pd(r8,rF);
            r9 = _mm256_mul_pd(r9,rE);
            rA = _mm256_sub_pd(rA,rD);
            rB = _mm256_mul_pd(rB,rC);

            r0 = _mm256_mul_pd(r0,rC);
            r1 = _mm256_add_pd(r1,rD);
            r2 = _mm256_mul_pd(r2,rE);
            r3 = _mm256_sub_pd(r3,rF);
            r4 = _mm256_mul_pd(r4,rC);
            r5 = _mm256_add_pd(r5,rD);
            r6 = _mm256_mul_pd(r6,rE);
            r7 = _mm256_sub_pd(r7,rF);
            r8 = _mm256_mul_pd(r8,rC);
            r9 = _mm256_add_pd(r9,rD);
            rA = _mm256_mul_pd(rA,rE);
            rB = _mm256_sub_pd(rB,rF);

            r0 = _mm256_add_pd(r0,rF);
            r1 = _mm256_mul_pd(r1,rE);
            r2 = _mm256_sub_pd(r2,rD);
            r3 = _mm256_mul_pd(r3,rC);
            r4 = _mm256_add_pd(r4,rF);
            r5 = _mm256_mul_pd(r5,rE);
            r6 = _mm256_sub_pd(r6,rD);
            r7 = _mm256_mul_pd(r7,rC);
            r8 = _mm256_add_pd(r8,rF);
            r9 = _mm256_mul_pd(r9,rE);
            rA = _mm256_sub_pd(rA,rD);
            rB = _mm256_mul_pd(rB,rC);

            i++;
        }

        //  Need to renormalize to prevent denormal/overflow.
        r0 = _mm256_and_pd(r0,MASK);
        r1 = _mm256_and_pd(r1,MASK);
        r2 = _mm256_and_pd(r2,MASK);
        r3 = _mm256_and_pd(r3,MASK);
        r4 = _mm256_and_pd(r4,MASK);
        r5 = _mm256_and_pd(r5,MASK);
        r6 = _mm256_and_pd(r6,MASK);
        r7 = _mm256_and_pd(r7,MASK);
        r8 = _mm256_and_pd(r8,MASK);
        r9 = _mm256_and_pd(r9,MASK);
        rA = _mm256_and_pd(rA,MASK);
        rB = _mm256_and_pd(rB,MASK);
        r0 = _mm256_or_pd(r0,vONE);
        r1 = _mm256_or_pd(r1,vONE);
        r2 = _mm256_or_pd(r2,vONE);
        r3 = _mm256_or_pd(r3,vONE);
        r4 = _mm256_or_pd(r4,vONE);
        r5 = _mm256_or_pd(r5,vONE);
        r6 = _mm256_or_pd(r6,vONE);
        r7 = _mm256_or_pd(r7,vONE);
        r8 = _mm256_or_pd(r8,vONE);
        r9 = _mm256_or_pd(r9,vONE);
        rA = _mm256_or_pd(rA,vONE);
        rB = _mm256_or_pd(rB,vONE);

        c++;
    }

    r0 = _mm256_add_pd(r0,r1);
    r2 = _mm256_add_pd(r2,r3);
    r4 = _mm256_add_pd(r4,r5);
    r6 = _mm256_add_pd(r6,r7);
    r8 = _mm256_add_pd(r8,r9);
    rA = _mm256_add_pd(rA,rB);

    r0 = _mm256_add_pd(r0,r2);
    r4 = _mm256_add_pd(r4,r6);
    r8 = _mm256_add_pd(r8,rA);

    r0 = _mm256_add_pd(r0,r4);
    r0 = _mm256_add_pd(r0,r8);

    //  Prevent Dead Code Elimination
    double out = 0;
    out += ((double*)&r0)[0];
    out += ((double*)&r0)[1];
    out += ((double*)&r0)[2];
    out += ((double*)&r0)[3];

    return out;
}

void test_dp_mac_AVX(int tds,size_t iterations){

    double *sum = (double*)malloc(tds * sizeof(double));
    wclk start = wclk_now();

#pragma omp parallel num_threads(tds)
    {
        double ret = test_dp_mac_AVX(1.1,2.1,iterations);
        sum[omp_get_thread_num()] = ret;
    }

    double secs = wclk_secs_since(start);
    uint64 ops = 48 * 1000 * iterations * tds * 4;
    cout << "Seconds = " << secs << endl;
    cout << "FP Ops  = " << ops << endl;
    cout << "FLOPs   = " << ops / secs << endl;

    double out = 0;
    int c = 0;
    while (c < tds){
        out += sum[c++];
    }

    cout << "sum = " << out << endl;
    cout << endl;

    free(sum);
}

Output (1 thread, 10000000 iterations) - Compiled with Visual Studio 2010 SP1 - x64 Release:

Seconds = 57.4679
FP Ops  = 1920000000000
FLOPs   = 3.34099e+010
sum = 4.45305

Theoretical AVX peak is 8 flops * 4.4 GHz = 35.2 GFlops. Actual is 33.4 GFlops.

Output (8 threads, 10000000 iterations) - Compiled with Visual Studio 2010 SP1 - x64 Release:

Seconds = 111.119
FP Ops  = 15360000000000
FLOPs   = 1.3823e+011
sum = 35.6244

Theoretical AVX peak is 8 flops * 4 cores * 4.4 GHz = 140.8 GFlops. Actual is 138.2 GFlops.


Now for some explanations:

The performance critical part is obviously the 48 instructions inside the inner loop. You'll notice that it's broken into 4 blocks of 12 instructions each. Each of these 12 instructions blocks are completely independent from each other - and take on average 6 cycles to execute.

So there's 12 instructions and 6 cycles between issue-to-use. The latency of multiplication is 5 cycles, so it's just enough to avoid latency stalls.

The normalization step is needed to keep the data from over/underflowing. This is needed since the do-nothing code will slowly increase/decrease the magnitude of the data.

So it's actually possible to do better than this if you just use all zeros and get rid of the normalization step. However, since I wrote the benchmark to measure power consumption and temperature, I had to make sure the flops were on "real" data, rather than zeros - as the execution units may very well have special case-handling for zeros that use less power and produce less heat.


More Results:

  • Intel Core i7 920 @ 3.5 GHz
  • Windows 7 Ultimate x64
  • Visual Studio 2010 SP1 - x64 Release

Threads: 1

Seconds = 72.1116
FP Ops  = 960000000000
FLOPs   = 1.33127e+010
sum = 2.22652

Theoretical SSE Peak: 4 flops * 3.5 GHz = 14.0 GFlops. Actual is 13.3 GFlops.

Threads: 8

Seconds = 149.576
FP Ops  = 7680000000000
FLOPs   = 5.13452e+010
sum = 17.8122

Theoretical SSE Peak: 4 flops * 4 cores * 3.5 GHz = 56.0 GFlops. Actual is 51.3 GFlops.

My processor temps hit 76C on the multi-threaded run! If you runs these, be sure the results aren't affected by CPU throttling.


  • 2 x Intel Xeon X5482 Harpertown @ 3.2 GHz
  • Ubuntu Linux 10 x64
  • GCC 4.5.2 x64 - (-O2 -msse3 -fopenmp)

Threads: 1

Seconds = 78.3357
FP Ops  = 960000000000
FLOPs   = 1.22549e+10
sum = 2.22652

Theoretical SSE Peak: 4 flops * 3.2 GHz = 12.8 GFlops. Actual is 12.3 GFlops.

Threads: 8

Seconds = 78.4733
FP Ops  = 7680000000000
FLOPs   = 9.78676e+10
sum = 17.8122

Theoretical SSE Peak: 4 flops * 8 cores * 3.2 GHz = 102.4 GFlops. Actual is 97.9 GFlops.

What's wrong with this 1988 C code?

Asked on Tue, 27 Dec 2011 by César c
76 votes

I'm trying to compile this piece of code from the book "The C Programming Language" (K & R). It is a bare-bones version of the UNIX program wc:

#include <stdio.h>

#define IN   1;     /* inside a word */
#define OUT  0;     /* outside a word */

/* count lines, words and characters in input */
main() {
    int c, nl, nw, nc, state;

    state = OUT;
    nl = nw = nc = 0;
    while ((c = getchar()) != EOF) {
        ++nc;
        if (c == '\n')
            ++nl;
        if (c == ' ' || c == '\n' || c == '\t')
            state = OUT;
        else if (state == OUT) {
            state = IN;
            ++nw;
        }
    }
    printf("%d %d %d\n", nl, nw, nc);
}

And I'm getting the following error:

$ gcc wc.c 
wc.c: In function ‘main’:
wc.c:18: error: ‘else’ without a previous ‘if’
wc.c:18: error: expected ‘)’ before ‘;’ token

The 2nd edition of this book is from 1988 and I'm pretty new to C. Maybe it has to do with the compiler version or maybe I'm just talking nonsense.

I've seen in modern C code a different use of the main function:

int main() {
    /* code */
    return 0;
}

Is this a new standard or can I still use a type-less main?

Your problem is with your preprocessor definitions of IN and OUT:

#define IN   1;     /* inside a word */
#define OUT  0;     /* outside a word */

Notice how you have a trailing semicolon in each of these. When the preprocessor expands them, your code will look roughly like:

    if (c == ' ' || c == '\n' || c == '\t')
        state = 0;; /* <--PROBLEM #1 */
    else if (state == 0;) { /* <--PROBLEM #2 */
        state = 1;;

That second semicolon causes the else to have no previous if as a match, because you are not using braces. So, remove the semicolons from the preprocessor definitions of IN and OUT.

The lesson learned here is that preprocessor statements do not have to end with a semicolon.

Also, you should always use braces!

    if (c == ' ' || c == '\n' || c == '\t') {
        state = OUT;
    } else if (state == OUT) {
        state = IN;
        ++nw;
    }

There is no hanging-else ambiguity in the above code.

Does inverting the "if" improve performance?

75 votes

I've been using ReSharper for a while now and sometimes it suggests that I invert the if. I guess an example would be a better explanation of my situation:

public void myfunction(int exampleParam){
    if(exampleParam > 0){
        //Do something with form controls for example.
    }
}

Now ReSharper suggests that I invert the if to this:

public void myfunction(int exampleParam){
    if(exampleParam <= 0)
        return;
    //Do something with form controls for example
}

Does this "improve" performance somehow? Or is it just aesthetic?

It is not only aesthetic, but it also reduces the maximum nesting level inside the method. This is generally regarded as a plus because it makes methods easier to understand (and indeed, many static analysis tools provide a measure of this as one of the indicators of code quality).

On the other hand, it also makes your method have multiple exit points, something that another group of people believes is a no-no.

Personally, I agree with ReSharper and the first group (in a language that has exceptions I find it silly to discuss "multiple exit points"; almost anything can throw, so there are numerous potential exit points in all methods).

Regarding performance: both versions should be equivalent (if not at the IL level, then certainly after the jitter is through with the code) in every language. Theoretically this depends on the compiler, but practically any widely used compiler of today is capable of handling much more advanced cases of code optimization than this.

Infinite loops in Java

61 votes

Look at the following infinite while loop in Java. It causes a compile-time error for the statement below it.

while(true)
{
     System.out.println("inside while");
}

System.out.println("while terminated"); //Unreachable statement - compiler-error.

The following same infinite while loop, however works fine and doesn't issue any errors in which I just replaced the condition with a boolean variable.

boolean b=true;

while(b)
{
     System.out.println("inside while");
}

System.out.println("while terminated"); //No error here.

In the second case also, the statement after the loop is obviously unreachable because the boolean variable b is true still the compiler doesn't complain at all. Why?


Edit : The following version of while gets stuck into an infinite loop as obvious but issues no compiler errors for the statement below it even though the if condition within the loop is always false and consequently, the loop can never return and can be determined by the compiler at the compile-time itself.

while(true)
{
    if(false)
    {
        break;
    }

    System.out.println("inside while");
}

System.out.println("while terminated"); //No error here.

while(true)
{
    if(false)  //if true then also
    {
        return;  //Replacing return with break fixes the following error.
    }

    System.out.println("inside while");
}

System.out.println("while terminated"); //Compiler-error - unreachable statement.

while(true)
{
    if(true)
    {
        System.out.println("inside if");
        return;
    }

    System.out.println("inside while"); //No error here.
}

System.out.println("while terminated"); //Compiler-error - unreachable statement.

Edit : Same thing with if and while.

if(false)
{
    System.out.println("inside if"); //No error here.
}

while(false)
{
    System.out.println("inside while"); //Compiler's complain - unreachable statement.
}

while(true)
{
    if(true)
    {
        System.out.println("inside if");
        break;
    }

    System.out.println("inside while"); //No error here.
}      

The following version of while also gets stuck into an infinite loop.

while(true)
{
    try
    {
        System.out.println("inside while");
        return;   //Replacing return with break makes no difference here.
    }
    finally
    {
        continue;
    }
}

This is because the finally block is always executed even though the return statement encounters before it within the try block itself.

The compiler can easily and unequivocally prove that the first expression always results in an infinite loop, but it's not as easy for the second. In your toy example it's simple, but what if:

  • the variable's contents were read from a file?
  • the variable wasn't local and could be modified by another thread?
  • the variable relied on some user input?

The compiler is clearly not checking for your simpler case because it's forgoing that road altogether. Why? Because it's much harder forbidden by the spec. See section 14.20:

(By the way, my compiler does complain when the variable is declared final.)

Anagram of a palindrome

49 votes

I just had a technical job interview and in this process I had to solve this problem:

A string is a palindrome if it has exactly the same sequence of characters when read left-to-right as it has when read right-to-left. For example, the following strings are palindromes:

"kayak",
"Rats live on no evil star",
"Able was I ere I saw Elba.

A string A is an anagram of a string B if A can be obtained from B by rearranging the characters. For example, in each of the following pairs one string is an anagram of the other:

"mary" and "army",
"rocketboys" and "octobersky",

Write a function:

class Solution { public int isAnagramOfPalindrome(String S); }

that, given a non-empty string S consisting of N characters, returns 1 if S is an anagram of some palindrome and returns 0 otherwise.

Assume that:

N is an integer within the range [1..100,000];
string S consists only of English lower-case letters (a-z).

For example, given S = "dooernedeevrvn", the function should return 1, because "dooernedeevrvn" is an anagram of the palindrome "neveroddoreven". Given S = "aabcba", the function should return 0.

Complexity:

expected worst-case time complexity is O(N);
expected worst-case space complexity is O(1) (not counting the storage required for input arguments).

I totally failed the interview. I was not ready for those kind of questions but I still want to know the solution. My idea was to create 2 utility functions isAnagram which checks if String A is a anagram of String B and createPalindrom which creates a palindrom from a given String :

public class Solution {
      public int isAnagramofPalindrom(String s){
          if(isAnagram(s, createPalindrom(s))) return 1;
          return 0;     
      }

      public boolean isAnagram(String A, String B){
             boolean yes = false ;
             for( int i = 0; i < A.lenth(); i++){
                yes = yes && exist(A.chartAt(i), B); //I know this is not optimized
             }
          return yes;    
      }

      public boolean exist(char c, String b){
        boolean found = true;
        int i = 0;
        while( found = (b.chartAt(i) != c)){
          i++;
        }
        return found;
      }

      public String createPalindrom(String s){
        //?????
      }


   } 

My problem was how to create the palindrom. Maybe, I get this wrong or maybe I had a bad approach of this problem. What is your point and how should I solve this problem?

Thx for your advice

It's quite simple actually. If string has an even number of chars, just verify that every char in the string is present an even number of times, if string is odd, you're allowed one (and only one) unique char odd number of times (for the middle of the palindrome), rest of the characters should be even.

Why would adding a method add an ambiguous call, if it wouldn't be involved in the ambiguity

45 votes

I have this class

public class Overloaded
{
    public void ComplexOverloadResolution(params string[] something)
    {
        Console.WriteLine("Normal Winner");
    }

    public void ComplexOverloadResolution<M>(M something)
    {
        Console.WriteLine("Confused");
    }
}

If I call it like this:

        var blah = new Overloaded();
        blah.ComplexOverloadResolution("Which wins?");

It writes Normal Winner to the console.

But, if I add another method:

    public void ComplexOverloadResolution(string something, object somethingElse = null)
    {
        Console.WriteLine("Added Later");
    }

I get the following error:

The call is ambiguous between the following methods or properties: > 'Overloaded.ComplexOverloadResolution(params string[])' and 'Overloaded.ComplexOverloadResolution<string>(string)'

I can understand that adding a method might introduce a call ambiguity, but it's an ambiguity between the two methods that already existed (params string[]) and <string>(string)! Clearly neither of the two methods involved in the ambiguity is the newly added method, because the first is a params, and the second is a generic.

Is this a bug? What part of the spec says that this should be the case?

Is this a bug?

Yes.

Congratulations, you have found a bug in overload resolution. I can reproduce the bug with C# 4; it does not reproduce in Roslyn. It possibly reproduces in C# 5.

I am about to head out for the rest of the year; I will bring it to the attention of the C# 5 testers in January and we'll see if it reproduces in C# 5 and if we can get it fixed.

A correct analysis follows. The candidates are:

0: C(params string[]) in its normal form
1: C(params string[]) in its expanded form
2: C<string>(string) 
3: C(string, object) 

Candidate zero is obviously inapplicable because string is not convertible to string[]. That leaves three.

Of the three, we must determine a unique best method. We do so by making pairwise comparisons of the three remaining candidates. There are three such pairs. All of them have identical parameter lists once we strip off the omitted optional parameters, which means that we have to go to the advanced tiebreaking round described in section 7.5.3.2 of the specification.

Which is better, 1 or 2? The relevant tiebreaker is that a generic method is always worse than a non-generic method. 2 is worse than 1. So 2 cannot be the winner.

Which is better, 1 or 3? The relevant tiebreaker is: a method applicable only in its expanded form is always worse than a method applicable in its normal form. Therefore 1 is worse than 3. So 1 cannot be the winner.

Which is better, 2 or 3? The relevant tiebreaker is that a generic method is always worse than a non-generic method. 2 is worse than 3. So 2 cannot be the winner.

To be chosen from a set of multiple applicable candidates a candidate must be (1) unbeaten, (2) beat at least one other candidate, and (3) be the unique candidate that has the first two properties. Candidate three is beaten by no other candidate, and beats at least one other candidate; it is the only candidate with this property. Therefore candidate three is the unique best candidate. It should win.

Not only is the C# 4 compiler getting it wrong, as you correctly note it is reporting a bizarre error message. That the compiler is getting the overload resolution analysis wrong is a little bit surprising. That it is getting the error message wrong is completely unsurprising; the "ambiguous method" error heuristic basically picks any two methods from the candidate set if a best method cannot be determined. It is not very good at finding the "real" ambiguity, if in fact there is one.

One might reasonably ask why that is. It is quite tricky to find two methods that are "unambigously ambiguous" because the "betterness" relation is intransitive. It is possible to come up with situations where candidate 1 is better than 2, 2 is better than 3, and 3 is better than 1. In such situations we cannot do better than picking two of them as "the ambiguous ones".

I would like to improve this heuristic for Roslyn but it is a low priority.

(Exercise to the reader: "Devise a linear-time algorithm to identify the unique best member of a set of n elements where the betterness relation is intransitive" was one of the questions I was asked the day I interviewed for this team. It's not a very hard algorithm; give it a shot.)

One of the reasons why we pushed back on adding optional arguments to C# for so long was the number of complex ambiguous situations it introduces into the overload resolution algorithm; apparently we did not get it right.

If you would like to enter a Connect issue to track it, feel free. If you just want it brought to our attention, consider it done. I'll follow up with testing next year.

Thanks for bringing this to my attention. Apologies for the error.

Enhanced FOR loops in C++

44 votes

I am switching from Java to C++ and I was wondering whether C++ contains the enhanced for loops that I used in java, in example:

int[] numbers = {1,2,3,4,5,6,7,8,9,10};
for (int item : numbers) {
  System.out.println("Count is: " + item);
}

Is this same "shortcut" possible in C++?

In C++11, if your compiler supports it, yes it is. It's called range-based for.

std::vector<int> v;

// fill vector

for (const int& i : v) { std::cout << i << "\n"; }

It works for C style arrays and any type that has functions begin() and end() that return iterators. Example:

class test {
    int* array;
    size_t size;
public:
    test(size_t n) : array(new int[n]), size(n)
    {
        for (int i = 0; i < n; i++) { array[i] = i; }
    }
    ~test() { delete [] array; }
    int* begin() { return array; }
    int* end() { return array + size; }
};

int main()
{
    test T(10);
    for (auto& i : T) {
        std::cout << i;   // prints 0123456789
    }
}

Any reason to write the "private" keyword in C#?

43 votes

As far as I know, private is the default everywhere in C# (meaning that if I don't write public, protected, internal, etc. it will be private by default). (Please correct me if I am wrong.)

So, what's the reason to write that keyword, or why does it even exist for members?

For example, when an event handler is auto-generated it looks like this:

private void RatTrap_MouseEnter(object sender, CheeseEventArgs e)
{

}

But why does it even write private if that's implied and default? Just so that novice developers (who don't know it's the C# default) know that it's private? Or is there a difference for the compiler?

Moreover, is there a case where writing "private" (alone) will change the accessibility of the member?

AFAIK, private is the default everywhere in C# (meaning that if I don't write public, protected, internal, etc. it will be private by default). (please correct me if wrong).

This is not true. Types defined within a namespace (classes, structs, interfaces, etc) will be internal by default. Also, members within different types have different default accessibilities (such as public for interface members). For details, see Accessibility Levels on MSDN.

Also,

So, what's the reason to write that keyword, or why does it even exist?

Specifying this explicitly helps denote your intention to make the type private, very explicitly. This helps with maintainability of your code over time. This can help with other developers (or yourself) knowing whether a member is private by default or on purpose, etc.

Is it a good practice to have logger as a singleton?

40 votes

I had a habit to pass logger to constructor, like:

public class OrderService : IOrderService {
     public OrderService(ILogger logger) {
     }
}

But that is quite annoying, so I've used it a property this for some time:

private ILogger logger = NullLogger.Instance;
public ILogger Logger
{
    get { return logger; }
    set { logger = value; }
}

This is getting annoying too - it is not dry, I need to repeat this in every class. I could use base class, but then again - I'm using Form class, so would need FormBase, etc. So I think, what would be downside of having singleton with ILogger exposed, so veryone would know where to get logger:

    Infrastructure.Logger.Info("blabla");

UPDATE: As Merlyn correctly noticed, I've should mention, that in first and second examples I am using DI.

This is getting annoying too - it is not DRY

That's true. But there is only so much you can do for a cross-cutting concern that pervades every type you have. You have to use the logger everywhere, so you must have the property on those types.

So lets see what we can do about it.

Singleton

Singletons are terrible <flame-suit-on>.

I recommend sticking with property injection as you've done with your second example. This is the best factoring you can do without resorting to magic. It is better to have an explicit dependency than to hide it via a singleton.

But if singletons save you significant time, including all refactoring you will ever have to do (crystal ball time!), I suppose you might be able to live with them. If ever there were a use for a Singleton, this might be it. Keep in mind the cost if you ever want to change your mind will be about as high as it gets.

If you do this, check out other people's answers using the Registry pattern (see the description), and those registering a (resetable) singleton factory rather than a singleton logger instance.

There are other alternatives that might work just as well without as much compromise, so you should check them out first.

Visual Studio code snippets

You could use Visual Studio code snippets to speed up the entrance of that repetitive code. You will be able to type something like loggertab, and the code will magically appear for you.

Using AOP to DRY off

You could eliminate a little bit of that property injection code by using an Aspect Oriented Programming (AOP) framework like PostSharp to auto-generate some of it.

It might look something like this when you're done:

[InjectedLogger]
public ILogger Logger { get; set; }

You could also use their method tracing sample code to automatically trace method entrance and exit code, which might eliminate the need to add some of the logger properties all together. You could apply the attribute at a class level, or namespace wide:

[Trace]
public class MyClass
{
    // ...
}

// or

#if DEBUG
[assembly: Trace( AttributeTargetTypes = "MyNamespace.*",
    AttributeTargetTypeAttributes = MulticastAttributes.Public,
    AttributeTargetMemberAttributes = MulticastAttributes.Public )]
#endif

Why does the Double.valueof javadoc say it caches values, when it doesn't?

39 votes

In OpenJDK, for the method:

public static Double valueOf(double d)

The javadoc says:

Returns a Double instance representing the specified double value. If a new Double instance is not required, this method should generally be used in preference to the constructor Double(double), as this method is likely to yield significantly better space and time performance by caching frequently requested values.

Here's the actual code:

public static Double valueOf(double d) {
    return new Double(d);
}

The cache is a lie! What's going on here?

The method exists for many types: Integer, Long, BigDecimal and others and the documentation is always the same: Under some circumstances (which aren't defined), the method can return the same result.

AFAIK, the caching is only implemented for integer types and it returns cached instances for values between -128 and 127 (most common values). For BigDecimal, the cache currently works for values from 0 to 10.

Later versions of Java might extend this behavior to other values/more types. So it's smart to use this code today because it might make your code faster tomorrow (and the code won't be slower today).

The Java compiler, for example, uses this API when generating code for autoboxing.

Java merge 2 collections in O(1)

33 votes

I need to be able to merge 2 large collections into 1. Which collection type can I use best? I don't need random access to the individual elements. Usually I'd go for a linkedlist, however I can't merge 2 linkedlist in Java with a runtime of O(1), which could be done in many other languages, since I'll have to copy each element to the new list.

Edit: Thank you for all your answers. Your answers were all very helpful, and I managed to get the job done. Next time I will use my own implementation of a linked list to begin with.

You can create a concatenated Iterable view in O(1) using one of Guava's Iterables.concat methods:

Iterable<T> combined = Iterables.concat(list1, list2);

This will allow you to iterate over all the elements of both lists as one object without copying any elements.

Why can attributes in Java be public?

33 votes

As everybody knows, Java follows the paradigms of object orientation, where data encapsulation says, that fields (attributes) of an object should be hidden for the outer world and only accessed via methods or that methods are the only interface of the class for the outer world. So why is it possible to declare a field in Java as public, which would be against the data encapsulation paradigm?

I think it's possible because every rule has its exception, every best practice can be overridden in certain cases.

For example, I often expose public static final data members as public (e.g., constants). I don't think it's harmful.

I'll point out that this situation is true in other languages besides Java: C++, C#, etc.

Languages need not always protect us from ourselves.

In Oli's example, what's the harm if I write it this way?

public class Point {
   public final int x;
   public final int y;

   public Point(int p, int q) {
      this.x = p;
      this.y = q;
   } 
}

It's immutable and thread safe. The data members might be public, but you can't hurt them.

Besides, it's a dirty little secret that "private" isn't really private in Java. You can always use reflection to get around it.

So relax. It's not so bad.

Why does 'main' not return 0 here?

33 votes

I was just reading ISO/IEC 9899:201x Committee Draft — April 12, 2011 in which I found under 5.1.2.2.3 Program termination:

..reaching the } that terminates the main function returns a value of 0.

It means if you don't specify any return statement in main(), and if the program runs successfully, then at the closing brace, }, of main() will return 0.

But in the following code I don't specify any return statement, yet it does not return 0.

#include<stdio.h>
int sum(int a,int b)
{
    return (a + b);
}

int main()
{
    int a=10;
    int b=5;
    int ans;
    ans=sum(a,b);
    printf("sum is %d",ans);
}

Compile:

gcc test.c
./a.out
sum is 15
echo $?
9          // Here it should be 0, but it shows 9. Why?

That rule was added in the 1999 version of the C standard. In C90, the status returned is undefined.

You can enable it by passing -std=c99 to gcc.

As a side note, interestingly 9 is returned because it's the return of printf which just wrote 9 characters.

Should I use an empty property key?

32 votes

I've tested this only in Firefox, but apparently you can use an empty string as a key to a property in an object. For example, see the first property here:

var countsByStatus = { 
  "": 23, //unknown status
  "started": 45,
  "draft": 3,
  "accepted": 23,
  "hold": 2345,
  "fixed": 2,
  "published": 345
}

In skimming through the EcmaScript specs, it appears that (at least in 5), property keys are defined as strings, and strings as 0 or more characters. This implies that an empty string is a valid property name according to the specs.

Anyway, I'm tempted to use this in a section of code where I'm calculating summaries of some counts by the status of a data item (similar to what I've shown above). There are some items which might not have a status, and I need a placeholder for those. Since statuses are user-definable, I don't want to risk using a dummy word that might conflict.

It seems so simple and elegant, in looking at the data I can easily tell what the blank string would mean. It also makes the code a little bit more efficient, since the empty string would be the exact value of the status in the items without a status.

But at the same time, my instincts are telling me that something is wrong with it. I mean, apart from the chance that some browser might not support this, I feel like I've encountered a bug in JavaScript that will be fixed some day. But, at the same time, that's the same feeling I once had about a lot of other JavaScript features that I now use every day (such as the time I discovered that && and || returns the value of one of the operands, not just true or false).

An object's key must be a string, and the empty string ('') is a string. There is no cross browser issue that I've ever come across with empty strings, although there have been very few occasions where I thought it was acceptable to use an empty string as a key name.

I would discourage the general usage of '' as a key, but for a simple lookup, it'll work just fine, and sounds reasonable. It's a good place to add a comment noting the exceptional circumstance.

Additionally, during lookup you may have issues with values that are cast to a string:

o = {...} //some object
foo = 'bar';

//some examples
o[foo] //will return o['bar']
o[null] //will return o['null']
o[undefined] //will return o['undefined']

If you'd like to have null and undefined use the '' key, you may need to use a fallback:

key = key || '';

If you might have non-string values passed in, it's important to cast too:

key = key || '';
key = '' + key;

note that a value of 0 will turn into '', whereas a value of '0' will stay '0'.


In most cases, I find I'm picking a pre-defined value out of a hashtable object. To check that the value exists on the object there are a number of options:

if (o[key]) {...} //will be falsey if the value is falsey
if (key in o) {...} //will return true for functions as well as property values
if (o.hasOwnProperty(key)) {...} //returns true only for property values

How can I get what my main function has returned?

31 votes

In a C program if we want to give some input from terminal then we can give it by:

int main(int argc, char *argv[])

In the same way, if we want to get return value of main() function then how can we get it?

In each main() we write return 1 or return 0; how can I know what my main() has returned at terminal?

Edit:1

i get it that by echo $? we can get the return value of main() but it only allows to return value less then 125 (in linux) successfully, return value more then that can not be be succesfully received by $ variable so

why return type of main() is int ? why dont keep it short int ?

Edit2

from where i can finout the meaning of error code if main() has return more then 125 value ?

Most shells store the exit code of the previous run command in $? so you can store or display it.

$ ./a.out
$ echo $?     # note - after this command $? contains the exit code of echo!

or

$ ./a.out
$ exit_code=$?    # save the exit code in another shell variable.

Note that under linux, although you return an int, generally only values less than 126 are safe to use. Higher values are reserved to record other errors that might occur when attempting to run a command or to record which signal, if any, terminated your program.

How did Firefox optimize this loop?

31 votes

Firefox 9.0.1 surprised me by showing up my Ω(log n) number-padding algorithm with a Ω(n) loop method when n is small. In every other browser I've seen, the loop is slower, even for small values of n. I know all the browsers are working on optimizing JS, but since all the other, modern browsers are showing the loop to be slower, is there any explanation for the behavior in Firefox 9?

// Ω(log n)
function padNumberMath(number, length) {
    var N = Math.pow(10, length);
    return number < N ? ("" + (N + number)).slice(1) : "" + number
}

// Ω(n):
function padNumberLoop(number, length) {
    var my_string = '' + number;
    while (my_string.length < length) {
        my_string = '0' + my_string;
    }
    return my_string;
}

Update: I don't think this is related to the original question, but I just discovered that IE 9 switches behavior when switching from 32- to 64-bit modes. In 32-bit mode, the Math method wins. In 64-bit mode, the Loop method wins. Just thought I should point that out.

Update 2: MAK caught me in his comment below. The math method's not Ω(1), it's probably more like Ω(log n).

Looking at the results its pretty clear that Firefox didn't do anything to achieve a performance gain.

browserscope

These bars can be read as "speeds" (operations/sec) so bigger bars are better. Everything is to scale.

In Firefox 9, its very clear the "Math" method performs abysmally, while there is little change in "Loop" method between versions.

So there was no "optimization" of any sort in Firefox 9. All that happened between Firefox 8 and 9 with regard to these tests is somehow their math library got slower (Math.pow being slow), or their string library got slower (.slice() being slow).

Looking into this further, it's clear somehow these elementary operations got a bit slower in ff9:

ff8 vs ff9

Both concatenation and Math.pow are a bit slower in FF 9 than in FF 8, which may account for the difference you see in your tests.

Interestingly, the new no-op bar is much longer in FF8 than FF9.

C# Pre- & Post Increment confusions

29 votes

I am a little confused about how the C# compiler handles pre- and post increments and decrements...

When i code the following:

int x = 4;
x = x++ + ++x;

x will have the value 10 afterwards. I think this is because the pre-increment sets x to 5, which makes it 5+5 which evaluates to 10. Then the post-increment will update x to 6, but this value will not be used because then 10 will be assigned to x.

But when i code:

int x = 4;
x = x-- - --x;

then x will be 2 afterwards. Can anyone explain why this is the case?

Thanks :-)

x-- will be 4, but will be 3 at the moment of --x, so it will end being 2, then you'll have

x = 4 - 2

btw, your first case will be x = 4 + 6

Here is a small example that will print out the values for each part, maybe this way you'll understand it better:

static void Main(string[] args)
{
    int x = 4;
    Console.WriteLine("x++: {0}", x++); //after this statement x = 5
    Console.WriteLine("x: {0}", ++x); 

    int y = 4;
    Console.WriteLine("y--: {0}", y--); //after this statement y = 3
    Console.WriteLine("--y: {0}", --y);

    Console.ReadKey();
}

this prints out

x++: 4
x: 6
y--: 4
--y: 2