Best c questions in December 2011

Why is one loop so much slower than two loops?

381 votes

Suppose a1, b1, c1, and d1 point to heap memory and my numerical code has the following core loop.

const int n=100000

for(int j=0;j<n;j++){
    a1[j] += b1[j];
    c1[j] += d1[j];
}

This loop is executed 10,000 times via another outer for loop. To speed it up, I changed the code to:

for(int j=0;j<n;j++){
    a1[j] += b1[j];
}
for(int j=0;j<n;j++){
    c1[j] += d1[j];
}

Compiled on MS Visual C++ 10.0 with full optimization and SSE2 enabled for 32-bit on a Intel Core 2 Duo (x64), the first example takes 5.5 seconds and the double-loop example takes only 1.9 seconds. My question is: (Please refer to the my rephrased question at the bottom)

PS: I am not sure, if this helps:

Disassembly for the first loop basically looks like this (this block is repeated about five times in the full program):

movsd       xmm0,mmword ptr [edx+18h]
addsd       xmm0,mmword ptr [ecx+20h]
movsd       mmword ptr [ecx+20h],xmm0
movsd       xmm0,mmword ptr [esi+10h]
addsd       xmm0,mmword ptr [eax+30h]
movsd       mmword ptr [eax+30h],xmm0
movsd       xmm0,mmword ptr [edx+20h]
addsd       xmm0,mmword ptr [ecx+28h]
movsd       mmword ptr [ecx+28h],xmm0
movsd       xmm0,mmword ptr [esi+18h]
addsd       xmm0,mmword ptr [eax+38h]

Each loop of the double loop example produces this code (the following block is repeated about three times):

addsd       xmm0,mmword ptr [eax+28h]
movsd       mmword ptr [eax+28h],xmm0
movsd       xmm0,mmword ptr [ecx+20h]
addsd       xmm0,mmword ptr [eax+30h]
movsd       mmword ptr [eax+30h],xmm0
movsd       xmm0,mmword ptr [ecx+28h]
addsd       xmm0,mmword ptr [eax+38h]
movsd       mmword ptr [eax+38h],xmm0
movsd       xmm0,mmword ptr [ecx+30h]
addsd       xmm0,mmword ptr [eax+40h]
movsd       mmword ptr [eax+40h],xmm0

EDIT: The question turned out to be of no relevance, as the behavior severely depends on the sizes of the arrays (n) and the CPU cache. So if there is further interest, I rephrase the question:

Could you provide some solid insight into the details that lead to the different cache behaviors as illustrated by the five regions on the following graph?

It might also be interesting to point out the differences between CPU/cache architectures, by providing a similar graph for these CPUs.

PPS: The full code is at http://pastebin.com/ivzkuTzG. It uses TBB Tick_Count for higher resolution timing, which can be disabled by not defining the TBB_TIMING Macro.

(It shows FLOP/s for different values of n.)

enter image description here

Upon further analysis of this, I believe this is (at least partially) caused by data alignment of the four pointers. This will cause some level of cache bank/way conflicts.

If I've guessed correctly on how you are allocating your arrays, they are likely to be aligned to the page line.

This means that all your accesses in each loop will fall on the same cache way. However, Intel processors have had 8-way L1 cache associativity for a while. But in reality, the performance isn't completely uniform. Accessing 4-ways is still slower than say 2-ways.

EDIT : It does in fact look like you are allocating all the arrays separately. Usually when such large allocations are requested, the allocator will request fresh pages from the OS. Therefore, there is a high chance that large allocations will appear at the same offset from a page-boundary.

Here's the test code:

int main(){
    const int n = 100000;

#ifdef ALLOCATE_SEPERATE
    double *a1 = (double*)malloc(n * sizeof(double));
    double *b1 = (double*)malloc(n * sizeof(double));
    double *c1 = (double*)malloc(n * sizeof(double));
    double *d1 = (double*)malloc(n * sizeof(double));
#else
    double *a1 = (double*)malloc(n * sizeof(double) * 4);
    double *b1 = a1 + n;
    double *c1 = b1 + n;
    double *d1 = c1 + n;
#endif

    //  Zero the data to prevent any chance of denormals.
    memset(a1,0,n * sizeof(double));
    memset(b1,0,n * sizeof(double));
    memset(c1,0,n * sizeof(double));
    memset(d1,0,n * sizeof(double));

    //  Print the addresses
    cout << a1 << endl;
    cout << b1 << endl;
    cout << c1 << endl;
    cout << d1 << endl;

    clock_t start = clock();

    int c = 0;
    while (c++ < 10000){

#if ONE_LOOP
        for(int j=0;j<n;j++){
            a1[j] += b1[j];
            c1[j] += d1[j];
        }
#else
        for(int j=0;j<n;j++){
            a1[j] += b1[j];
        }
        for(int j=0;j<n;j++){
            c1[j] += d1[j];
        }
#endif

    }

    clock_t end = clock();
    cout << "seconds = " << (double)(end - start) / CLOCKS_PER_SEC << endl;

    system("pause");
    return 0;
}

Benchmark Results:

EDIT: Results on an actual Core 2 architecture machine:

2 x Intel Xeon X5482 Harpertown @ 3.2 GHz:

#define ALLOCATE_SEPERATE
#define ONE_LOOP
00600020
006D0020
007A0020
00870020
seconds = 6.206

#define ALLOCATE_SEPERATE
//#define ONE_LOOP
005E0020
006B0020
00780020
00850020
seconds = 2.116

//#define ALLOCATE_SEPERATE
#define ONE_LOOP
00570020
00633520
006F6A20
007B9F20
seconds = 1.894

//#define ALLOCATE_SEPERATE
//#define ONE_LOOP
008C0020
00983520
00A46A20
00B09F20
seconds = 1.993

Observations:

  • 6.206 seconds with one loop and 2.116 seconds with two loops. This reproduces the OP's results exactly.

  • In the first two tests, the arrays are allocated separately. You'll notice that they all have the same alignment relative to the page.

  • In the second two tests, the arrays are packed together to break that alignment. Here you'll notice both loops are faster. Furthermore, the second (double) loop is now the slower one as you would normally expect.

As @Stephen Cannon points out in the comments, there is very likely possibility that this alignment causes false aliasing in the load/store units or the cache. I Googled around for this and found that Intel actually has a hardware counter for partial address aliasing stalls:

http://software.intel.com/sites/products/documentation/hpc/amplifierxe/en-us/lin/ug_docs/reference/pmw_dp/events/partial_address_alias.html


5 Regions - Explanations

Region 1:

This one is easy. The dataset is so small that the performance is dominated by overhead like looping and branching.

Region 2:

Here, as the data sizes increases, the amount of relative overhead goes down and the performance "saturates". Here two loops is slower because it has twice as much loop and branching overhead.

I'm not sure exactly what's going on here... Alignment could still play an effect as Agner Fog mentions cache bank conflicts. (That link is about Sandy Bridge, but the idea should still be applicable to Core 2.)

Region 3:

At this point, the data no longer fits in L1 cache. So performance is capped by the L1 <-> L2 cache bandwidth.

Region 4:

The performance drop in the single-loop is what we are observing. And as mentioned, this is due to the alignment which (most likely) causes false aliasing stalls in the processor load/store units.

However, in order for false aliasing to occur, there must be a large enough stride between the datasets. This is why you don't see this in region 3.

Region 5:

At this point, nothing fits in cache. So you're bound by memory bandwidth.


2 x Intel X5482 Harpertown @ 3.2 GHz Intel Core i7 870 @ 2.8 GHz Intel Core i7 2600K @ 4.4 GHz

how to achieve 4 flops per cycle

85 votes

How can the theoretical peak performance of 4 floating point operations (double precision) per cycle be achieved on a modern x86-64 Intel cpu?

As far as I understand does it take 3 cycles for an sse add and 5 cycles for a mul to complete on most of the modern Intel cpu's (see e.g. Agner Fog's http://agner.org/optimize/instruction_tables.pdf). Due to pipelining one can get a throughput of 1 add per cycle if the algorithm has at least 3 independent summations. Since that is true for packed addpd as well as the scalar addsd versions and sse registers can contain 2 double's the throughput can be as much as 2 flops per cycle. Furthermore it seems (although I've not seen any proper doc on this) add's and mul's can be executed in parallel giving a theoretical max throughput of 4 flops per cycle.

However, I've not been able to replicate that performance with a simple c/c++ programme. My best attempt results in about 2.7 flops/cycle. If anyone can contribute a simple c/c++ or assembler programme which demonstrates peak performance that'd be greatly appreciated.

My attempt:

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <sys/time.h>

double stoptime(void) {
   struct timeval t;
   gettimeofday(&t,NULL);
   return (double) t.tv_sec + t.tv_usec/1000000.0;
}

double addmul(double add, double mul, int ops){
   // need to initialise differently otherwise compiler might optimise away
   double sum1=0.1, sum2=-0.1, sum3=0.2, sum4=-0.2, sum5=0.0;
   double mul1=1.0, mul2= 1.1, mul3=1.2, mul4= 1.3, mul5=1.4;
   int loops=ops/10;          // we have 10 floating point ops inside the loop
   double expected = 5.0*add*loops + (sum1+sum2+sum3+sum4+sum5)
               + pow(mul,loops)*(mul1+mul2+mul3+mul4+mul5);

   for(int i=0; i<loops; i++) {
      mul1*=mul; mul2*=mul; mul3*=mul; mul4*=mul; mul5*=mul;
      sum1+=add; sum2+=add; sum3+=add; sum4+=add; sum5+=add;
   }
   return  sum1+sum2+sum3+sum4+sum5+mul1+mul2+mul3+mul4+mul5 - expected;
}

int main(int argc, char** argv) {
   if(argc!=2) {
      printf("usage: %s <num>\n", argv[0]);
      printf("number of operations: <num> millions\n");
      exit(EXIT_FAILURE);
   }
   int n=atoi(argv[1])*1000000;
   if(n<=0) n=1000;

   double x=M_PI;
   double y=1.0+1e-8;
   double t=stoptime();
   x=addmul(x,y,n);
   t=stoptime()-t;
   printf("addmul:\t %.3f s, %.3f Gflops, res=%f\n",t,(double)n/t/1e9,x);

   return EXIT_SUCCESS;
}

Compiled with

g++ -O2 -march=native addmul.cpp ; ./a.out 1000

gives on an Intel Core i5-750, 2.66 GHz

addmul:  0.270 s, 3.707 Gflops, res=1.326463

i.e. just about 1.4 flops per cycle. Looking at the assembler code with g++ -S -O2 -march=native -masm=intel addmul.cpp the main loop seems kind of optimal to me:

.L4:
inc eax
mulsd   xmm8, xmm3
mulsd   xmm7, xmm3
mulsd   xmm6, xmm3
mulsd   xmm5, xmm3
mulsd   xmm1, xmm3
addsd   xmm13, xmm2
addsd   xmm12, xmm2
addsd   xmm11, xmm2
addsd   xmm10, xmm2
addsd   xmm9, xmm2
cmp eax, ebx
jne .L4

Changing the scalar versions with packed versions (addpd and mulpd) would double the flop count without changing the execution time and so I'd get just short of 2.8 flops per cycle. Any simple example which achieves 4 flops per cycle?

Edit:

Nice little programme by Mysticial, here are my results (run just for a few seconds though):

  • gcc -O2 -march=nocona: 5.6 Gflops out of 10.66 Gflops (2.1 flops/cycle)
  • cl /O2, openmp removed: 10.1 Gflops out of 10.66 Gflops (3.8 flops/cycle)

It all seems a bit complex but my conclusions so far:

  • gcc -O2 changes the order of independent floating point operations with the aim of alternating addpd and mulpd's if possible. Same applies to gcc-4.6.2 -O2 -march=core2.

  • gcc -O2 -march=nocona seems to keep the order of fp operations as defined in the C++ source.

  • cl /O2, the 64-bit compiler from the SDK for Windows 7 does loop-unrolling automatically and seems to try and arrange operations so that groups of 3 addpd's alternate with 3 mulpd's (well at least on my system and for my simple programme).

  • My Core i5 750 (Nahelem architecture) doesn't like alternating add's and mul's and seems unable to run both ops in parallel. However, if grouped in 3's it suddenly works like magic.

  • Other architectures (possibly Sandy Bridge and others) appear to be able to execute add/mul in parallel without problems if they alternate in the assembly code.

  • Although difficult to admit, but on my system cl /O2 does a much better job at low level optimising operations for my system and achieves close to peak performance for the little c++ example above. I measured between 1.85-2.01 flops/cycle (have used clock() in Windows which is not that precise I guess, need to use a better timer - thanks Mackie Messer).

  • The best I managed with gcc was to manually loop unroll and arrange additions and multiplications in groups of three. With g++ -O2 -march=nocona addmul_unroll.cpp I get at best 0.207s, 4.825 Gflops which corresponds to 1.8 flops/cycle which I'm quite happy with now.

In the c++ code I've replaced the for loop with

   for(int i=0; i<loops/3; i++) {
      mul1*=mul; mul2*=mul; mul3*=mul;
      sum1+=add; sum2+=add; sum3+=add;
      mul4*=mul; mul5*=mul; mul1*=mul;
      sum4+=add; sum5+=add; sum1+=add;

      mul2*=mul; mul3*=mul; mul4*=mul;
      sum2+=add; sum3+=add; sum4+=add;
      mul5*=mul; mul1*=mul; mul2*=mul;
      sum5+=add; sum1+=add; sum2+=add;

      mul3*=mul; mul4*=mul; mul5*=mul;
      sum3+=add; sum4+=add; sum5+=add;
   }

and the assembly now looks like

.L4:
mulsd   xmm8, xmm3
mulsd   xmm7, xmm3
mulsd   xmm6, xmm3
addsd   xmm13, xmm2
addsd   xmm12, xmm2
addsd   xmm11, xmm2
mulsd   xmm5, xmm3
mulsd   xmm1, xmm3
mulsd   xmm8, xmm3
addsd   xmm10, xmm2
addsd   xmm9, xmm2
addsd   xmm13, xmm2
...

I've done this exact task before. But it was mainly to measure power consumption and CPU temperatures. The following code (which is fairly long) achieves close to optimal on my Core i7 2600K.

The key thing to note here is the massive amount of manual loop-unrolling as well as interleaving of multiplies and adds...

Note that this is optimized for x64. x86 doesn't have enough registers for this to compile well.

Warning:

If you decide to compile and run this, pay attention to your CPU temperatures!!! Make sure you don't overheat it. And make sure CPU-throttling doesn't affect your results!

double test_dp_mac_SSE(double x,double y,size_t iterations){
    register __m128d r0,r1,r2,r3,r4,r5,r6,r7,r8,r9,rA,rB,rC,rD,rE,rF;

    //  Generate starting data.
    r0 = _mm_set1_pd(x);
    r1 = _mm_set1_pd(y);

    r8 = _mm_set1_pd(-0.0);

    r2 = _mm_xor_pd(r0,r8);
    r3 = _mm_or_pd(r0,r8);
    r4 = _mm_andnot_pd(r8,r0);
    r5 = _mm_mul_pd(r1,_mm_set1_pd(0.37796447300922722721));
    r6 = _mm_mul_pd(r1,_mm_set1_pd(0.24253562503633297352));
    r7 = _mm_mul_pd(r1,_mm_set1_pd(4.1231056256176605498));
    r8 = _mm_add_pd(r0,_mm_set1_pd(0.37796447300922722721));
    r9 = _mm_add_pd(r1,_mm_set1_pd(0.24253562503633297352));
    rA = _mm_sub_pd(r0,_mm_set1_pd(4.1231056256176605498));
    rB = _mm_sub_pd(r1,_mm_set1_pd(4.1231056256176605498));

    rC = _mm_set1_pd(1.4142135623730950488);
    rD = _mm_set1_pd(1.7320508075688772935);
    rE = _mm_set1_pd(0.57735026918962576451);
    rF = _mm_set1_pd(0.70710678118654752440);

    uint64 iMASK = 0x800fffffffffffffull;
    __m128d MASK = _mm_set1_pd(*(double*)&iMASK);
    __m128d vONE = _mm_set1_pd(1.0);

    size_t c = 0;
    while (c < iterations){
        size_t i = 0;
        while (i < 1000){
            //  Here's the meat - the part that really matters.

            r0 = _mm_mul_pd(r0,rC);
            r1 = _mm_add_pd(r1,rD);
            r2 = _mm_mul_pd(r2,rE);
            r3 = _mm_sub_pd(r3,rF);
            r4 = _mm_mul_pd(r4,rC);
            r5 = _mm_add_pd(r5,rD);
            r6 = _mm_mul_pd(r6,rE);
            r7 = _mm_sub_pd(r7,rF);
            r8 = _mm_mul_pd(r8,rC);
            r9 = _mm_add_pd(r9,rD);
            rA = _mm_mul_pd(rA,rE);
            rB = _mm_sub_pd(rB,rF);

            r0 = _mm_add_pd(r0,rF);
            r1 = _mm_mul_pd(r1,rE);
            r2 = _mm_sub_pd(r2,rD);
            r3 = _mm_mul_pd(r3,rC);
            r4 = _mm_add_pd(r4,rF);
            r5 = _mm_mul_pd(r5,rE);
            r6 = _mm_sub_pd(r6,rD);
            r7 = _mm_mul_pd(r7,rC);
            r8 = _mm_add_pd(r8,rF);
            r9 = _mm_mul_pd(r9,rE);
            rA = _mm_sub_pd(rA,rD);
            rB = _mm_mul_pd(rB,rC);

            r0 = _mm_mul_pd(r0,rC);
            r1 = _mm_add_pd(r1,rD);
            r2 = _mm_mul_pd(r2,rE);
            r3 = _mm_sub_pd(r3,rF);
            r4 = _mm_mul_pd(r4,rC);
            r5 = _mm_add_pd(r5,rD);
            r6 = _mm_mul_pd(r6,rE);
            r7 = _mm_sub_pd(r7,rF);
            r8 = _mm_mul_pd(r8,rC);
            r9 = _mm_add_pd(r9,rD);
            rA = _mm_mul_pd(rA,rE);
            rB = _mm_sub_pd(rB,rF);

            r0 = _mm_add_pd(r0,rF);
            r1 = _mm_mul_pd(r1,rE);
            r2 = _mm_sub_pd(r2,rD);
            r3 = _mm_mul_pd(r3,rC);
            r4 = _mm_add_pd(r4,rF);
            r5 = _mm_mul_pd(r5,rE);
            r6 = _mm_sub_pd(r6,rD);
            r7 = _mm_mul_pd(r7,rC);
            r8 = _mm_add_pd(r8,rF);
            r9 = _mm_mul_pd(r9,rE);
            rA = _mm_sub_pd(rA,rD);
            rB = _mm_mul_pd(rB,rC);

            i++;
        }

        //  Need to renormalize to prevent denormal/overflow.
        r0 = _mm_and_pd(r0,MASK);
        r1 = _mm_and_pd(r1,MASK);
        r2 = _mm_and_pd(r2,MASK);
        r3 = _mm_and_pd(r3,MASK);
        r4 = _mm_and_pd(r4,MASK);
        r5 = _mm_and_pd(r5,MASK);
        r6 = _mm_and_pd(r6,MASK);
        r7 = _mm_and_pd(r7,MASK);
        r8 = _mm_and_pd(r8,MASK);
        r9 = _mm_and_pd(r9,MASK);
        rA = _mm_and_pd(rA,MASK);
        rB = _mm_and_pd(rB,MASK);
        r0 = _mm_or_pd(r0,vONE);
        r1 = _mm_or_pd(r1,vONE);
        r2 = _mm_or_pd(r2,vONE);
        r3 = _mm_or_pd(r3,vONE);
        r4 = _mm_or_pd(r4,vONE);
        r5 = _mm_or_pd(r5,vONE);
        r6 = _mm_or_pd(r6,vONE);
        r7 = _mm_or_pd(r7,vONE);
        r8 = _mm_or_pd(r8,vONE);
        r9 = _mm_or_pd(r9,vONE);
        rA = _mm_or_pd(rA,vONE);
        rB = _mm_or_pd(rB,vONE);

        c++;
    }

    r0 = _mm_add_pd(r0,r1);
    r2 = _mm_add_pd(r2,r3);
    r4 = _mm_add_pd(r4,r5);
    r6 = _mm_add_pd(r6,r7);
    r8 = _mm_add_pd(r8,r9);
    rA = _mm_add_pd(rA,rB);

    r0 = _mm_add_pd(r0,r2);
    r4 = _mm_add_pd(r4,r6);
    r8 = _mm_add_pd(r8,rA);

    r0 = _mm_add_pd(r0,r4);
    r0 = _mm_add_pd(r0,r8);

    //  Prevent Dead Code Elimination
    double out = 0;
    out += ((double*)&r0)[0];
    out += ((double*)&r0)[1];

    return out;
}

void test_dp_mac_SSE(int tds,size_t iterations){

    double *sum = (double*)malloc(tds * sizeof(double));
    wclk start = wclk_now();

#pragma omp parallel num_threads(tds)
    {
        double ret = test_dp_mac_SSE(1.1,2.1,iterations);
        sum[omp_get_thread_num()] = ret;
    }

    double secs = wclk_secs_since(start);
    uint64 ops = 48 * 1000 * iterations * tds * 2;
    cout << "Seconds = " << secs << endl;
    cout << "FP Ops  = " << ops << endl;
    cout << "FLOPs   = " << ops / secs << endl;

    double out = 0;
    int c = 0;
    while (c < tds){
        out += sum[c++];
    }

    cout << "sum = " << out << endl;
    cout << endl;

    free(sum);
}

Output (1 thread, 10000000 iterations) - Compiled with Visual Studio 2010 SP1 - x64 Release:

Seconds = 55.5104
FP Ops  = 960000000000
FLOPs   = 1.7294e+010
sum = 2.22652

The machine is a Core i7 2600K @ 4.4 GHz. Theoretical SSE peak is 4 flops * 4.4 GHz = 17.6 GFlops. This code achieves 17.3 GFlops - not bad.

Output (8 threads, 10000000 iterations) - Compiled with Visual Studio 2010 SP1 - x64 Release:

Seconds = 117.202
FP Ops  = 7680000000000
FLOPs   = 6.55279e+010
sum = 17.8122

Theoretical SSE peak is 4 flops * 4 cores * 4.4 GHz = 70.4 GFlops. Actual is 65.5 GFlops.


Let's take this one step further. AVX...

double test_dp_mac_AVX(double x,double y,size_t iterations){
    register __m256d r0,r1,r2,r3,r4,r5,r6,r7,r8,r9,rA,rB,rC,rD,rE,rF;

    //  Generate starting data.
    r0 = _mm256_set1_pd(x);
    r1 = _mm256_set1_pd(y);

    r8 = _mm256_set1_pd(-0.0);

    r2 = _mm256_xor_pd(r0,r8);
    r3 = _mm256_or_pd(r0,r8);
    r4 = _mm256_andnot_pd(r8,r0);
    r5 = _mm256_mul_pd(r1,_mm256_set1_pd(0.37796447300922722721));
    r6 = _mm256_mul_pd(r1,_mm256_set1_pd(0.24253562503633297352));
    r7 = _mm256_mul_pd(r1,_mm256_set1_pd(4.1231056256176605498));
    r8 = _mm256_add_pd(r0,_mm256_set1_pd(0.37796447300922722721));
    r9 = _mm256_add_pd(r1,_mm256_set1_pd(0.24253562503633297352));
    rA = _mm256_sub_pd(r0,_mm256_set1_pd(4.1231056256176605498));
    rB = _mm256_sub_pd(r1,_mm256_set1_pd(4.1231056256176605498));

    rC = _mm256_set1_pd(1.4142135623730950488);
    rD = _mm256_set1_pd(1.7320508075688772935);
    rE = _mm256_set1_pd(0.57735026918962576451);
    rF = _mm256_set1_pd(0.70710678118654752440);

    uint64 iMASK = 0x800fffffffffffffull;
    __m256d MASK = _mm256_set1_pd(*(double*)&iMASK);
    __m256d vONE = _mm256_set1_pd(1.0);

    size_t c = 0;
    while (c < iterations){
        size_t i = 0;
        while (i < 1000){
            //  Here's the meat - the part that really matters.

            r0 = _mm256_mul_pd(r0,rC);
            r1 = _mm256_add_pd(r1,rD);
            r2 = _mm256_mul_pd(r2,rE);
            r3 = _mm256_sub_pd(r3,rF);
            r4 = _mm256_mul_pd(r4,rC);
            r5 = _mm256_add_pd(r5,rD);
            r6 = _mm256_mul_pd(r6,rE);
            r7 = _mm256_sub_pd(r7,rF);
            r8 = _mm256_mul_pd(r8,rC);
            r9 = _mm256_add_pd(r9,rD);
            rA = _mm256_mul_pd(rA,rE);
            rB = _mm256_sub_pd(rB,rF);

            r0 = _mm256_add_pd(r0,rF);
            r1 = _mm256_mul_pd(r1,rE);
            r2 = _mm256_sub_pd(r2,rD);
            r3 = _mm256_mul_pd(r3,rC);
            r4 = _mm256_add_pd(r4,rF);
            r5 = _mm256_mul_pd(r5,rE);
            r6 = _mm256_sub_pd(r6,rD);
            r7 = _mm256_mul_pd(r7,rC);
            r8 = _mm256_add_pd(r8,rF);
            r9 = _mm256_mul_pd(r9,rE);
            rA = _mm256_sub_pd(rA,rD);
            rB = _mm256_mul_pd(rB,rC);

            r0 = _mm256_mul_pd(r0,rC);
            r1 = _mm256_add_pd(r1,rD);
            r2 = _mm256_mul_pd(r2,rE);
            r3 = _mm256_sub_pd(r3,rF);
            r4 = _mm256_mul_pd(r4,rC);
            r5 = _mm256_add_pd(r5,rD);
            r6 = _mm256_mul_pd(r6,rE);
            r7 = _mm256_sub_pd(r7,rF);
            r8 = _mm256_mul_pd(r8,rC);
            r9 = _mm256_add_pd(r9,rD);
            rA = _mm256_mul_pd(rA,rE);
            rB = _mm256_sub_pd(rB,rF);

            r0 = _mm256_add_pd(r0,rF);
            r1 = _mm256_mul_pd(r1,rE);
            r2 = _mm256_sub_pd(r2,rD);
            r3 = _mm256_mul_pd(r3,rC);
            r4 = _mm256_add_pd(r4,rF);
            r5 = _mm256_mul_pd(r5,rE);
            r6 = _mm256_sub_pd(r6,rD);
            r7 = _mm256_mul_pd(r7,rC);
            r8 = _mm256_add_pd(r8,rF);
            r9 = _mm256_mul_pd(r9,rE);
            rA = _mm256_sub_pd(rA,rD);
            rB = _mm256_mul_pd(rB,rC);

            i++;
        }

        //  Need to renormalize to prevent denormal/overflow.
        r0 = _mm256_and_pd(r0,MASK);
        r1 = _mm256_and_pd(r1,MASK);
        r2 = _mm256_and_pd(r2,MASK);
        r3 = _mm256_and_pd(r3,MASK);
        r4 = _mm256_and_pd(r4,MASK);
        r5 = _mm256_and_pd(r5,MASK);
        r6 = _mm256_and_pd(r6,MASK);
        r7 = _mm256_and_pd(r7,MASK);
        r8 = _mm256_and_pd(r8,MASK);
        r9 = _mm256_and_pd(r9,MASK);
        rA = _mm256_and_pd(rA,MASK);
        rB = _mm256_and_pd(rB,MASK);
        r0 = _mm256_or_pd(r0,vONE);
        r1 = _mm256_or_pd(r1,vONE);
        r2 = _mm256_or_pd(r2,vONE);
        r3 = _mm256_or_pd(r3,vONE);
        r4 = _mm256_or_pd(r4,vONE);
        r5 = _mm256_or_pd(r5,vONE);
        r6 = _mm256_or_pd(r6,vONE);
        r7 = _mm256_or_pd(r7,vONE);
        r8 = _mm256_or_pd(r8,vONE);
        r9 = _mm256_or_pd(r9,vONE);
        rA = _mm256_or_pd(rA,vONE);
        rB = _mm256_or_pd(rB,vONE);

        c++;
    }

    r0 = _mm256_add_pd(r0,r1);
    r2 = _mm256_add_pd(r2,r3);
    r4 = _mm256_add_pd(r4,r5);
    r6 = _mm256_add_pd(r6,r7);
    r8 = _mm256_add_pd(r8,r9);
    rA = _mm256_add_pd(rA,rB);

    r0 = _mm256_add_pd(r0,r2);
    r4 = _mm256_add_pd(r4,r6);
    r8 = _mm256_add_pd(r8,rA);

    r0 = _mm256_add_pd(r0,r4);
    r0 = _mm256_add_pd(r0,r8);

    //  Prevent Dead Code Elimination
    double out = 0;
    out += ((double*)&r0)[0];
    out += ((double*)&r0)[1];
    out += ((double*)&r0)[2];
    out += ((double*)&r0)[3];

    return out;
}

void test_dp_mac_AVX(int tds,size_t iterations){

    double *sum = (double*)malloc(tds * sizeof(double));
    wclk start = wclk_now();

#pragma omp parallel num_threads(tds)
    {
        double ret = test_dp_mac_AVX(1.1,2.1,iterations);
        sum[omp_get_thread_num()] = ret;
    }

    double secs = wclk_secs_since(start);
    uint64 ops = 48 * 1000 * iterations * tds * 4;
    cout << "Seconds = " << secs << endl;
    cout << "FP Ops  = " << ops << endl;
    cout << "FLOPs   = " << ops / secs << endl;

    double out = 0;
    int c = 0;
    while (c < tds){
        out += sum[c++];
    }

    cout << "sum = " << out << endl;
    cout << endl;

    free(sum);
}

Output (1 thread, 10000000 iterations) - Compiled with Visual Studio 2010 SP1 - x64 Release:

Seconds = 57.4679
FP Ops  = 1920000000000
FLOPs   = 3.34099e+010
sum = 4.45305

Theoretical AVX peak is 8 flops * 4.4 GHz = 35.2 GFlops. Actual is 33.4 GFlops.

Output (8 threads, 10000000 iterations) - Compiled with Visual Studio 2010 SP1 - x64 Release:

Seconds = 111.119
FP Ops  = 15360000000000
FLOPs   = 1.3823e+011
sum = 35.6244

Theoretical AVX peak is 8 flops * 4 cores * 4.4 GHz = 140.8 GFlops. Actual is 138.2 GFlops.


Now for some explanations:

The performance critical part is obviously the 48 instructions inside the inner loop. You'll notice that it's broken into 4 blocks of 12 instructions each. Each of these 12 instructions blocks are completely independent from each other - and take on average 6 cycles to execute.

So there's 12 instructions and 6 cycles between issue-to-use. The latency of multiplication is 5 cycles, so it's just enough to avoid latency stalls.

The normalization step is needed to keep the data from over/underflowing. This is needed since the do-nothing code will slowly increase/decrease the magnitude of the data.

So it's actually possible to do better than this if you just use all zeros and get rid of the normalization step. However, since I wrote the benchmark to measure power consumption and temperature, I had to make sure the flops were on "real" data, rather than zeros - as the execution units may very well have special case-handling for zeros that use less power and produce less heat.


More Results:

  • Intel Core i7 920 @ 3.5 GHz
  • Windows 7 Ultimate x64
  • Visual Studio 2010 SP1 - x64 Release

Threads: 1

Seconds = 72.1116
FP Ops  = 960000000000
FLOPs   = 1.33127e+010
sum = 2.22652

Theoretical SSE Peak: 4 flops * 3.5 GHz = 14.0 GFlops. Actual is 13.3 GFlops.

Threads: 8

Seconds = 149.576
FP Ops  = 7680000000000
FLOPs   = 5.13452e+010
sum = 17.8122

Theoretical SSE Peak: 4 flops * 4 cores * 3.5 GHz = 56.0 GFlops. Actual is 51.3 GFlops.

My processor temps hit 76C on the multi-threaded run! If you runs these, be sure the results aren't affected by CPU throttling.


  • 2 x Intel Xeon X5482 Harpertown @ 3.2 GHz
  • Ubuntu Linux 10 x64
  • GCC 4.5.2 x64 - (-O2 -msse3 -fopenmp)

Threads: 1

Seconds = 78.3357
FP Ops  = 960000000000
FLOPs   = 1.22549e+10
sum = 2.22652

Theoretical SSE Peak: 4 flops * 3.2 GHz = 12.8 GFlops. Actual is 12.3 GFlops.

Threads: 8

Seconds = 78.4733
FP Ops  = 7680000000000
FLOPs   = 9.78676e+10
sum = 17.8122

Theoretical SSE Peak: 4 flops * 8 cores * 3.2 GHz = 102.4 GFlops. Actual is 97.9 GFlops.

What's wrong with this 1988 C code?

Asked on Tue, 27 Dec 2011 by César c
76 votes

I'm trying to compile this piece of code from the book "The C Programming Language" (K & R). It is a bare-bones version of the UNIX program wc:

#include <stdio.h>

#define IN   1;     /* inside a word */
#define OUT  0;     /* outside a word */

/* count lines, words and characters in input */
main() {
    int c, nl, nw, nc, state;

    state = OUT;
    nl = nw = nc = 0;
    while ((c = getchar()) != EOF) {
        ++nc;
        if (c == '\n')
            ++nl;
        if (c == ' ' || c == '\n' || c == '\t')
            state = OUT;
        else if (state == OUT) {
            state = IN;
            ++nw;
        }
    }
    printf("%d %d %d\n", nl, nw, nc);
}

And I'm getting the following error:

$ gcc wc.c 
wc.c: In function ‘main’:
wc.c:18: error: ‘else’ without a previous ‘if’
wc.c:18: error: expected ‘)’ before ‘;’ token

The 2nd edition of this book is from 1988 and I'm pretty new to C. Maybe it has to do with the compiler version or maybe I'm just talking nonsense.

I've seen in modern C code a different use of the main function:

int main() {
    /* code */
    return 0;
}

Is this a new standard or can I still use a type-less main?

Your problem is with your preprocessor definitions of IN and OUT:

#define IN   1;     /* inside a word */
#define OUT  0;     /* outside a word */

Notice how you have a trailing semicolon in each of these. When the preprocessor expands them, your code will look roughly like:

    if (c == ' ' || c == '\n' || c == '\t')
        state = 0;; /* <--PROBLEM #1 */
    else if (state == 0;) { /* <--PROBLEM #2 */
        state = 1;;

That second semicolon causes the else to have no previous if as a match, because you are not using braces. So, remove the semicolons from the preprocessor definitions of IN and OUT.

The lesson learned here is that preprocessor statements do not have to end with a semicolon.

Also, you should always use braces!

    if (c == ' ' || c == '\n' || c == '\t') {
        state = OUT;
    } else if (state == OUT) {
        state = IN;
        ++nw;
    }

There is no hanging-else ambiguity in the above code.

Why does 'main' not return 0 here?

33 votes

I was just reading ISO/IEC 9899:201x Committee Draft — April 12, 2011 in which I found under 5.1.2.2.3 Program termination:

..reaching the } that terminates the main function returns a value of 0.

It means if you don't specify any return statement in main(), and if the program runs successfully, then at the closing brace, }, of main() will return 0.

But in the following code I don't specify any return statement, yet it does not return 0.

#include<stdio.h>
int sum(int a,int b)
{
    return (a + b);
}

int main()
{
    int a=10;
    int b=5;
    int ans;
    ans=sum(a,b);
    printf("sum is %d",ans);
}

Compile:

gcc test.c
./a.out
sum is 15
echo $?
9          // Here it should be 0, but it shows 9. Why?

That rule was added in the 1999 version of the C standard. In C90, the status returned is undefined.

You can enable it by passing -std=c99 to gcc.

As a side note, interestingly 9 is returned because it's the return of printf which just wrote 9 characters.

How can I get what my main function has returned?

31 votes

In a C program if we want to give some input from terminal then we can give it by:

int main(int argc, char *argv[])

In the same way, if we want to get return value of main() function then how can we get it?

In each main() we write return 1 or return 0; how can I know what my main() has returned at terminal?

Edit:1

i get it that by echo $? we can get the return value of main() but it only allows to return value less then 125 (in linux) successfully, return value more then that can not be be succesfully received by $ variable so

why return type of main() is int ? why dont keep it short int ?

Edit2

from where i can finout the meaning of error code if main() has return more then 125 value ?

Most shells store the exit code of the previous run command in $? so you can store or display it.

$ ./a.out
$ echo $?     # note - after this command $? contains the exit code of echo!

or

$ ./a.out
$ exit_code=$?    # save the exit code in another shell variable.

Note that under linux, although you return an int, generally only values less than 126 are safe to use. Higher values are reserved to record other errors that might occur when attempting to run a command or to record which signal, if any, terminated your program.

Understanding confusing typedef grammar

27 votes

Consider the following code-snippet

typedef int type;
int main()
{
   type *type; // why is it allowed?
   type *k ;// which type?
}

I get an error 'k' is not declared in this scope. The compiler parses type *k as multiplication between type* and k. Isn't this grammar very confusing?

Why is type *type allowed by the C++ Standard? Because the grammar says so? Why?

The question is actually about when exactly a variable name is defined as an identifier, and the language determines that it is right after the point in code where the variable is declared:

typedef int type;
int main() {
   type t;   // type refers to ::type
   int       // type still refers to ::type
   type;     // variable declared, this shadows ::type
   type + 1; // type is a variable of type int.
}

There are similar rules in other contexts, and it is just a matter of deciding when identifiers are declared. There are other similar situations, for example in the initialization list of a class:

struct test {
   int x;          // declare member
   test( int x )   // declare parameter (shadows member)
   : x(            // refers to member (parameter is not legal here)
        x )        // refers to parameter
   {};
};

Or in the scope of the identifiers in the definition of member functions:

struct test {
   typedef int type;
   type f( type );
};
test::type         // qualification required, the scope of the return type is
                   // at namespace level
test::f(
         type t )  // but the scope of arguments is the class, no qualification
                   // required.
{}

As of the rationale for the decision, I cannot tell you but it is consistent and simple.

Does exception handling require object-oriented programming?

27 votes

At this point in my programming experience, I realize how spoiled I am to have exception handling available in most languages being used today (C++, .Net, Java, etc), at least in comparison to C. I am getting ready to take an advanced C course and has me really thinking those terms in comparison to my current paradigm.

In C, it's up to the programmer to prevent errors from ever occuring in the first place, which is quite daunting for anybody who is used to exception handling. It has occured to me that any language that I have come across that has exception handling happens to be object oriented. The first object oriented language to have exception handling, at least to my knowledge, is C++ which is sort of an evolution of C. (please correct me if I am wrong)

With that said, is there something about the object oriented nature of a language that allows exception handling, or was exception handling added as a feature as object oriented languages really started becoming a commonplace? What is that C lacks that to say, C++, in machine code that makes excpetion work?

I found this post about how exception handling works under the hood, but not sure how that information applies to my question (ie, does C lack notifications, continuations, etc?). Thanks in advance.

C lacks nothing in machine code, and exception handling was and is commonplace in C with setjmp and longjmp.

The reason for the complete lack of a language-level feature in purely procedural languages is that exception handling is identical to setjmp when no destructors need to be called. Exception handling has been around before in exotic languages, but never caught on because it was purely syntactic sugar. However, once destructors entered the scene and stack unwinding became necessary, language-level support became necessary and exception handling was widely implemented as part of the language.

To infinity and back

26 votes

There are mathematical operations that yield real numbers from +/- infinity. For example exp(-infinity) = 0. Is there a standard for mathematical functions in the standard C library that accept IEEE-754 infinities (without throwing, or returning NaN). I am on a linux system and would be interested in such a list for glibc. I could not find such a list in their online manual. For instance their documentation on exp does not mention how it handles the -infinity case. Any help will be much appreciated.

The See Also section of POSIX' math.h definition links to the POSIX definitions of acceptable domains.

E.g. fabs():

If x is ±0, +0 shall be returned.
If x is ±Inf, +Inf shall be returned.

I converted mentioned See Also-section to StackOverflow-Markdown:

acos(), acosh(), asin(), atan(), atan2(), cbrt(), ceil(), cos(), cosh(), erf(), exp(), expm1(), fabs(), floor(), fmod(), frexp(), hypot(), ilogb(), isnan(), j0(), ldexp(), lgamma(), log(), log10(), log1p(), logb(), modf(), nextafter(), pow(), remainder(), rint(), scalb(), sin(), sinh(), sqrt(), tan(), tanh(), y0(),

I contributed search/replace/regex-fu. We now just need someone with cURL-fu.

Rare cases where MACROs must be used

22 votes

Debugging macros can take a lot of time. We are much better off avoiding them except in the very rare cases when neither constants, functions nor templates can do what we want.

What are the rare cases?

If you want actual textual replacement, that's where you use macros. Take a look at Boost.Preprocessor, it's a great way to simulate variadic templates in C++03 without repeating yourself too much.

In other words, if you want to manipulate the program code itself, use macros.

Another useful application is assert, which is defined to be a no-op when NDEBUG is not defined (usually release mode compilation).

That brings us to the next point, which is a specialization of the first one: Different code with different compilation modes, or between different compilers. If you want cross-compiler support, you can't get away without macros. Take a look at Boost in general, it needs macros all the time because of various deficiencies in various compilers it has to support.

Repeated typedefs - invalid in C but valid in C++?

21 votes

I would like a standard reference why the following code triggers a compliance warning in C (tested with gcc -pedantic; "typedef redefinition"), but is fine in C++ (g++ -pedantic):

typedef struct Foo Foo;
typedef struct Foo Foo;

int main() { return 0; }

Why can I not define a typedef repeatedly in C?

(This has practical implications for the header structuring of a C project.)

Why does this compile in C++?

Because the C++ Standard explicitly says so.

Reference:

C++03 Standard 7.1.3 typedef specifier

§7.1.3.2:

In a given non-class scope, a typedef specifier can be used to redefine the name of any type declared in that scope to refer to the type to which it already refers.

[Example:
typedef struct s { /* ... */ } s;
typedef int I;
typedef int I;
typedef I I;
—end example]

Why does this fail to compile in C?

typedef names have no linkage and C99 standard disallows identifiers with no linkage specification to have more than one declaration with the same scope and in the same name space.

Reference:

C99 Standard: §6.2.2 Linkages of identifiers

§6.2.2/6 states:

The following identifiers have no linkage: an identifier declared to be anything other than an object or a function; an identifier declared to be a function parameter; a block scope identifier for an object declared without the storage-class specifierextern.

Further §6.7/3 states:

If an identifier has no linkage, there shall be no more than one declaration of the identifier (in a declarator or type specifier) with the same scope and in the same name space, except for tags as specified in 6.7.2.3.

What is uint_fast32_t and why should it be used instead of the regular int and uint32_t?

20 votes

So the reason for typedef:ed primitive data types is to abstract the low-level representation and make it easier to comprehend (uint64_t instead of long long type, which is 8 bytes).

However, there is uint_fast32_t which has the same typedef as uint32_t. Will using the "fast" version make the program faster?

Edit: You can look up the data types here.

  • int may be as small as 16-bits on some platforms. It may not be sufficient for your application.
  • uint32_t is not guaranteed to exist. It's an optional typedef that the implementation must provide iff it has an unsigned integer type of exactly 32-bit. Some have a 9-bit bytes for example, so they don't have a uint32_t.
  • uint_fast32_t states your intent clearly: it's a type of at least 32 bits which is the best from performance point-of-view. uint_fast32_t may be in fact 64 bit long. It's up to the implementation.

See also: Exotic architectures the standard committee cares about.

... there is uint_fast32_t which has the same typedef as uint32_t ...

What you are looking at is not the standard. It's a particular implementation (BlackBerry). So you can't deduce from there that uint_fast32_t is always the same as uint32_t.

Why does C need "struct" keyword and not C++?

16 votes

I've always been a little confused about what's going on here:

#include <stdio.h>

int main() {  
    timeval tv;
    tv.tv_sec = 1;

    for (;;) {
        select(0, 0, 0, 0, &tv);
        printf("%s\n", "Hello World!");
    }
}

Sorry if that doesn't compile, just wrote it as a quick example.

Code like this won't compile under gcc unless I add the keyword struct prior to the use of the struct timeval. g++ on the other hand handles it fine as is.

Is this a difference between how C and C++ handle structures or is it just a difference in the compilers? (I'm very C++ oriented, and the use of struct in C on lines like this has always somewhat baffled me).

Syntactically both treat struct almost the same. Only C++ has added an extra rule that allows to omit the struct (and class) keyword if there is no ambiguity.

If there is ambiguity, also C++ requires the struct keyword in some places. A notorious example is stat on POSIX systems where there is a struct stat and a function stat.

Why is 0 (zero) printed without leading "0x" with C printf format "%#x"?

16 votes

Background: I have a number of scripts that parse log files looking for hex numbers by finding the leading "0x". Our embedded C library changed to a new printf. The new printf is more standards compliant than our previous and my scripts broke.

On a Linux box:

#include <stdio.h>
int main( void )
{
    printf( "%#010x\n", 0 );
    printf( "%#010x\n", 1 );
    return 0;
}

Output (using glibc) is:

0000000000
0x00000001

Output on our firmwware was:

0x00000000
0x00000001

From printf(3), on the '#' flag character: "For x and X conversions, a nonzero result has the string "0x" (or "0X" for X conversions) prepended to it."

I'm curious why. Without digging through the C standards documents or buying lunch for standards committee members, why not a leading 0x on a zero valued argument?

The standard seems to be written this way:

  • %#x and %#o try to guarantee that the output can be parsed correctly using strtol with base = 0.

  • In these cases, the # flag adds as few extra characters as possible. For example, 0 is printed as 0 because there is no need to add the extra 0x. This makes a lot of sense if you do not specify the minimum field width and 0-padding.

  • If you wanted to add 0x always, you could simply write something like 0x%x. Hence %#x seems to be needed only in those special cases in which you really want the special handling of 0.

Now the way %#x is specified to work makes a lot of sense in isolation. And the way something like %010x is specified to work makes a lot of sense in isolation. You are combining these two modifiers, and the end result is, arguably, strange.

But there is no need to combine %#x and %010x. You could just write 0x%08x to do what you want.

double negation in C : is it guaranteed to return 0/1?

16 votes

Is !!(x) guaranteed by the standard to return 0/1?

Note that I am not asking about c++, where a bool type is defined.

Yes, in C99, see §6.5.3.3/4:

The result of the logical negation operator ! is 0 if the value of its operand compares unequal to 0, 1 if the value of its operand compares equal to 0. The result has type int. The expression !E is equivalent to (0==E).

So !x and !!y can only yield 0 or 1, as ints.

For other operators, in C99, see also Is the "true" result of >, <, !, &&, || or == defined?

Why 1103515245 is used in rand?

13 votes

I'm talking about this surprisingly simple implementation of rand() from the C standard:

static unsigned long int next = 1;

int rand(void)  /* RAND_MAX assumed to be 32767. */
{
    next = next * 1103515245 + 12345;
    return (unsigned)(next/65536) % 32768;
}

From this Wikipedia article we know that the multiplier a (in above code a = 1103515245) should fulfill only 2 conditions:
1. a - 1 is divisible by all prime factors of m.
(In our case m = 2^32, size of the int, so m has only one prime factor = 2)
2. a - 1 is a multiple of 4 if m is a multiple of 4.
(32768 is multiple of 4, and 1103515244 too)

Why they have chosen such a strange, hard-to-remember, "man, I'm fed up with these random numbers, write whatever" number, like 1103515245?
Maybe there are some wise reasons, that this number is somehow better then the other?

For example, why don't set a = 20000000001? It is bigger, cool-looking and easier to remember.

If you use a LCG to draw points on the d dimensional space, they will lie on at most m1/d hyperplanes. This is a known defect of LCGs.

If you don't chose carefully a and m (beyond the condition for full periodicity), they may lie on much fewer planes than that. Those numbers have been selected by what is called the spectral test.

The "spectral test" (the name comes from number theory) is the maximum distance between consecutive hyperplanes on which d-dimensional joint distributions lie. You want it to be as small as possible for as many d as you can test.

See this paper for a historical review on the topic. Note that the generator you quote is mentioned in the paper (as ANSIC) and deemed as not very good. The high order 16 bits are acceptable however, but many applications will need more than 32768 distinct values (as you point out in the comments, the period is indeed 2^31 -- the conditions for full periodicity in Wikipedia's link are probably only necessary).

The original source code in the ANSI document did not take the high order 16 bits, yielding a very poor generator which is easy to misuse (rand() % n is what people first think of to draw a number between 0 and n, and this yields something very non random in this case).

See also the discussion on LCGs in Numerical Recipes. Quoting:

Even worse, many early generators happened to make particularly bad choices for m and a. One infamous such routine, RANDU, with a = 65539 and m = 231, was widespread on IBM mainframe computers for many years, and widely copied onto other systems. One of us recalls as a graduate student producing a “random” plot with only 11 planes and being told by his computer center’s programming consultant that he had misused the random number generator: “We guarantee that each number is random individually, but we don’t guarantee that more than one of them is random.” That set back our graduate education by at least a year!

how can I detect whether a specific page is mapped in memory?

9 votes

I would like to detect whether or not a specific page has already been mapped in memory. The goal here is to be able to perform this check before calling mmap with a fixed memory address. The following code illustrates what happens in this case by default: mmap silently remaps the original memory pages.

#include <sys/mman.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
  int page_size;
  void *ptr;
  page_size = getpagesize();
  ptr = mmap(0, 10 * page_size, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);
  if (ptr == MAP_FAILED) {
    printf ("map1 failed\n");
    return 1;
  }
  ((int *)ptr)[0] = 0xdeadbeaf;
  ptr = mmap(ptr, 2 * page_size, PROT_READ, MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, 0, 0);
  if (ptr == MAP_FAILED) {
    printf ("map2 failed\n");
    return 1;
  }
  if (((int *)ptr)[0] != 0xdeadbeaf) {
    printf ("oops, data gone !\n");
  }
  return 0;
}

I understand that I could open and parse /proc/self/maps to figure out which memory range has been allocated and infer from that if I can safely request a specific memory range with mmap but I am looking for a proper API: is there such a thing ?

msync(addr, len, 0) and checking for ENOMEM seems to work (with a fairly superficial test).

Int to Double casting issue

7 votes

I'm an Objective-C developer with little C/C++ experience (and zero training), and I encountered something strange today with hard coded numeric values.

I'm sure it's a simple/stupid question, but can someone please explain why this works:

NSDate *start = [NSDate date];
dispatch_time_t popTime = dispatch_time(DISPATCH_TIME_NOW, 1 * NSEC_PER_SEC);

dispatch_after(popTime, dispatch_get_main_queue(), ^{
  NSLog(@"seconds: %f", [start timeIntervalSinceNow]);
});
// output: seconds: -1.0001

And this also works (note number of seconds has changed):

NSDate *start = [NSDate date];
dispatch_time_t popTime = dispatch_time(DISPATCH_TIME_NOW, 2 * NSEC_PER_SEC);

dispatch_after(popTime, dispatch_get_main_queue(), ^{
  NSLog(@"seconds: %f", [start timeIntervalSinceNow]);
});
// output: seconds: -2.0001

But this is executed immediately:

NSDate *start = [NSDate date];
dispatch_time_t popTime = dispatch_time(DISPATCH_TIME_NOW, 4 * NSEC_PER_SEC);

dispatch_after(popTime, dispatch_get_main_queue(), ^{
  NSLog(@"seconds: %f", [start timeIntervalSinceNow]);
});
// output: seconds: -0.0001

However, using 4.0 instead of 4 fixes it:

NSDate *start = [NSDate date];
dispatch_time_t popTime = dispatch_time(DISPATCH_TIME_NOW, 4.0 * NSEC_PER_SEC);

dispatch_after(popTime, dispatch_get_main_queue(), ^{
  NSLog(@"seconds: %f", [start timeIntervalSinceNow]);
});
// output: seconds: -4.0001

Why do 1 and 2 properly cast to the relevant double value, but bigger numbers (I tested 3 and 4) appear to be represented as 0?

I'm compiling with Xcode 4.2, configured to use LLVM 3.0.

EDIT:

dispatch_time_t is defined as:

typedef uint64_t dispatch_time_t;

And dispatch_time is:

dispatch_time_t dispatch_time(dispatch_time_t when, int64_t delta);

And NSEC_PER_SEC is:

#define NSEC_PER_SEC    1000000000  /* nanoseconds per second */

There are 1,000,000,000 nanoseconds in a second, so I'm going to assume that NSEC_PER_SEC is defined as 1000000000.

  • 4 is of type int
  • 4.0 is of type double

Now assuming that an int contains 32 bits, the range of an int would be [-2,147,483,648 to 2,147,483,647]

4000000000 > 2147483647, therefore you'll cause the int to overflow, which is causing the value to be set to 0.

EDIT: I probably could've worded the above statement better. The overflow could cause the int (assuming it's 32 bits in size, as stated above) to equal the value -294967296, and dispatch_time would be treating any value <= 0 as 0 seconds. That's where the "0" above came from.

A double variable can hold larger values than an int, and is able to store an approximation of the value 4000000000.

C standard I/O vs UNIX I/O basics

7 votes

Here's a very basic question I have. In my professor's lecture slide, there is a example I dont really get.

She wrote:

printf("u"); 
write(STDOUT_FILENO, "m", 1); 
printf("d\n");

...and she said the out put of this code would be:

mud

I don't get it. So if anyone understand why this happens, please explain to me.

Reference this question:

http://lagoon.cs.umd.edu/216/Lectures/lect17.pdf

(in the second last slide page.)

write is a system call -- it is implemented by the interface between user mode (where programs like yours run) and the operating system kernel (which handles the actual writing to disk when bytes are written to a file).

printf is a C standard library function -- it is implemented by library code loaded into your user mode program.

The C standard library output functions buffer their output, by default until end-of-line is reached. When the buffer is full or terminated with a newline, it is written to the file via a call to write from the library implementation.

Therefore, the output via printf is not sent to the operating system write immediately. In your example, you buffer the letter 'u', then immediately write the letter 'm', then append "d\n" to the buffer and the standard library makes the call write(STDOUT_FILENO, "ud\n");

Reading raw bytes from a serial port

7 votes

I'm trying to read raw bytes from a serial port sent by a IEC 870-5-101 win32 protocol simulator with a program written in C running on Linux 32bit.

It's working fine for byte values like 0x00 - 0x7F. But for values beginning from 0x80 to 0xAF the high bit is wrong, e.g.:

0x7F -> 0x7F //correct
0x18 -> 0x18 //correct
0x79 -> 0x79 //correct
0x80 -> 0x00 //wrong
0xAF -> 0x2F //wrong
0xFF -> 0x7F //wrong

After digging around for two days now, I have no idea, what's causing this.

This is my config of the serial port:

    cfsetispeed(&config, B9600);
    cfsetospeed(&config, B9600);

    config.c_cflag |= (CLOCAL | CREAD);

    config.c_cflag &= ~CSIZE;                               /* Mask the character size bits */
    config.c_cflag |= (PARENB | CS8);                       /* Parity bit Select 8 data bits */

    config.c_cflag &= ~(PARODD | CSTOPB);                   /* even parity, 1 stop bit */


    config.c_cflag |= CRTSCTS;                              /*enable RTS/CTS flow control - linux only supports rts/cts*/


    config.c_iflag &= ~(IXON | IXOFF | IXANY);              /*disable software flow control*/ 

    config.c_oflag &= ~OPOST;                               /* enable raw output */
    config.c_lflag &= ~(ICANON | ECHO | ECHOE | ISIG);      /* enable raw input */

    config.c_iflag &= ~(INPCK | PARMRK);                    /* DANGEROUS no parity check*/
    config.c_iflag |= ISTRIP;                               /* strip parity bits */
    config.c_iflag |= IGNPAR;                               /* DANGEROUS ignore parity errors*/

    config.c_cc[VTIME] = 1;                                 /*timeout to read a character in tenth of a second*/

I'm reading from the serial port with:

*bytesread = read((int) fd, in_buf, BytesToRead);

Right after this operation "in_buf" contains the wrong byte, so I guess there's something wrong with my config, which is a port from a win32 DCB structure.

Thanks for any ideas!

Based on your examples, only the 8th bit (the high bit) is wrong, and it's wrong by being always 0. You are setting ISTRIP in your line discipline on the Linux side, and that would cause this. ISTRIP does not, as the comment in the C code claims, strip parity bits. It strips the 8th data bit.

If ISTRIP is set, valid input bytes shall first be stripped to seven bits; otherwise, all eight bits shall be processed. IEEE Std 1003.1, 2004 Edition, chapter 11, General Terminal Interface