Best c questions in June 2011

Why doesn't GCC optimize a*a*a*a*a*a to (a*a*a)*(a*a*a)?

389 votes

I am doing some numerical optimization on a scientific application. One thing I noticed is that, GCC will not recognize pow(a,6), and will call the library function pow, (although it recognizes pow(a,2)), which slows down the performance extremely. (Intel C++ Compiler, executable icc, will recognize it anyway.) So I replaced it by a*a*a*a*a*a. With GCC 4.5.1 and options "-O3 -lm -funroll-loops -msse4" it will do 5 mulsd:

movapd  %xmm14, %xmm13
mulsd   %xmm14, %xmm13
mulsd   %xmm14, %xmm13
mulsd   %xmm14, %xmm13
mulsd   %xmm14, %xmm13
mulsd   %xmm14, %xmm13

while if I write (a*a*a)*(a*a*a), it will have

movapd  %xmm14, %xmm13
mulsd   %xmm14, %xmm13
mulsd   %xmm14, %xmm13
mulsd   %xmm13, %xmm13

which reduce the multiple to 3. icc has similar behavior.

Why cannot compilers recognize such an optimization trick?

Because Floating Point Math is not Associative. The way you group the operands in floating point multiplication has an effect on the numerical accuracy of the answer.

As a result, most compilers are very conservative about reordering floating point calculations unless they can be sure that the answer will stay the same, or unless you tell them you don't care about numerical accuracy. For example: the -ffast-math option of gcc.

Is multiplication and division using shift operators in C actually faster?

74 votes

Multiplication and division can be achieved using bit operators, for example

i*2 = i<<1
i*3 = (i<<1) + i;
i*10 = (i<<3) + (i<<1)

and so on.

Is it actually faster to use say (i<<3)+(i<<1) to multiply with 10 than using i*10 directly? Is there any sort of input that can't be multiplied or divided in this way?

Short answer: Not likely.

Long answer: Your compiler has an optimizer in it that knows how to multiply as quickly as your target processor architecture is capable. Your best bet is to tell the compiler your intent clearly (i.e. i*2 rather than i << 1) and let it decide what the fastest assembly/machine code sequence is. It's even possible that the processor itself has implemented the multiply instruction as a sequence of shifts & adds in microcode.

Bottom line--don't spend a lot of time worrying about this. If you mean to shift, shift. If you mean to multiply, multiply. Do what is semantically clearest--your coworkers will thank you later. Or, more likely, curse you later if you do otherwise.

Protecting executable from reverse engineering?

68 votes

I've been contemplating how to protect my C/C++ code from disassembly and reverse engineering. Normally I would never condone this behavior myself in my code; however the current protocol I've been working on must not ever be inspected or understandable, for the security of various people.

Now this is a new subject to me, and the internet is not really resourceful for prevention against reverse engineering but rather depicts tons of information on how to reverse engineer

Some of the things I've thought of so far are:

  • Code injection (calling dummy functions before and after actual function calls)
  • Code obfustication (mangles the disassembly of the binary)
  • Write my own startup routines (harder for debuggers to bind to)

    void startup();  
    int _start()   
    {  
        startup( );  
        exit   (0)   
    }  
    void startup()  
    {  
        /* code here */  
    }
    
  • Runtime check for debuggers (and force exit if detected)

  • Function trampolines

     void trampoline(void (*fnptr)(), bool ping = false)  
     {  
       if(ping)  
         fnptr();  
       else  
         trampoline(fnptr, true);  
     }
    
  • Pointless allocations and deallocations (stack changes a lot)

  • Pointless dummy calls and trampolines (tons of jumping in disassembly output)
  • Tons of casting (for obfuscated disassembly)

I mean these are some of the things I've thought of but they can all be worked around and or figured out by code analysts given the right time frame. Is there anything else alternative I have?

What Amber said is exactly right. You can make reverse engineering harder, but you can never prevent it. You should never trust "security" that relies on the prevention of reverse engineering.

That said, the best anti-reverse-engineering techniques that I've seen focused not on obfuscating the code, but instead on breaking the tools that people usually use to understand how code works. Finding creative ways to break disassemblers, debuggers, etc is both likely to be more effective and also more intellectually satisfying than just generating reams of horrible spaghetti code. This does nothing to block a determined attacker, but it does increase the likelihood that J Random Cracker will wander off and work on something easier instead.

gcc compile error with >2GB of code

61 votes

I have a huge number of functions totaling around 2.8 GB of object code (unfortunately there's no way around, scientific computing ...)

When I try to link them, I get (expected) relocation truncated to fit: R_X86_64_32S errors, that I hoped to circumvent by specifing the compiler flag -mcmodel=medium. All libraries that are linked in addition that I have control of are compiled with the -fpic flag.

Still, the error persists and I assume that some libraries I link to are not compiled with PIC.

Here's the error:

/usr/lib/gcc/x86_64-redhat-linux/4.1.2/../../../../lib64/crt1.o: In function `_start':
(.text+0x12): relocation truncated to fit: R_X86_64_32S against symbol `__libc_csu_fini'     defined in .text section in /usr/lib64/libc_nonshared.a(elf-init.oS)
/usr/lib/gcc/x86_64-redhat-linux/4.1.2/../../../../lib64/crt1.o: In function `_start':
(.text+0x19): relocation truncated to fit: R_X86_64_32S against symbol `__libc_csu_init'    defined in .text section in /usr/lib64/libc_nonshared.a(elf-init.oS)
/usr/lib/gcc/x86_64-redhat-linux/4.1.2/../../../../lib64/crt1.o: In function `_start':
(.text+0x20): undefined reference to `main'
/usr/lib/gcc/x86_64-redhat-linux/4.1.2/../../../../lib64/crti.o: In function    `call_gmon_start':
(.text+0x7): relocation truncated to fit: R_X86_64_GOTPCREL against undefined symbol      `__gmon_start__'
/usr/lib/gcc/x86_64-redhat-linux/4.1.2/crtbegin.o: In function `__do_global_dtors_aux':
crtstuff.c:(.text+0xb): relocation truncated to fit: R_X86_64_PC32 against `.bss' 
crtstuff.c:(.text+0x13): relocation truncated to fit: R_X86_64_32 against symbol `__DTOR_END__' defined in .dtors section in /usr/lib/gcc/x86_64-redhat-linux/4.1.2/crtend.o
crtstuff.c:(.text+0x19): relocation truncated to fit: R_X86_64_32S against `.dtors'
crtstuff.c:(.text+0x28): relocation truncated to fit: R_X86_64_PC32 against `.bss'
crtstuff.c:(.text+0x38): relocation truncated to fit: R_X86_64_PC32 against `.bss'
crtstuff.c:(.text+0x3f): relocation truncated to fit: R_X86_64_32S against `.dtors'
crtstuff.c:(.text+0x46): relocation truncated to fit: R_X86_64_PC32 against `.bss'
crtstuff.c:(.text+0x51): additional relocation overflows omitted from the output
collect2: ld returned 1 exit status
make: *** [testsme] Error 1

and those are system libraries I link against: -lgfortran -lm -lrt -lpthread

Any clues where to look for the problem??

EDIT: First of all, thank you for the discussion ... To clarify a bit, I have hundreds of functions (each approx 1~MB in size in separate object files) like this:

double func1(std::tr1::unordered_map<int, double> & csc, 
             std::vector<EvaluationNode::Ptr> & ti, 
             ProcessVars & s)
{
    double sum, prefactor, expr;

    prefactor = +s.ds8*s.ds10*ti[0]->value();
    expr =       ( - 5/243.*(s.x14*s.x15*csc[49300] + 9/10.*s.x14*s.x15*csc[49301] +
           1/10.*s.x14*s.x15*csc[49302] - 3/5.*s.x14*s.x15*csc[49303] -
           27/10.*s.x14*s.x15*csc[49304] + 12/5.*s.x14*s.x15*csc[49305] -
           3/10.*s.x14*s.x15*csc[49306] - 4/5.*s.x14*s.x15*csc[49307] +
           21/10.*s.x14*s.x15*csc[49308] + 1/10.*s.x14*s.x15*csc[49309] -
           s.x14*s.x15*csc[51370] - 9/10.*s.x14*s.x15*csc[51371] -
           1/10.*s.x14*s.x15*csc[51372] + 3/5.*s.x14*s.x15*csc[51373] +
           27/10.*s.x14*s.x15*csc[51374] - 12/5.*s.x14*s.x15*csc[51375] +
           3/10.*s.x14*s.x15*csc[51376] + 4/5.*s.x14*s.x15*csc[51377] -
           21/10.*s.x14*s.x15*csc[51378] - 1/10.*s.x14*s.x15*csc[51379] -
           2*s.x14*s.x15*csc[55100] - 9/5.*s.x14*s.x15*csc[55101] -
           1/5.*s.x14*s.x15*csc[55102] + 6/5.*s.x14*s.x15*csc[55103] +
           27/5.*s.x14*s.x15*csc[55104] - 24/5.*s.x14*s.x15*csc[55105] +
           3/5.*s.x14*s.x15*csc[55106] + 8/5.*s.x14*s.x15*csc[55107] -
           21/5.*s.x14*s.x15*csc[55108] - 1/5.*s.x14*s.x15*csc[55109] -
           2*s.x14*s.x15*csc[55170] - 9/5.*s.x14*s.x15*csc[55171] -
           1/5.*s.x14*s.x15*csc[55172] + 6/5.*s.x14*s.x15*csc[55173] +
           27/5.*s.x14*s.x15*csc[55174] - 24/5.*s.x14*s.x15*csc[55175] +
           // ...
           ;

        sum += prefactor*expr;
    // ...
    return sum;
}

The object s is relatively small and keeps the needed constants x14, x15, ..., ds0, ..., etc while ti just returns a double from an external library. As you can see, csc[] is a precomputed map of values which is also evaluated in separate object files (again hundreds with about ~1 MB of size each) of the following form:

void cscs132(std::tr1::unordered_map<int,double> & csc, ProcessVars & s)
{
    {
    double csc19295 =       + s.ds0*s.ds1*s.ds2 * ( -
           32*s.x12pow2*s.x15*s.x34*s.mbpow2*s.mWpowinv2 -
           32*s.x12pow2*s.x15*s.x35*s.mbpow2*s.mWpowinv2 -
           32*s.x12pow2*s.x15*s.x35*s.x45*s.mWpowinv2 -
           32*s.x12pow2*s.x25*s.x34*s.mbpow2*s.mWpowinv2 -
           32*s.x12pow2*s.x25*s.x35*s.mbpow2*s.mWpowinv2 -
           32*s.x12pow2*s.x25*s.x35*s.x45*s.mWpowinv2 +
           32*s.x12pow2*s.x34*s.mbpow4*s.mWpowinv2 +
           32*s.x12pow2*s.x34*s.x35*s.mbpow2*s.mWpowinv2 +
           32*s.x12pow2*s.x34*s.x45*s.mbpow2*s.mWpowinv2 +
           32*s.x12pow2*s.x35*s.mbpow4*s.mWpowinv2 +
           32*s.x12pow2*s.x35pow2*s.mbpow2*s.mWpowinv2 +
           32*s.x12pow2*s.x35pow2*s.x45*s.mWpowinv2 +
           64*s.x12pow2*s.x35*s.x45*s.mbpow2*s.mWpowinv2 +
           32*s.x12pow2*s.x35*s.x45pow2*s.mWpowinv2 -
           64*s.x12*s.p1p3*s.x15*s.mbpow4*s.mWpowinv2 +
           64*s.x12*s.p1p3*s.x15pow2*s.mbpow2*s.mWpowinv2 +
           96*s.x12*s.p1p3*s.x15*s.x25*s.mbpow2*s.mWpowinv2 -
           64*s.x12*s.p1p3*s.x15*s.x35*s.mbpow2*s.mWpowinv2 -
           64*s.x12*s.p1p3*s.x15*s.x45*s.mbpow2*s.mWpowinv2 -
           32*s.x12*s.p1p3*s.x25*s.mbpow4*s.mWpowinv2 +
           32*s.x12*s.p1p3*s.x25pow2*s.mbpow2*s.mWpowinv2 -
           32*s.x12*s.p1p3*s.x25*s.x35*s.mbpow2*s.mWpowinv2 -
           32*s.x12*s.p1p3*s.x25*s.x45*s.mbpow2*s.mWpowinv2 -
           32*s.x12*s.p1p3*s.x45*s.mbpow2 +
           64*s.x12*s.x14*s.x15pow2*s.x35*s.mWpowinv2 +
           96*s.x12*s.x14*s.x15*s.x25*s.x35*s.mWpowinv2 +
           32*s.x12*s.x14*s.x15*s.x34*s.mbpow2*s.mWpowinv2 -
           32*s.x12*s.x14*s.x15*s.x35*s.mbpow2*s.mWpowinv2 -
           64*s.x12*s.x14*s.x15*s.x35pow2*s.mWpowinv2 -
           32*s.x12*s.x14*s.x15*s.x35*s.x45*s.mWpowinv2 +
           32*s.x12*s.x14*s.x25pow2*s.x35*s.mWpowinv2 +
           32*s.x12*s.x14*s.x25*s.x34*s.mbpow2*s.mWpowinv2 -
           32*s.x12*s.x14*s.x25*s.x35pow2*s.mWpowinv2 -
           // ...

       csc.insert(cscMap::value_type(192953, csc19295));
    }

    {
       double csc19296 =      // ... ;

       csc.insert(cscMap::value_type(192956, csc19296));
    }

    // ...
}

and that's about it. The final step then just consists in calling all those func[i] and summing the result up.

Concerning the fact that this is a rather special and unusual case: Yes it is. This is what people have to cope with when trying to do high precision computations for particle physics.

EDIT2: I should also add that x12, x13, etc are not really constants. They are set to specific values, all those functions are run and the result returned, and then a new set of x12, x13, etc is chosen to produce the next value. And this has to be done 10^5 to 10^6 times...

EDIT3: Thank you for the suggestions and the discussion so far... I'll try to roll the loops up upon code generation somehow, not sure how to this exactly, to be honest, but this is the best bet.

Btw, I didn't try to hide behind "this is scientific computing -- no way to optimize". It's just that the basis for this code is something that comes out of a "black box" where I have no real access to and, moreover, the whole thing worked great with simple examples and I mainly feel overwhelmed with what happens in a real world application ...

EDIT4: So, I have managed to reduce the code size of the csc definitions by about one forth by simplifying expressions in a computer algebra system (Mathematica). I see now also some way to reduce it by another order of magnitude or so by applying some other tricks before generating the code (which would bring this part down to about 100MB) and I hope this idea works.

Now related to your answers: I'm trying to roll the loops back up again in the funcs, where a CAS won't help much, but I have already some ideas. For instance, sorting the expressions by the variables like x12, x13,..., parse the cscs with Python and generate tables that relate them to each other. Then I can at least generate these parts as loops. As this seems to be the best solution so far, I mark this as the best answer.

However, I'd like to also give credit to VJo. GCC 4.6 indeed works much better, produces smaller code and is faster. Using the large model works at the code as-is. So technically this is the correct answer, but changing the whole concept is a much better approach.

Thank you all for your suggestions and help. If anyone is interested, I'm going to post the final outcome as soon as I am ready.

REMARKS: Just some remarks to some other answers: The code I'm trying to run does not originate in an expansion of simple functions/algorithms and stupid unnecessary unrolling. What actually happens is that the stuff we start with are pretty complicated mathematical objects and bringing them to a numerically computable form generates these expressions. The problem lies actually in the underlying physical theory. Complexity of intermediate expressions scales factorially, which is well known, but when combining all of this stuff to something physically measureable -- an observable -- it just boils down to only a handful of very small functions that form the basis of the expressions. (There is definitely something "wrong" in this respect with the general and only available ansatz which is called "perturbation theory") We try to bring this ansatz to another level, which is not feasible analytically anymore and where the basis of needed functions is not known. So we try to brute-force it like this. Not the best way, but hopefully one that helps with our understanding of the physics at hand in the end...

LAST EDIT: Thanks to all your suggestions, I've managed to reduce the code size considerably, using Mathematica and a modification of the code generator for the funcs somewhat along the lines of the top answer :)

I have simplified the csc functions with Mathematica, bringing it down to 92MB. This is the irreducible part. The first attempts took forever, but after some optimizations this now runs through in about 10mins on a single CPU.

The effect on the funcs was dramatic: The whole code size for them is down to approximately 9 MB, so the code now totals in the 100 MB range. Now it makes sense to turn optimizations on and the execution is quite fast.

Again, thank you all for your suggestions, I've learned a lot.

So, you already have a program that produces this text:

prefactor = +s.ds8*s.ds10*ti[0]->value();
expr = ( - 5/243.*(s.x14*s.x15*csc[49300] + 9/10.*s.x14*s.x15*csc[49301] +
       1/10.*s.x14*s.x15*csc[49302] - 3/5.*s.x14*s.x15*csc[49303] -...

and

double csc19295 =       + s.ds0*s.ds1*s.ds2 * ( -
       32*s.x12pow2*s.x15*s.x34*s.mbpow2*s.mWpowinv2 -
       32*s.x12pow2*s.x15*s.x35*s.mbpow2*s.mWpowinv2 -
       32*s.x12pow2*s.x15*s.x35*s.x45*s.mWpowinv2 -...

right?

If all your functions have a similar "format" (multiply n numbers m times and add the results - or something similar) then I think you can do this:

  • change the generator program to output offsets instead of strings (i.e. instead of the string "s.ds0" it will produce offsetof(ProcessVars, ds0)
  • create an array of such offsets
  • write an evaluator which accepts the array above and the base addresses of the structure pointers and produces an result

The array+evaluator will represent the same logic as one of your functions, but only the evaluator will be code. The array is "data" and can be either generated at runtime or saved on disk and read i chunks or with a memory mapped file.

For your particular example in func1 imagine how you would rewrite the function via an evaluator if you had access to the base address of s and csc and also a vector like representation of the constants and the offsets you need to add to the base addresses to get to x14, ds8 and csc[51370]

You need to create a new form of "data" that will describe how to process the actual data you pass to your huge number of functions.

Does const-correctness give the compiler more room for optimization?

33 votes

I know that it improves readability and makes the program less error-prone, but how much does it improve the performance?

And on a side note, what's the major difference between a reference and a const pointer? I would assume they're stored in the memory differently, but how so?

Thanks.

[Edit: OK so this question is more subtle than I thought at first.]

Declaring a pointer-to-const or reference-of-const never helps any compiler to optimize anything. (Although see the Update at the bottom of this answer.)

The const declaration only indicates how an identifier will be used within the scope of its declaration; it does not say that the underlying object can not change.

Example:

int foo(const int *p) {
    int x = *p;
    bar(x);
    x = *p;
    return x;
}

The compiler cannot assume that *p is not modified by the call to bar(), because p could be (e.g.) a pointer to a global int and bar() might modify it.

If the compiler knows enough about the caller of foo() and the contents of bar() that it can prove bar() does not modify *p, then it can also perform that proof without the const declaration.

But this is true in general. Because const only has an effect within the scope of the declaration, the compiler can already see how you are treating the pointer or reference within that scope; it already knows that you are not modifying the underlying object.

So in short, all const does in this context is prevent you from making mistakes. It does not tell the compiler anything it does not already know, and therefore it is irrelevant for optimization.

What about functions that call foo()? Like:

int x = 37;
foo(&x);
printf("%d\n", x);

Can the compiler prove that this prints 37, since foo() takes a const int *?

No. Even though foo() takes a pointer-to-const, it might cast the const-ness away and modify the int. (This is not undefined behavior.) Here again, the compiler cannot make any assumptions in general; and if it knows enough about foo() to make such an optimization, it will know that even without the const.

The only time const might allow optimizations is cases like this:

const int x = 37;
foo(&x);
printf("%d\n", x);

Here, to modify x through any mechanism whatsoever (e.g., by taking a pointer to it and casting away the const) is to invoke Undefined Behavior. So the compiler is free to assume you do not do that, and it can propagate the constant 37 into the printf(). This sort of optimization is legal for any object you declare const. (In practice, a local variable to which you never take a reference will not benefit, because the compiler can already see whether you modify it within its scope.)

To answer your "side note" question, (a) a const pointer is a pointer; and (b) a const pointer can equal NULL. You are correct that the internal representation (i.e. an address) is most likely the same.

[update]

As Christoph points out in the comments, my answer is incomplete because it does not mention restrict.

Section 6.7.3.1 (4) of the C99 standard says:

During each execution of B, let L be any lvalue that has &L based on P. If L is used to access the value of the object X that it designates, and X is also modified (by any means), then the following requirements apply: T shall not be const-qualified. ...

(Here B is a basic block over which P, a restrict-pointer-to-T, is in scope.)

So if a C function foo() is declared like this:

foo(const int * restrict p)

...then the compiler may assume that no modifications to *p occur during the lifetime of p -- i.e., during the execution of foo() -- because otherwise the Behavior would be Undefined.

So in principle, combining restrict with a pointer-to-const could enable both of the optimizations that are dismissed above. Do any compilers actually implement such an optimization, I wonder? (GCC 4.5.2, at least, does not.)

Note that restrict only exists in C, not C++ (not even C++0x), except as a compiler-specific extension.

Optimizations for pow() with const non-integer exponent?

32 votes

I have hot spots in my code where I'm doing pow() taking up around 10-20% of my execution time.

My input to pow(x,y) is very specific, so I'm wondering if there's a way to roll two pow() approximations (one for each exponent) with higher performance:

  • I have two constant exponents: 2.4 and 1/2.4.
  • When the exponent is 2.4, x will be in the range (0.090473935, 1.0].
  • When the exponent is 1/2.4, x will be in the range (0.0031308, 1.0].
  • I'm using SSE/AVX float vectors. If platform specifics can be taken advantage of, right on!

A maximum error rate around 0.01% is ideal, though I'm interested in full precision (for float) algorithms as well.

I'm already using a fast pow() approximation, but it doesn't take these constraints into account. Is it possible to do better?

Another answer because this is very different from my previous answer, and this is blazing fast. Relative error is 3e-8. Want more accuracy? Add a couple more Chebychev terms. It's best to keep the order odd as this makes for a small discontinuity between 2^n-epsilon and 2^n+epsilon.

#include <stdlib.h>
#include <math.h>

// Returns x^(5/12) for x in [1,2), to within 3e-8 (relative error).
// Want more precision? Add more Chebychev polynomial coefs.
double pow512norm (
   double x)
{
   static const int N = 8;

   // Chebychev polynomial terms.
   // Non-zero terms calculated via
   //   integrate (2/pi)*ChebyshevT[n,u]/sqrt(1-u^2)*((u+3)/2)^(5/12)
   //   from -1 to 1
   // Zeroth term is similar except it uses 1/pi rather than 2/pi.
   static const double Cn[N] = { 
       1.1758200232996901923,
       0.16665763094889061230,
      -0.0083154894939042125035,
       0.00075187976780420279038,
      // Wolfram alpha doesn't want to compute the remaining terms
      // to more precision (it times out).
      -0.0000832402,
       0.0000102292,
      -1.3401e-6,
       1.83334e-7};

   double Tn[N];

   double u = 2.0*x - 3.0;

   Tn[0] = 1.0;
   Tn[1] = u;
   for (int ii = 2; ii < N; ++ii) {
      Tn[ii] = 2*u*Tn[ii-1] - Tn[ii-2];
   }   

   double y = 0.0;
   for (int ii = N-1; ii >= 0; --ii) {
      y += Cn[ii]*Tn[ii];
   }   

   return y;
}


// Returns x^(5/12) to within 3e-8 (relative error).
double pow512 (
   double x)
{
   static const double pow2_512[12] = {
      1.0,
      pow(2.0, 5.0/12.0),
      pow(4.0, 5.0/12.0),
      pow(8.0, 5.0/12.0),
      pow(16.0, 5.0/12.0),
      pow(32.0, 5.0/12.0),
      pow(64.0, 5.0/12.0),
      pow(128.0, 5.0/12.0),
      pow(256.0, 5.0/12.0),
      pow(512.0, 5.0/12.0),
      pow(1024.0, 5.0/12.0),
      pow(2048.0, 5.0/12.0)
   };

   double s;
   int iexp;

   s = frexp (x, &iexp);
   s *= 2.0;
   iexp -= 1;

   div_t qr = div (iexp, 12);
   if (qr.rem < 0) {
      qr.quot -= 1;
      qr.rem += 12;
   }

   return ldexp (pow512norm(s)*pow2_512[qr.rem], 5*qr.quot);
}

Addendum: What's going on here?
Per request, the following explains how the above code works.

Overview
The above code defines two functions, double pow512norm (double x) and double pow512 (double x). The latter is the entry point to the suite; this is the function that user code should call to calculate x^(5/12). The function pow512norm(x) uses Chebyshev polynomials to approximate x^(5/12), but only for x in the range [1,2]. (Use pow512norm(x) for values of x outside that range and the result will be garbage.)

The function pow512(x) splits the incoming x into a pair (double s, int n) such that x = s * 2^n and such that 1≤s<2. A further partitioning of n into (int q, unsigned int r) such that n = 12*q + r and r is less than 12 lets me split the problem of finding x^(5/12) into parts:

  1. x^(5/12)=(s^(5/12))*((2^n)^(5/12)) via (u*v)^a=(u^a)*(v^a) for positive u,v and real a.
  2. s^(5/12) is calculated via pow512norm(s).
  3. (2^n)^(5/12)=(2^(12*q+r))^(5/12) via substitution.
  4. 2^(12*q+r)=(2^(12*q))*(2^r) via u^(a+b)=(u^a)*(u^b) for positive u, real a,b.
  5. (2^(12*q+r))^(5/12)=(2^(5*q))*((2^r)^(5/12)) via some more manipulations.
  6. (2^r)^(5/12) is calculated by the lookup table pow2_512.
  7. Calculate pow512norm(s)*pow2_512[qr.rem] and we're almost there. Here qr.rem is the r value calculated in step 3 above. All that is needed is to multiply this by 2^(5*q) to yield the desired result.
  8. That is exactly what the math library function ldexp does.

Function Approximation
The goal here is to come up with an easily computable approximation of f(x)=x^(5/12) that is 'good enough' for the problem at hand. Our approximation should be close to f(x) in some sense. Rhetorical question: What does 'close to' mean? Two competing interpretations are minimizing the mean square error versus minimizing the maximum absolute error.

I'll use a stock market analogy to describe the difference between these. Suppose you want to save for your eventual retirement. If you are in your twenties, the best thing to do is to invest in stocks or stock market funds. This is because over a long enough span of time, the stock market on average beats any other investment scheme. However, we've all seen times when putting money into stocks is a very bad thing to do. If you are in your fifties or sixties (or forties if you want to retire young) you need to invest a bit more conservatively. Those downswings can wreak have on your retirement portfolio.

Back to function approximation: As the consumer of some approximation, you are typically worried about the worst-case error rather than the performance "on average". Use some approximation constructed to give the best performance "on average" (e.g. least squares) and Murphy's law dictates that your program will spend a whole lot of time using the approximation exactly where the performance is far worse than average. What you want is a minimax approximation, something that minimizes the maximum absolute error over some domain. A good math library will take a minimax approach rather than a least squares approach because this lets the authors of the math library give some guaranteed performance of their library.

Math libraries typically use a polynomial or a rational polynomial to approximate some function f(x) over some domain a≤x≤b. Suppose the function f(x) is analytic over this domain and you want to approximate the function by some polynomial p(x) of degree N. For a given degree N there exists some magical, unique polynomial p(x) such that p(x)-f(x) has N+2 extrema over [a,b] and such that the absolute values of these N+2 extrema are all equal to one another. Finding this magical polynomial p(x) is the holy grail of function approximators.

I did not find that holy grail for you. I instead used a Chebyshev approximation. The Chebyshev polynomials of the first kind are an orthogonal (but not orthonormal) set of polynomials with some very nice features when it comes to function approximation. The Chebyshev approximation oftentimes is very close to that magical polynomial p(x). (In fact, the Remez exchange algorithm that does find that holy grail polynomial typically starts with a Chebyshev approximation.)

pow512norm(x)
This function uses Chebyshev approximation to find some polynomial p*(x) that approximates x^(5/12). Here I'm using p*(x) to distinguish this Chebyshev approximation from the magical polynomial p(x) described above. The Chebyshev approximation p*(x) is easy to find; finding p(x) is a bear. The Chebyshev approximation p*(x) is sum_i Cn[i]*Tn(i,x), where the Cn[i] are the Chebyshev coefficients and Tn(i,x) are the Chebyshev polynomials evaluated at x.

I used Wolfram alpha to find the Chebyshev coefficients Cn for me. For example, this calculates Cn[1]. The first box after the input box has the desired answer, 0.166658 in this case. That's not as many digits as I would like. Click on 'more digits' and voila, you get a whole lot more digits. Wolfram alpha is free; there is a limit on how much computation it will do. It hits that limit on higher order terms. (If you buy or have access to mathematica you will be able to calculate those high-order coefficients to a high degree of precision.)

The Chebyshev polynomials Tn(x) are calculated in the array Tn. Beyond giving something very close to magical polynomial p(x), another reason for using Chebyshev approximation is that the values of those Chebyshev polynomials are easily calculated: Start with Tn[0]=1 and Tn[1]=x, and then iteratively calculate Tn[i]=2*x*Tn[i-1] - Tn[i-2]. (I used 'ii' as the index variable rather than 'i' in my code. I never use 'i' as a variable name. How many words in the English language have an 'i' in the word? How many have two consecutive 'i's?)

pow512(x)
pow512 is the function that user code should be calling. I already described the basics of this function above. A few more details: The math library function frexp(x) returns the significand s and exponent iexp for the input x. (Minor issue: I want s between 1 and 2 for use with pow512norm but frexp returns a value between 0.5 and 1.) The math library function div returns the quotient and remainder for integer division in one swell foop. Finally, I use the math library function ldexp to put the three parts together to form the final answer.

Is memory allocation a system call?

27 votes

Is memory allocation a system call? For example, malloc and new. Is the heap shared by different processes and managed by the OS. What about private heap? If memory allocation in the heap is managed by the OS, how expensive is this?

I would also like to have some link to places where I can read more about this topic.

In general, malloc and new do not perform a system call at each invocation. However, they use a lower-level mechanism to allocate large pages of memory. On Windows, the lower mechanism is VirtualAlloc(). I believe on POSIX systems, this is somewhat equivalent to mmap(). Both of these perform a system call to allocate memory to the process at the OS level. Subsequent allocations will use smaller parts of those large pages without incurring a system call.

The heap is normally inner-process and is not shared between processes. If you need this, most OSes have an API for allocating shared memory. A portable wrapper for these APIs is available in the Boost.Interprocess library.

If you would like to learn more about memory allocation and relationship with the OS, you should take a look at a good book on operating systems. I always suggest Modern Operating Systems by Andrew S. Tanenbaum as it is very easy to read.

Are (bool)(i & 1) and i % 2 == 1 same?

26 votes

Are (bool)(i & 1) and i % 2 == 1 always same where i is int?

Note: saying always I mean for all platforms (even when a byte is 16 bit) and for all standards of C and C++.

Edit:

For all standards of C and C++ where bool exist.

No.

1s' complement representation of int, the representation of -1 is 1 ... 10, so they differ.

Anyway, i % 2 can be negative for negative i (indeed it's required to be in C99 when it's not 0), and hence not equal to 1 for negative odd numbers.

sizeof taking two arguments

25 votes

In C.1.3 of the C++ IS (2003. It's in the C++11 IS, too), the standard points out a difference between ISO C and C++; namely, for

char arr[100];

sizeof(0, arr) returns sizeof(char*) in C, but 100 in C++.

I can find no documentation for sizeof taking two arguments. The obvious fallback is the comma operator, but I don't think so: sizeof(arr) in C is 100; sizeof(0, arr) is sizeof(char*). Both sizeof(0, arr) and sizeof(arr) are 100 in C++.

I may be missing the whole point of the IS in this context. Can anyone help? This is similar to a question discussed back in '09, but no one referred to the IS, and I don't think the correct answer was given.


Edit: Actually, the IS is talking about the comma operator. So, for some reason (0, arr) returns a char* in C, but a char[100] in C++. Why?

In C then the array is decaying to a pointer, because of the different specification of the comma operator with relation to rvalues and lvalues (not the only place such a difference can be found). In C++ then the array stays an array, yielding the correct result.

What does the tilde (~) in macros mean?

25 votes

Seen on this site, the code shows macro invocations using a tilde in parentheses:

HAS_COMMA(_TRIGGER_PARENTHESIS_ __VA_ARGS__ (~))
//                                          ^^^

What does it mean / do? I suspect it to just be an empty argument, but I'm not sure. Is it maybe specific to C(99) like the __VA_ARGS__ is specific to C99 and existent in C++?

On the introduction page of Boost.Preprocessor, an example is given in A.4.1.1 Horizontal Repetition

#define TINY_print(z, n, data) data

#define TINY_size(z, n, unused)                                 \
  template <BOOST_PP_ENUM_PARAMS(n, class T)>                   \
  struct tiny_size<                                             \
      BOOST_PP_ENUM_PARAMS(n,T)                                 \
      BOOST_PP_COMMA_IF(n)                                      \
      BOOST_PP_ENUM(                                            \
          BOOST_PP_SUB(TINY_MAX_SIZE,n), TINY_print, none)      \
  >                                                             \
    : mpl::int_<n> {};

BOOST_PP_REPEAT(TINY_MAX_SIZE, TINY_size, ~) // Oh! a tilde!

#undef TINY_size
#undef TINY_print

An explanation is provided below:

The code generation process is kicked off by calling BOOST_PP_REPEAT, a higher-order macro that repeatedly invokes the macro named by its second argument (TINY_size). The first argument specifies the number of repeated invocations, and the third one can be any data; it is passed on unchanged to the macro being invoked. In this case, TINY_size doesn't use that data, so the choice to pass ~ was arbitrary. [5]

(emphasis mine)

And there is the note:

[5] ~ is not an entirely arbitrary choice. Both @ and $ might have been good choices, except that they are technically not part of the basic character set that C++ implementations are required to support. An identifier like ignored might be subject to macro expansion, leading to unexpected results.

The tilde, therefore, is simply a place holder because an argument is required, but none is necessary. Since any user-defined identifier wannabe could be expanded, you need to use something else.

It turns out that ~ is pretty much unused (binary negation is not that often called) in comparison to + or - for example, so there is little chance of confusion. Once you've settled on this, using it consistently gives it a new meaning to the tilde; like using operator<< and operator>> for streaming data has become a C++ idiom.

Can different optimization levels lead to functionally different code?

20 votes

I am curious about the liberties that a compiler has when optimizing. Let's limit this question to GCC and C/C++ (any version, any flavour of standard):

Is it possible to write code which behaves differently depending on which optimization level it was compiled with?

The example I have in mind is printing different bits of text in various constructors in C++ and getting a difference depending on whether copies are elided (though I've not been able to make such a thing work).

Counting clock cycles is not permitted. If you have an example for a non-GCC compiler, I'd be curious, too, but I can't check it. Bonus points for an example in C. :-)

Edit: The example code should be standard compliant and not contain undefined behaviour from the outset.

Edit 2: Got some great answers already! Let me up the stakes a bit: The code must constitute a well-formed program and be standards-compliant, and it must compile to correct, deterministic programs in every optimization level. (That excludes things like race-conditions in ill-formed multithreaded code.) Also I appreciate that floating point rounding may be affected, but let's discount that.

I just hit 800 reputation, so I think I shall blow 50 reputation as bounty on the first complete example to conform to (the spirit) of those conditions; 25 if it involves abusing strict aliasing. (Subject to someone showing me how to send bounty to someone else.)

The portion of the C++ standard that applies is §1.9 "Program execution". It reads, in part:

conforming implementations are required to emulate (only) the observable behavior of the abstract machine as explained below. ...

A conforming implementation executing a well-formed program shall produce the same observable behavior as one of the possible execution sequences of the corresponding instance of the abstract machine with the same program and the same input. ...

The observable behavior of the abstract machine is its sequence of reads and writes to volatile data and calls to library I/O functions. ...

So, yes, code may behave differently at different optimization levels, but (assuming that all levels produce a conforming compiler), but they cannot behave observably differently.

EDIT: Allow me to correct my conclusion: Yes, code may behave differently at different optimization levels as long as each behavior is observably identical to one of the behaviors of the standard's abstract machine.

strcmp for empty string

17 votes

I was reviewing some code and I saw someone do a

if (0 == strcmp(foo,""))

I am curious because I think it would be faster to do a

if (foo[0] == '\0')

Is this correct or is strcmp optimized enough to make them the same.

(I realize that even if there was some difference it would be small, but am thinking you save at least a few instructions by using my method.)

You're right: since calling strcmp() adds up the stack management and the memory jump to the actual strcmp instructions, you'll gain a few instructions by just checking the first byte of your string.

For your curiosity, you can check the strcmp() code here: http://sourceware.org/git/?p=glibc.git;a=blob;f=string/strcmp.c;h=bd53c05c6e21130b091bd75c3fc93872dd71fe4b;hb=HEAD

(I thought the code would be filled with #ifdef and obscure __GNUSOMETHING, but it's actually rather simple!)

c & c++ default global variable linkage, multiple declaration & definition problem

17 votes

For example:

code1.c / .cpp

int a;

// ... and so on

code2.c / .cpp

int a;

int main(void) {
    return 0;
}

go to compile:

$gcc code1.c code2.c      # this is fine
$

$g++ code1.cpp code2.cpp  # this is dead
/tmp/ccLY66HQ.o:(.bss+0x0): multiple definition of `a'
/tmp/ccnIOmPC.o:(.bss+0x0): first defined here
collect2: ld returned 1 exit status

Is there any global variable linkage difference between C & C++?

It's not strictly legal. int a; is a tentative definition in C. You are allowed multiple tentative definitions and at most one non-tentative definition per translation unit of each object with external linkage in C, but only one definition across all translation units in a program.

It is a commonly implemented extension to allow tentative definitions across multiple translation units in C so long as not more than one translation unit contains a non-tentative definition, but it's not strictly standard.

In C++ int a; is just a definition - there's no concept of tentative - and it's still illegal to have multiple definitions of an object across the translation units of a program.

For the C case, you may wish to look at this question.

If volatile is useless for threading, why do atomic operations require pointers to volatile data?

17 votes

I've been reading from many sources that the volatile keyword is not helpful in multithreaded scenarios. However, this assertion is constantly challenged by atomic operation functions that accept volatile pointers.

For instance, on Mac OS X, we have the OSAtomic function family:

SInt32 OSIncrementAtomic(volatile SInt32 *address);
SInt32 OSDrecrementAtomic(volatile SInt32 *address);
SInt32 OSAddAtomic(SInt32 amount, volatile SInt32 *address);
// ...

And it seems that there is a similar usage of the volatile keyword on Windows for Interlocked operations:

LONG __cdecl InterlockedIncrement(__inout LONG volatile *Addend);
LONG __cdecl InterlockedDecrement(__inout LONG volatile *Addend);

It also seems that in C++11, atomic types have methods with the volatile modifier, which must somehow mean that the volatile keyword has some kind of relationship with atomicity.

So, what am I missing? Why do OS vendors and standard library designers insist on using the volatile keyword for threading purposes if it's not useful?

It suddenly came to me that I simply misinterpreted the meaning of volatile*. Much like const* means the pointee shouldn't change, volatile* means that the pointee shouldn't be cached. This is an additional constraint that can be freely added: as much as you can cast a char* to a const char*, you can cast an int* to a volatile int*.

So applying the volatile modifier to the pointees simply ensures that atomic functions can be used on volatile variables. For non-volatile variables, the cast is implicit. My mistake was to interpret the presence of the keyword in the prototypes as an incentive to use it rather than as a convenience to those using it.

Is "inline" without "static" or "extern" ever useful in C99?

16 votes

When I try to build this code

inline void f() {}

int main()
{
    f();
}

using the command line

gcc -std=c99 -o a a.c

I get a linker error (undefined reference to f). The error vanishes if I use static inline or extern inline instead of just inline, or if I compile with -O (so the function is actually inlined).

This behaviour seems to be defined in paragraph 6.7.4 (6) of the C99 standard:

If all of the file scope declarations for a function in a translation unit include the inline function specifier without extern, then the definition in that translation unit is an inline definition. An inline definition does not provide an external definition for the function, and does not forbid an external definition in another translation unit. An inline definition provides an alternative to an external definition, which a translator may use to implement any call to the function in the same translation unit. It is unspecified whether a call to the function uses the inline definition or the external definition.

If I understand all this correctly, a compilation unit with a function defined inline as in the above example only compiles consistently if there is also an external function with the same name, and I never know if my own function or the external function is called.

Isn't this behaviour completely daft? Is it ever useful to define a function inline without static or extern in C99? Am I missing something?

Summary of answers

Of course I was missing something, and the behaviour isn't daft. :)

As Nemo explains, the idea is to put the definition of the function

inline void f() {}

in the header file and only a declaration

extern inline void f();

in the corresponding .c file. Only the extern declaration triggers the generation of externally visible binary code. And there is indeed no use of inline in a .c file -- it's only useful in headers.

As the rationale of the C99 committee quoted in Jonathan's answer explicates, inline is all about compiler optimisations that require the definition of a function to be visible at the site of a call. This can only be achieved by putting the definition in the header, and of course a definition in a header must not emit code everytime it is seen by the compiler. But since the compiler is not forced to actually inline a function, an external definition must exist somewhere.

Actually this excellent answer also answers your question, I think:

extern inline

The idea is that "inline" can be used in a header file, and then "extern inline" in a .c file. "extern inline" is just how you instruct the compiler which object file should contain the (externally visible) generated code.

[update, to elaborate]

I do not think there is any use for "inline" (without "static" or "extern") in a .c file. But in a header file it makes sense, and it requires a corresponding "extern inline" declaration in some .c file to actually generate the stand-alone code.

Recursion without recursive call?

14 votes

Found this on /prog/. I actually GDB'd it, and yes, it was truly a recursion. But how did it happen?

// This works on 32-bit x86 Linux with gcc as long as you don't enable optimization.

#include <stdio.h>
#include <stdlib.h>

static void factorial(int in, int *out)
{
  *(&in-1)-=5-5*(1/in);
  *out*=in--;
}

int main(int argc, char **argv) 
{
  int result=1;
  int number=0;

  if (argc!=2) 
    exit(1);

  number=atoi(argv[1]);
  if (number<1)
    exit(2);

  factorial(number, &result);
  printf("%d! = %d\n", number, result);
  return 0;
}


$ ./factorial 3
3! = 6

$ ./factorial 5
5! = 120

Sweet. ;)

This is extremely non-portable code that works only on x86. What it's doing is changing the return address on the stack so that if in>1, the function returns not to the instruction following the call instruction, but to the call instruction itself. A call instruction on x86 is five bytes (one opcode plus the 4-byte address of the call destination), so five needs to be subtracted from the return address.

This

*(&in-1)-=5-5*(1/in);

is just an obfuscated way of saying

if(in>1)
    *(&in-1)-=5;

And &in-1 is where the return address lives on the stack.

Interview Question... Trying to work it out, but couldn't get an efficient solution...

14 votes

I am stuck in one interview question.. The question is,

*given two arrays A and B. A has integers unsorted. B has the same length as A and its values are in the set {-1,0,1}

you have to return an array C with the following processing on A.

if B[i] has 0 then C[i] must have A[i]
if B[i] has -1 then A[i] must be in C within the sub array C[0] - C[i-1] ie. left subarray
if B[i] has 1 then A[i] must be in C within the sub array C[i+1] - C[length(A)] ie right subarray.

if no such solution exists then printf("no solution");*

I applied following logics:-

int indMinus1 = n-1;
int indPlus1 = 0;

//while(indPlus1 < n && indMinus1 > 0)
while(indPlus1 < indMinus1)
{
    while(b[indMinus1] != -1)   {
        if(b[indMinus1] == 0)
            c[indMinus1] = a[indMinus1];
        indMinus1--;
    }
    while(b[indPlus1] != +1)    {
        if(b[indPlus1] == 0)
            c[indPlus1] = a[indPlus1];
        indPlus1++;
    }

    c[indMinus1] = a[indPlus1];
    c[indPlus1] = a[indMinus1];
    b[indMinus1] = 0;
    b[indPlus1] = 0;
    indMinus1--;
    indPlus1++;
}

But this will not going to work,, for some cases like {1,2,3} >> {1,-1,-1}... One output is possible i.e. {2,3,1};

Please help.... does their any algorithm technique available for this problem?

Correct Solution Code

int arrange(int a[], int b[], int c[], int n)
{

for (int i = 0; i < n; ++i) {
    if(b[i] == 0)
        c[i] = a[i];
}

int ci = 0;
for (int i = 0; i < n; ++i) {
    if(b[i] == -1)  {
        while(c[ci] != 0 && ci < i)
            ci ++;
        if(c[ci] != 0 || ci >= i)
            return -1;
        c[ci] = a[i];
        ci++;
    }
}

for (int i = 0; i < n; ++i) {
    if(b[i] == 1)   {
        while(c[ci] != 0 && ci < n)
            ci ++;
        if(c[ci] != 0 || ci <= i)
            return -1;
        c[ci] = a[i];
        ci++;
    }
    }
    return 0;
}

I suggest the following algorithm:
1. Initially consider all C[ i ] as empty nests.
2. For each i where B[ i ] = 0 we put C[ i ] = A[ i ]
3. Go through array from left to right, and for each i where B[ i ] = -1 put
C[ j ] = A[ i ], where 0 <= j < i is the smallest index for which C[ j ] is still empty. If no such index exists, the answer is impossible.
4. Go through array from right to left, and for each i where B[ i ] = 1 put
C[ j ] = A[ i ], where i < j < n is the greatest index for which C[ j ] is still empty. If no such index exists, the answer is impossible.

Why do we put A[ i ] to the leftmost position in step 2 ? Well, we know that we must put it to some position j < i. On the other hand, putting it leftmost will increase our changes to not get stucked in step 3. See this example for illustration:

A: [ 1, 2, 3 ]
B: [ 1, 1,-1 ]

Initially C is empty: C:[ _, _, _ ]
We have no 0-s, so let's pass to step 2.
We have to choose whether to place element A[ 2 ] to C[ 0 ] or to C[ 1 ].
If we place it not leftmost, we will get the following situation:
C: [ _, 3, _ ]
And... Oops, we are unable to arrange elements A[ 0 ] and A[ 1 ] due to insufficient place :(
But, if we put A[ 2 ] leftmost, we will get
C: [ 3, _, _ ], And it is pretty possible to finish the algorithm with
C: [ 3, 1, 2 ] :)

Complexity:
What we do is pass three times along the array, so the complexity is O(3n) = O(n) - linear.

Further example:

A: [ 1, 2, 3 ]
B: [ 1, -1, -1 ]

Let's go through the algorithm step by step:
1. C: [ _, _, _ ]
2. Empty, because no 0-s in B
3. Putting A[ 1 ] and A[ 2 ] to leftmost empty positions:

C: [ 2, 3, _ ]

4. Putting A[ 0 ] to the rightmost free (in this example the only one) free position:

C: [ 2, 3, 1 ]

Which is the answer.

Source code:

#include <iostream>
#include <string>
#include <vector>

using namespace std;


vector< int > a;
vector< int > b;
vector< int > c;
int n;

bool solve ()
{
    int i;
    for( i = 0; i < n; ++i )
        c[ i ] = -1;
    for( i = 0; i < n; ++i )
        if( b[ i ] == 0 )
            c[ i ] = a[ i ];
    int leftmost = 0;
    for( i = 0; i < n; ++i )
        if( b[ i ] == -1 )
        {
            for( ; leftmost < i && c[ leftmost ] != -1; ++leftmost ); // finding the leftmost free cell
            if( leftmost >= i )
                return false; // not found
            c[ leftmost++ ] = a[ i ];
        }
    int rightmost = n - 1;
    for( i = n - 1; i >= 0; --i )
        if( b[ i ] == 1 )
        {
            for( ; rightmost > i && c[ rightmost ] != -1; --rightmost ); // finding the rightmost free cell
            if( rightmost <= i )
                return false; // not found;
            c[ rightmost-- ] = a[ i ];
        }
    return true;
}


int main ()
{
    cin >> n;
    a.resize(n);
    b.resize(n);
    c.resize(n);
    int i;
    for( i = 0; i < n; ++i )
        cin >> a[ i ];
    for( i = 0; i < n; ++i )
        cin >> b[ i ];
    if( !solve() )
        cout << "Impossible";
    else
        for( i = 0; i < n; ++i )
            cout << c[ i ] << ' ';
    cout << endl;
    return 0;
}

Whats possible in a for loop

13 votes

So today I went to an interview and one of the questions was the following (C# context).

//Print the output for the following code:
for (int i = 10, j = 0; j <= 10; j++, i--)
{
    if (i > j)
        Console.WriteLine(j.ToString());
}

I have never seen such a construct before and having asked my colleagues, 4 of 5 at my workplace didn't know either (Perhaps more a reflection on us but I digress). Using some basic logic, I was able to answer the question correctly but this knowledge has radically altered my understanding of how for loops can be structured.

So I guess my question boils down to this.

  1. Do all C syntax based languages implement this functionality? IE: C, C++, Java, javascript etc.
  2. Where does this syntax stem from?
  3. Are there any other "not well known" structures that a for loop can take?
  4. Is writing code like above considered bad practice given how hard it is to read?
  5. Are there any good real world examples where such a structure is required?

for (statement1; statement2; statement3)
{
     /* body */
}

(1) First the statement1 is executed.

(2) Next statement2 is executed.

(3) If the evaluation of statement2 is true then the body is executed

(4) Then statement3 is executed.

(5) Repeat from step (2)

          |                  +<-----------------+
          |                  |                  ^
          V                  V                  |
 for (  (s1); -------->(s2 true? | false?);    (s3) )
 {                           |       |          ^
                             |       |          |
                             |       |          |
                             V       |          |
                          (body)-----|--------->+
 }                                   |
                                     |
                                     V
                                 (come out)

The structure you have shown is the same normal structure as above. The statement n could be any statement. In your example, you have separated by comma operators in statement1 and statement3. You can separate any number of statements by comma operators.

Generally for loops are used with the statement1 with initialization as it is executed only once. The statement2 is used for the loop termination condition checking, because the evaluation value of this statement is used to decide if to enter the body of break out. And the statement3 is used for update of the loop termination variable as it is executed after the body. But generally they could be used in any way.

First statement1 is i=10, j=0; this initializes the variables. Next in the statement2 is j <= 10 if this is true then the body is executed. After the body is executed, statement3 which is i--,j++ is executed. The loop will iterate 11 times 0 to 10. But will print 5 times, as at one point i and j will become same and the if (i > j) will evaluate false.

EDIT Here is an example where it might be used, not much practical but a sample use, to check for a palindrome string.

  int i, j, n, flag;
  char str[128];

  printf ("\nEnter string: ");
  scanf ("%s", &str);
  n = strlen (str);


for (flag=1, i=n-1, j=0; j<n/2; j++, i--)
{
  if (str[i] != str[j])
  {
    flag = 0;
    break;
  }
}

if (flag)
 printf ("\n\"%s\" is a palindrome");
else
 printf ("\n\"%s\" is not a palindrome");

We should always try to write code which is easy to read and which does not create confusion. This helps the code writer as well as others who read the code.

Existing threadpool C implementation

13 votes

What open-source implementation(s) in C for a pthreads thread pool would you recommend ?

Additional points if this implementation is :

  • Light-weight: glib, APR, NSPR and others come with a big buy-in, I'd rather have just 2 files (header and implementation).
  • Tested on several platforms (Linux, BSD, Mac OS X, etc.).
  • Still maintained.

I worked on making something I'd be able to use and I've published it on github: it's unimaginably called threadpool.

Python equivialent of C programming techniques (while loops)

12 votes

In the C programming language, I often have done the following:

while ((c = getch()) != EOF) {
 /* do something with c */
}

In Python, I have not found anything similar, since I am not allowed to set variables inside the evaluated expression. I usually end up with having to setup the evaluated expression twice!

c = sys.stdin.read(1)
while not (c == EOF):
 # Do something with c
 c = sys.stdin.read(1)

In my attempts to find a better way, I've found a way that only require to setup and the evaluated expression once, but this is getting uglier...

while True:
 c = sys.stdin.read(1)
 if (c == EOF): break
 # do stuff with c

So far I've settled with the following method for some of my cases, but this is far from optimal for the regular while loops...:

class ConditionalFileObjectReader:
 def __init__(self,fobj, filterfunc):
  self.filterfunc = filterfunc
  self.fobj = fobj
 def __iter__(self):
  return self
 def next(self):
  c = self.fobj.read(1)
  if self.filterfunc(c): raise StopIteration
  return c

for c in ConditionalFileObjectReader(sys.stdin,lambda c: c == EOF):
 print c

All my solutions to solve a simple basic programming problem has become too complex... Do anyone have a suggestion how to do this the proper way?

It's possible to write much simpler code in place of your ConditionalFileObjectReader, considering that EOF seems to be what you care about, rather than any arbitrary condition:

def readbytes(file):
    while True:
        c = file.read(1)
        if c == '':
            return
        yield c

for c in readbytes(sys.stdin):
    print c

So you still have 'while True ... break', which seems to be the preferred loop in Python[*], but at least you only have it once to solve the whole class of problem, "how to iterate over the bytes in a file-like object without blocking/buffering each line", and you have it in a short loop that doesn't "do stuff with c" - that's a separate concern.

Inspired by Wallacoloo's example with iter, similar to the above you could produce something more general than iter:

def until(nextvalue, pred):
    while True:
        value = nextvalue()
        if pred(value):
            return
        yield value

for c in until(lambda: sys.stdin.read(1), lambda x: x == ''):
    print c

I'm not sure whether I like this or not, but might be worth playing with. It tries to solve the general problem "iterate over the return values of some function, until a return value satisfies some condition".

[*] dare I say, the Pythonic equivalent of fancy loop syntax in other languages?