Best c++ questions in June 2011

Can a local variable's memory be accessed outside its scope?!

281 votes

Possible Duplicate:
Returning the address of local or temporary variable

I have the following code.

int * foo()
{
    int a = 5;
    return &a;
}

int main()
{
    int* p = foo();
    cout << *p;
    *p = 8;
    cout << *p;
}

And the code is just running with no runtime exceptions!

The output was 5 8

How can it be? Isn't the memory of a local variable inaccessible outside its function?

How can it be? Isn't the memory of a local variable inaccessible outside its function?

You rent a hotel room. You put a book in the top drawer of the bedside table and go to sleep. You check out the next morning, but "forget" to give back your key. You steal the key!

A week later, you return to the hotel, do not check in, sneak into your old room with your stolen key, and look in the drawer. Your book is still there. Astonishing!

How can that be? Isn't the contents of a hotel room drawer inaccessible if you haven't rented the room?

Well, obviously that scenario can happen in the real world no problem. There is no mysterious force that causes your book to disappear when you are no longer authorized to be in the room. Nor is there a mysterious force that prevents you from entering a room with a stolen key.

The hotel management is not required to remove your book. You didn't make a contract with them that said that if you leave stuff behind, they'll shred it for you. If you illegally re-enter your room with a stolen key to get it back, the hotel security staff is not required to catch you sneaking in. You didn't make a contract with them that said "if I try to sneak back into my room later, you are required to stop me." Rather, you signed a contract with them that said "I promise not to sneak back into my room later", a contract which you broke.

In this situation anything can happen. The book can be there -- you got lucky. Someone else's book can be there and yours could be in the hotel's furnace. Someone could be there right when you come in, tearing your book to pieces. The hotel could have removed the table and book entirely and replaced it with a wardrobe. The entire hotel could be just about to be torn down and replaced with a football stadium, and you are going to die in an explosion while you are sneaking around.

You don't know what is going to happen; when you checked out of the hotel and stole a key to illegaly use later, you gave up the right to live in a predictable, safe world because you chose to break the rules of the system.

C++ is not a safe language. It will cheerfully allow you to break the rules of the system. If you try to do something illegal and foolish like going back into a room you're not authorized to be in and rummaging through a desk that might not even be there anymore, C++ is not going to stop you. Safer languages than C++ solve this problem by restricting your power -- by having much stricter control over keys, for example.

UPDATE

Holy goodness, this answer is getting a lot of attention. (I'm not sure why -- I considered it to be just a "fun" little analogy, but whatever.)

I thought it might be germane to update this a bit with a few more technical thoughts.

Compilers are in the business of generating code which manages the storage of the data manipulated by that program. There are lots of different ways of generating code to manage memory, but over time two basic techniques have become entrenched.

The first is to have some sort of "long lived" storage area where the "lifetime" of each byte in the storage -- that is, the period of time when it is validly associated with some program variable -- cannot be easily predicted ahead of time. The compiler generates calls into a "heap manager" that knows how to dynamically allocate storage when it is needed and reclaim it when it is no longer needed.

The second is to have some sort of "short lived" storage area where the lifetime of each byte in the storage is well known, and, in particular, lifetimes of storages follow a "nesting" pattern. That is, the allocation of the longest-lived of the short-lived variables strictly overlaps the allocations of shorter-lived variables that come after it.

Local variables follow the latter pattern; when a method is entered, its local variables come alive. When that method calls another method, the new method's local variables come alive. They'll be dead before the first method's local variables are dead. The relative order of the beginnings and endings of lifetimes of storages associated with local variables can be worked out ahead of time.

For this reason, local variables are usually generated as storage on a "stack" data structure, because a stack has the property that the first thing pushed on it is going to be the last thing popped off.

It's like the hotel decides to only rent out rooms sequentially, and you can't check out until everyone with a room number higher than you has checked out.

So let's think about the stack. In many operating systems you get one stack per thread and the stack is allocated to be a certain fixed size. When you call a method, stuff is pushed onto the stack. If you then pass a pointer to the stack back out of your method, as the original poster does here, that's just a pointer to the middle of some entirely valid million-byte memory block. In our analogy, you check out of the hotel; when you do, you just checked out of the highest-numbered occupied room. If no one else checks in after you, and you go back to your room illegally, all your stuff is guaranteed to still be there in this particular hotel.

We use stacks for temporary stores because they are really cheap and easy. An implementation of C++ is not required to use a stack for storage of locals; it could use the heap. It doesn't, because that would make the program slower.

An implementation of C++ is not required to leave the garbage you left on the stack untouched so that you can come back for it later illegally; it is perfectly legal for the compiler to generate code that turns back to zero everything in the "room" that you just vacated. It doesn't because again, that would be expensive.

An implementation of C++ is not required to ensure that when the stack logically shrinks, the addresses that used to be valid are still mapped into memory. The implementation is allowed to tell the operating system "we're done using this page of stack now. Until I say otherwise, issue an exception that destroys the process if anyone touches the previously-valid stack page". Again, implementations do not actually do that because it is slow and unnecessary.

Instead, implementations let you make mistakes and get away with it. Most of the time. Until one day something truly awful goes wrong and the process explodes.

This is problematic. There are a lot of rules and it is very easy to break them accidentally. I certainly have many times. And worse, the problem often only surfaces when memory is detected to be corrupt billions of nanoseconds after the corruption happened, when it is very hard to figure out who messed it up.

More memory-safe languages solve this problem by restricting your power. In "normal" C# there simply is no way to take the address of a local and return it or store it for later. You can take the address of a local, but the language is cleverly designed so that it is impossible to use it after the lifetime of the local ends. In order to take the address of a local and pass it back, you have to put the compiler in a special "unsafe" mode, and put the word "unsafe" in your program, to call attention to the fact that you are probably doing something dangerous that could be breaking the rules.

For further reading:

  • What if C# did allow returning references? Coincidentally that is the subject of today's blog post:

http://blogs.msdn.com/b/ericlippert/archive/2011/06/23/ref-returns-and-ref-locals.aspx

  • Why do we use stacks to manage memory? Are value types in C# always stored on the stack? How does virtual memory work? And many more topics in how the C# memory manager works. Many of these articles are also germane to C++ programmers:

http://blogs.msdn.com/b/ericlippert/archive/tags/memory+management/

C++0x introduces a standardized memory model. What does it mean? And how is it going to affect C++ programming?

186 votes

As we know C++0x introduces a standardized memory model, but I don't know what does it exactly mean? And how is it going to affect C++ programming?

Herb Sutter says here that,

The memory model means that C++ code now has a standardized library to call regardless of who made the compiler and on what platform it's running. There's a standard way to control how different threads talk to the processor's memory.

"When you are talking about splitting [code] across different cores that's in the standard, we are talking about the memory model. We are going to optimize it without breaking the following assumptions people are going to make in the code," Sutter said.

Well, I can memorize this and similar paragraphs available online (as I've my own memory model since birth :P) and can even post as answer to questions asked by others, but to be honest, I don't exactly understand this.

So, what I basically want to know is, C++ programmers used to develop multi-threaded applications even before, so how does it matter if its POSIX threads, or Windows threads, or C++0x threads? What are the benefits? I want to understand few low-level details.

I also get this feeling that C++0x memory model is somehow related to C++0x multi-threading, as I often see these two together. If it is, how exactly? Why should they be related?

As I don't know how internals of multi-threading works, and what memory model means in general , so please help me understanding these concepts. :-)

First, you have to learn to think like a Language Lawyer.

The C++ specification does not make reference to any particular compiler, operating system, or CPU. It makes reference to an abstract machine that is a generalization of actual systems. In the Language Lawyer world, the job of the programmer is to write code for the abstract machine; the job of the compiler is to actualize that code on a concrete machine. By coding rigidly to the spec, you can be certain that your code will compile and run without modification on any system with a compliant C++ compiler, whether today or 50 years from now.

The abstract machine in the C++98/C++03 specification is fundamentally single-threaded. So it is not possible to write multi-threaded C++ code that is "fully portable" with respect to the spec. The spec does not even say anything about the atomicity of memory loads and stores or the order in which loads and stores might happen, never mind things like mutexes.

Of course, you can write multi-threaded code in practice for particular concrete systems -- like pthreads or Windows. But there is no standard way to write multi-threaded code for C++98/C++03.

The abstract machine in C++0x is multi-threaded by design. It also has a well-defined memory model; that is, it says what the compiler may and may not do when it comes to accessing memory.

Consider the following example, where a pair of global variables are accessed concurrently by two threads:

           Global
           int x, y;

Thread 1            Thread 2
x = 17;             cout << y << " ";
y = 37;             cout << x << endl;

What might Thread 2 output?

Under C++98/C++03, this is not even Undefined Behavior; the question itself is meaningless because the standard does not contemplate anything called a "thread".

Under C++0x, the result is Undefined Behavior, because loads and stores need not be atomic in general. Which may not seem like much of an improvement... And by itself, it's not.

But with C++0x, you can write this:

           Global
           atomic<int> x, y;

Thread 1                 Thread 2
x.store(17);             cout << y.load() << " ";
y.store(37);             cout << x.load() << endl;

Now things get much more interesting. First of all, the behavior here is defined. Thread 2 could now print 0 0 (if it runs before Thread 1), 37 17 (if it runs after Thread 1), or 0 17 (if it runs after Thread 1 assigns to x but before it assigns to y).

What it cannot print is 37 0, because the default mode for atomic loads/stores in C++0x is to enforce sequential consistency. This just means all loads and stores must be "as if" they happened in the order you wrote them within each thread, while operations among threads can be interleaved however the system likes. So the default behavior of atomics provides both atomicity and ordering for loads and stores.

Now, on a modern CPU, ensuring sequential consistency can be expensive. In particular, the compiler is likely to emit full-blown memory barriers between every access here. But if your algorithm can tolerate out-of-order loads and stores; i.e., if it requires atomicity but not ordering; i.e., if it can tolerate 37 0 as output from this program, then you can write this:

           Global
           atomic<int> x, y;

Thread 1                            Thread 2
x.store(17,memory_order_relaxed);   cout << y.load(memory_order_relaxed) << " ";
y.store(37,memory_order_relaxed);   cout << x.load(memory_order_relaxed) << endl;

The more modern the CPU, the more likely this is to be faster than the previous example.

Finally, if you just need to keep particular loads and stores in order, you can write:

           Global
           atomic<int> x, y;

Thread 1                            Thread 2
x.store(17,memory_order_release);   cout << y.load(memory_order_acquire) << " ";
y.store(37,memory_order_release);   cout << x.load(memory_order_acquire) << endl;

This takes us back to the ordered loads and stores -- so 37 0 is no longer a possible output -- but it does so with minimal overhead. (In this trivial example, the result is the same as full-blown sequential consistency; in a larger program, it would not be.)

Of course, if the only outputs you want to see are 0 0 or 37 17, you can just wrap a mutex around the original code. But if you have read this far, I bet you already know how that works, and this answer is already longer than I intended :-).

So, bottom line. Mutexes are great, and C++0x standardizes them. But sometimes for performance reasons you want lower-level primitives (e.g., the classic double-checked locking pattern). The new standard provides high-level gadgets like mutexes and condition variables, and it also provides low-level gadgets like atomic types and the various flavors of memory barrier. So now you can write sophisticated, high-performance concurrent routines entirely within the language specified by the standard, and you can be certain your code will compile and run unchanged on both today's systems and tomorrow's.

Although to be frank, unless you are an expert and working on some serious low-level code, you should probably stick to mutexes and condition variables. That's what I intend to do.

For more on this stuff, see this blog post.

In C++, why should `new` be used as little as possible?

128 votes

I stumbled upon the following question on Stack Overflow. One of the first posters says:

Stop using new so much. I can't see any reason you used new anywhere you did. You can create objects by value in C++ and it's one of the huge advantages to using the language. You do not have to allocate everything on the heap. Stop thinking like a Java programmer

I'm not really sure what he means by that. Could anyone clarify why objects should be created by value in C++ as often as possible, and what difference it makes internally? (or if I misinterpreted the answer, feel free to clarify what was meant)

There are 2 widely used memory allocation techniques: automatic allocation and dynamic allocation. There are corresponding regions of memory for each: the stack and the heap.

Stack

The stack always allocates memory in a sequential fashion. It can do so because it requires you to release the memory in the reverse order (First-In, Last-Out: FILO). This is the memory allocation technique for local variables in many programming languages. It is very, very fast because it requires minimal book keeping and the next address to allocate is implicit.

In C++, this is also called automatic storage because the storage is claimed automatically at the end of scope. As soon as execution of current code block (delimited using {}) is completed, memory for all variables in that block is automatically collected. This is also the moment where destructors are invoked to clean up resources.

Heap

The heap allows for a more flexible memory allocation mode. Bookkeeping is more complex and allocation is slower. Because there is no implicit release point, you must release the memory manually (using delete or delete[], free in C, etc.). However, the absence of an implicit release point is the key to the heap's flexibility.

Reasons to use the heap

Even if using the heap is slower and potentially leads to memory leaks or memory fragmentation, there are perfectly good use cases for the heap, as it's less limited.

Two key reasons to use the heap:

  • You don't know how much memory you need at compile-time. For instance, when reading a text file into a string, you usually don't know what size the file has, so you can't decide how much memory to allocate until you run the program.

  • You want to allocate memory which will persist after leaving the current block. For instance, you may want to write a function string readfile(string path) that returns the contents of a file. In this case, even if the stack could hold the entire file contents, you could not return from a function and keep the allocated memory block.

Why using the heap is often unnecessary.

In C++, there's a neat construct called a destructor. This mechanism allows you to manage resources by aligning the lifetime of the resource with the lifetime of a variable. It is commonly used to "wrap" resources into another object. std::string is a perfect example. This snippet:

int main ( int argc, char* argv[] )
{
    std::string program(argv[0]);
}

actually allocates a variable amount of memory. The std::string object allocates memory using the heap and releases it in its destructor. In this case, you did not need to manually manage any resources and still get the benefits of the dynamic memory allocation.

In particular, it implies that in this snippet:

int main ( int argc, char* argv[] )
{
    std::string * program = new std::string(argv[0]);
    delete program;
}

there is unneeded dynamic memory allocation. The program requires more typing (!) and introduces the risk of forgetting to deallocate the memory. It does this with no apparent benefit.

Why you should use the stack as often as possible

Basically, the last paragraph sums it up. Using the stack as often as possible makes your programs:

  • faster to type;
  • faster when run;
  • less prone to memory/resource leaks.

Bonus points

In the referenced question, there are additional concerns. In particular, the following class:

class Line {
public:
    Line();
    ~Line();
    std::string* mString;
};

Line::Line() {
    mString = new std::string("XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX");
}

Line::~Line() {
   //mString->clear(); // should not be neccessary
    delete mString;
}

Is actually a lot more risky to use than the following one:

class Line {
public:
    Line();
    std::string mString;
};

Line::Line() {
    mString = "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX";
    // note: there is a cleaner way to write this.
}

The reason is that std::string properly defines a copy constructor. Consider the following program:

int main ()
{
    Line l1;
    Line l2 = l1;
}

Using the original version, this program will likely crash, as it uses delete on the same string twice. Using the modified version, each Line instance will own it's own string instance, each with it's own memory and both will be released at the end of the program.

Other notes

Extensive use of RAII is considered a best practice in C++ because of all the reasons above. However, there is an additional benefit which is not immediately obvious. Basically, it's better than the sum of its parts. The whole mechanism composes. It scales.

If you use the Line class as a building block:

 class Table
 {
      Line borders[4];
 };

Then

 int main ()
 {
     Table table;
 }

allocates 4 std::string instances, 4 Line instances, 1 Table instance and all the strings contents and *everything is freed automagically.

Is multiplication and division using shift operators in C actually faster?

74 votes

Multiplication and division can be achieved using bit operators, for example

i*2 = i<<1
i*3 = (i<<1) + i;
i*10 = (i<<3) + (i<<1)

and so on.

Is it actually faster to use say (i<<3)+(i<<1) to multiply with 10 than using i*10 directly? Is there any sort of input that can't be multiplied or divided in this way?

Short answer: Not likely.

Long answer: Your compiler has an optimizer in it that knows how to multiply as quickly as your target processor architecture is capable. Your best bet is to tell the compiler your intent clearly (i.e. i*2 rather than i << 1) and let it decide what the fastest assembly/machine code sequence is. It's even possible that the processor itself has implemented the multiply instruction as a sequence of shifts & adds in microcode.

Bottom line--don't spend a lot of time worrying about this. If you mean to shift, shift. If you mean to multiply, multiply. Do what is semantically clearest--your coworkers will thank you later. Or, more likely, curse you later if you do otherwise.

Protecting executable from reverse engineering?

68 votes

I've been contemplating how to protect my C/C++ code from disassembly and reverse engineering. Normally I would never condone this behavior myself in my code; however the current protocol I've been working on must not ever be inspected or understandable, for the security of various people.

Now this is a new subject to me, and the internet is not really resourceful for prevention against reverse engineering but rather depicts tons of information on how to reverse engineer

Some of the things I've thought of so far are:

  • Code injection (calling dummy functions before and after actual function calls)
  • Code obfustication (mangles the disassembly of the binary)
  • Write my own startup routines (harder for debuggers to bind to)

    void startup();  
    int _start()   
    {  
        startup( );  
        exit   (0)   
    }  
    void startup()  
    {  
        /* code here */  
    }
    
  • Runtime check for debuggers (and force exit if detected)

  • Function trampolines

     void trampoline(void (*fnptr)(), bool ping = false)  
     {  
       if(ping)  
         fnptr();  
       else  
         trampoline(fnptr, true);  
     }
    
  • Pointless allocations and deallocations (stack changes a lot)

  • Pointless dummy calls and trampolines (tons of jumping in disassembly output)
  • Tons of casting (for obfuscated disassembly)

I mean these are some of the things I've thought of but they can all be worked around and or figured out by code analysts given the right time frame. Is there anything else alternative I have?

What Amber said is exactly right. You can make reverse engineering harder, but you can never prevent it. You should never trust "security" that relies on the prevention of reverse engineering.

That said, the best anti-reverse-engineering techniques that I've seen focused not on obfuscating the code, but instead on breaking the tools that people usually use to understand how code works. Finding creative ways to break disassemblers, debuggers, etc is both likely to be more effective and also more intellectually satisfying than just generating reams of horrible spaghetti code. This does nothing to block a determined attacker, but it does increase the likelihood that J Random Cracker will wander off and work on something easier instead.

gcc compile error with >2GB of code

61 votes

I have a huge number of functions totaling around 2.8 GB of object code (unfortunately there's no way around, scientific computing ...)

When I try to link them, I get (expected) relocation truncated to fit: R_X86_64_32S errors, that I hoped to circumvent by specifing the compiler flag -mcmodel=medium. All libraries that are linked in addition that I have control of are compiled with the -fpic flag.

Still, the error persists and I assume that some libraries I link to are not compiled with PIC.

Here's the error:

/usr/lib/gcc/x86_64-redhat-linux/4.1.2/../../../../lib64/crt1.o: In function `_start':
(.text+0x12): relocation truncated to fit: R_X86_64_32S against symbol `__libc_csu_fini'     defined in .text section in /usr/lib64/libc_nonshared.a(elf-init.oS)
/usr/lib/gcc/x86_64-redhat-linux/4.1.2/../../../../lib64/crt1.o: In function `_start':
(.text+0x19): relocation truncated to fit: R_X86_64_32S against symbol `__libc_csu_init'    defined in .text section in /usr/lib64/libc_nonshared.a(elf-init.oS)
/usr/lib/gcc/x86_64-redhat-linux/4.1.2/../../../../lib64/crt1.o: In function `_start':
(.text+0x20): undefined reference to `main'
/usr/lib/gcc/x86_64-redhat-linux/4.1.2/../../../../lib64/crti.o: In function    `call_gmon_start':
(.text+0x7): relocation truncated to fit: R_X86_64_GOTPCREL against undefined symbol      `__gmon_start__'
/usr/lib/gcc/x86_64-redhat-linux/4.1.2/crtbegin.o: In function `__do_global_dtors_aux':
crtstuff.c:(.text+0xb): relocation truncated to fit: R_X86_64_PC32 against `.bss' 
crtstuff.c:(.text+0x13): relocation truncated to fit: R_X86_64_32 against symbol `__DTOR_END__' defined in .dtors section in /usr/lib/gcc/x86_64-redhat-linux/4.1.2/crtend.o
crtstuff.c:(.text+0x19): relocation truncated to fit: R_X86_64_32S against `.dtors'
crtstuff.c:(.text+0x28): relocation truncated to fit: R_X86_64_PC32 against `.bss'
crtstuff.c:(.text+0x38): relocation truncated to fit: R_X86_64_PC32 against `.bss'
crtstuff.c:(.text+0x3f): relocation truncated to fit: R_X86_64_32S against `.dtors'
crtstuff.c:(.text+0x46): relocation truncated to fit: R_X86_64_PC32 against `.bss'
crtstuff.c:(.text+0x51): additional relocation overflows omitted from the output
collect2: ld returned 1 exit status
make: *** [testsme] Error 1

and those are system libraries I link against: -lgfortran -lm -lrt -lpthread

Any clues where to look for the problem??

EDIT: First of all, thank you for the discussion ... To clarify a bit, I have hundreds of functions (each approx 1~MB in size in separate object files) like this:

double func1(std::tr1::unordered_map<int, double> & csc, 
             std::vector<EvaluationNode::Ptr> & ti, 
             ProcessVars & s)
{
    double sum, prefactor, expr;

    prefactor = +s.ds8*s.ds10*ti[0]->value();
    expr =       ( - 5/243.*(s.x14*s.x15*csc[49300] + 9/10.*s.x14*s.x15*csc[49301] +
           1/10.*s.x14*s.x15*csc[49302] - 3/5.*s.x14*s.x15*csc[49303] -
           27/10.*s.x14*s.x15*csc[49304] + 12/5.*s.x14*s.x15*csc[49305] -
           3/10.*s.x14*s.x15*csc[49306] - 4/5.*s.x14*s.x15*csc[49307] +
           21/10.*s.x14*s.x15*csc[49308] + 1/10.*s.x14*s.x15*csc[49309] -
           s.x14*s.x15*csc[51370] - 9/10.*s.x14*s.x15*csc[51371] -
           1/10.*s.x14*s.x15*csc[51372] + 3/5.*s.x14*s.x15*csc[51373] +
           27/10.*s.x14*s.x15*csc[51374] - 12/5.*s.x14*s.x15*csc[51375] +
           3/10.*s.x14*s.x15*csc[51376] + 4/5.*s.x14*s.x15*csc[51377] -
           21/10.*s.x14*s.x15*csc[51378] - 1/10.*s.x14*s.x15*csc[51379] -
           2*s.x14*s.x15*csc[55100] - 9/5.*s.x14*s.x15*csc[55101] -
           1/5.*s.x14*s.x15*csc[55102] + 6/5.*s.x14*s.x15*csc[55103] +
           27/5.*s.x14*s.x15*csc[55104] - 24/5.*s.x14*s.x15*csc[55105] +
           3/5.*s.x14*s.x15*csc[55106] + 8/5.*s.x14*s.x15*csc[55107] -
           21/5.*s.x14*s.x15*csc[55108] - 1/5.*s.x14*s.x15*csc[55109] -
           2*s.x14*s.x15*csc[55170] - 9/5.*s.x14*s.x15*csc[55171] -
           1/5.*s.x14*s.x15*csc[55172] + 6/5.*s.x14*s.x15*csc[55173] +
           27/5.*s.x14*s.x15*csc[55174] - 24/5.*s.x14*s.x15*csc[55175] +
           // ...
           ;

        sum += prefactor*expr;
    // ...
    return sum;
}

The object s is relatively small and keeps the needed constants x14, x15, ..., ds0, ..., etc while ti just returns a double from an external library. As you can see, csc[] is a precomputed map of values which is also evaluated in separate object files (again hundreds with about ~1 MB of size each) of the following form:

void cscs132(std::tr1::unordered_map<int,double> & csc, ProcessVars & s)
{
    {
    double csc19295 =       + s.ds0*s.ds1*s.ds2 * ( -
           32*s.x12pow2*s.x15*s.x34*s.mbpow2*s.mWpowinv2 -
           32*s.x12pow2*s.x15*s.x35*s.mbpow2*s.mWpowinv2 -
           32*s.x12pow2*s.x15*s.x35*s.x45*s.mWpowinv2 -
           32*s.x12pow2*s.x25*s.x34*s.mbpow2*s.mWpowinv2 -
           32*s.x12pow2*s.x25*s.x35*s.mbpow2*s.mWpowinv2 -
           32*s.x12pow2*s.x25*s.x35*s.x45*s.mWpowinv2 +
           32*s.x12pow2*s.x34*s.mbpow4*s.mWpowinv2 +
           32*s.x12pow2*s.x34*s.x35*s.mbpow2*s.mWpowinv2 +
           32*s.x12pow2*s.x34*s.x45*s.mbpow2*s.mWpowinv2 +
           32*s.x12pow2*s.x35*s.mbpow4*s.mWpowinv2 +
           32*s.x12pow2*s.x35pow2*s.mbpow2*s.mWpowinv2 +
           32*s.x12pow2*s.x35pow2*s.x45*s.mWpowinv2 +
           64*s.x12pow2*s.x35*s.x45*s.mbpow2*s.mWpowinv2 +
           32*s.x12pow2*s.x35*s.x45pow2*s.mWpowinv2 -
           64*s.x12*s.p1p3*s.x15*s.mbpow4*s.mWpowinv2 +
           64*s.x12*s.p1p3*s.x15pow2*s.mbpow2*s.mWpowinv2 +
           96*s.x12*s.p1p3*s.x15*s.x25*s.mbpow2*s.mWpowinv2 -
           64*s.x12*s.p1p3*s.x15*s.x35*s.mbpow2*s.mWpowinv2 -
           64*s.x12*s.p1p3*s.x15*s.x45*s.mbpow2*s.mWpowinv2 -
           32*s.x12*s.p1p3*s.x25*s.mbpow4*s.mWpowinv2 +
           32*s.x12*s.p1p3*s.x25pow2*s.mbpow2*s.mWpowinv2 -
           32*s.x12*s.p1p3*s.x25*s.x35*s.mbpow2*s.mWpowinv2 -
           32*s.x12*s.p1p3*s.x25*s.x45*s.mbpow2*s.mWpowinv2 -
           32*s.x12*s.p1p3*s.x45*s.mbpow2 +
           64*s.x12*s.x14*s.x15pow2*s.x35*s.mWpowinv2 +
           96*s.x12*s.x14*s.x15*s.x25*s.x35*s.mWpowinv2 +
           32*s.x12*s.x14*s.x15*s.x34*s.mbpow2*s.mWpowinv2 -
           32*s.x12*s.x14*s.x15*s.x35*s.mbpow2*s.mWpowinv2 -
           64*s.x12*s.x14*s.x15*s.x35pow2*s.mWpowinv2 -
           32*s.x12*s.x14*s.x15*s.x35*s.x45*s.mWpowinv2 +
           32*s.x12*s.x14*s.x25pow2*s.x35*s.mWpowinv2 +
           32*s.x12*s.x14*s.x25*s.x34*s.mbpow2*s.mWpowinv2 -
           32*s.x12*s.x14*s.x25*s.x35pow2*s.mWpowinv2 -
           // ...

       csc.insert(cscMap::value_type(192953, csc19295));
    }

    {
       double csc19296 =      // ... ;

       csc.insert(cscMap::value_type(192956, csc19296));
    }

    // ...
}

and that's about it. The final step then just consists in calling all those func[i] and summing the result up.

Concerning the fact that this is a rather special and unusual case: Yes it is. This is what people have to cope with when trying to do high precision computations for particle physics.

EDIT2: I should also add that x12, x13, etc are not really constants. They are set to specific values, all those functions are run and the result returned, and then a new set of x12, x13, etc is chosen to produce the next value. And this has to be done 10^5 to 10^6 times...

EDIT3: Thank you for the suggestions and the discussion so far... I'll try to roll the loops up upon code generation somehow, not sure how to this exactly, to be honest, but this is the best bet.

Btw, I didn't try to hide behind "this is scientific computing -- no way to optimize". It's just that the basis for this code is something that comes out of a "black box" where I have no real access to and, moreover, the whole thing worked great with simple examples and I mainly feel overwhelmed with what happens in a real world application ...

EDIT4: So, I have managed to reduce the code size of the csc definitions by about one forth by simplifying expressions in a computer algebra system (Mathematica). I see now also some way to reduce it by another order of magnitude or so by applying some other tricks before generating the code (which would bring this part down to about 100MB) and I hope this idea works.

Now related to your answers: I'm trying to roll the loops back up again in the funcs, where a CAS won't help much, but I have already some ideas. For instance, sorting the expressions by the variables like x12, x13,..., parse the cscs with Python and generate tables that relate them to each other. Then I can at least generate these parts as loops. As this seems to be the best solution so far, I mark this as the best answer.

However, I'd like to also give credit to VJo. GCC 4.6 indeed works much better, produces smaller code and is faster. Using the large model works at the code as-is. So technically this is the correct answer, but changing the whole concept is a much better approach.

Thank you all for your suggestions and help. If anyone is interested, I'm going to post the final outcome as soon as I am ready.

REMARKS: Just some remarks to some other answers: The code I'm trying to run does not originate in an expansion of simple functions/algorithms and stupid unnecessary unrolling. What actually happens is that the stuff we start with are pretty complicated mathematical objects and bringing them to a numerically computable form generates these expressions. The problem lies actually in the underlying physical theory. Complexity of intermediate expressions scales factorially, which is well known, but when combining all of this stuff to something physically measureable -- an observable -- it just boils down to only a handful of very small functions that form the basis of the expressions. (There is definitely something "wrong" in this respect with the general and only available ansatz which is called "perturbation theory") We try to bring this ansatz to another level, which is not feasible analytically anymore and where the basis of needed functions is not known. So we try to brute-force it like this. Not the best way, but hopefully one that helps with our understanding of the physics at hand in the end...

LAST EDIT: Thanks to all your suggestions, I've managed to reduce the code size considerably, using Mathematica and a modification of the code generator for the funcs somewhat along the lines of the top answer :)

I have simplified the csc functions with Mathematica, bringing it down to 92MB. This is the irreducible part. The first attempts took forever, but after some optimizations this now runs through in about 10mins on a single CPU.

The effect on the funcs was dramatic: The whole code size for them is down to approximately 9 MB, so the code now totals in the 100 MB range. Now it makes sense to turn optimizations on and the execution is quite fast.

Again, thank you all for your suggestions, I've learned a lot.

So, you already have a program that produces this text:

prefactor = +s.ds8*s.ds10*ti[0]->value();
expr = ( - 5/243.*(s.x14*s.x15*csc[49300] + 9/10.*s.x14*s.x15*csc[49301] +
       1/10.*s.x14*s.x15*csc[49302] - 3/5.*s.x14*s.x15*csc[49303] -...

and

double csc19295 =       + s.ds0*s.ds1*s.ds2 * ( -
       32*s.x12pow2*s.x15*s.x34*s.mbpow2*s.mWpowinv2 -
       32*s.x12pow2*s.x15*s.x35*s.mbpow2*s.mWpowinv2 -
       32*s.x12pow2*s.x15*s.x35*s.x45*s.mWpowinv2 -...

right?

If all your functions have a similar "format" (multiply n numbers m times and add the results - or something similar) then I think you can do this:

  • change the generator program to output offsets instead of strings (i.e. instead of the string "s.ds0" it will produce offsetof(ProcessVars, ds0)
  • create an array of such offsets
  • write an evaluator which accepts the array above and the base addresses of the structure pointers and produces an result

The array+evaluator will represent the same logic as one of your functions, but only the evaluator will be code. The array is "data" and can be either generated at runtime or saved on disk and read i chunks or with a memory mapped file.

For your particular example in func1 imagine how you would rewrite the function via an evaluator if you had access to the base address of s and csc and also a vector like representation of the constants and the offsets you need to add to the base addresses to get to x14, ds8 and csc[51370]

You need to create a new form of "data" that will describe how to process the actual data you pass to your huge number of functions.

How can I reliably get the address of an object?

57 votes

Consider the following program:

struct ghost
{
    // ghosts like to pretend that they don't exist
    ghost* operator&() const volatile { return 0; }
};

int main()
{
    ghost clyde;
    ghost* clydes_address = &clyde; // darn; that's not clyde's address :'( 
}

How do I get clyde's address?

I'm looking for a solution that will work equally well for all types of objects. A C++03 solution would be nice, but I'm interested in C++0x solutions too. If possible, let's avoid any implementation-specific behavior.

I am aware of C++0x's std::addressof function template, but am not interested in using it here: I'd like to understand how a Standard Library implementer might implement this function template.

Let us first copy the code from Boost, minus the compiler work around bits:

template<class T>
struct addr_impl_ref
{
  T & v_;

  inline addr_impl_ref( T & v ): v_( v ) {}
  inline operator T& () const { return v_; }

private:
  addr_impl_ref & operator=(const addr_impl_ref &);
};

template<class T>
struct addressof_impl
{
  static inline T * f( T & v, long ) {
    return reinterpret_cast<T*>(
        &const_cast<char&>(reinterpret_cast<const volatile char &>(v)));
  }

  static inline T * f( T * v, int ) { return v; }
};

template<class T>
T * addressof( T & v ) {
  return addressof_impl<T>::f( addr_impl_ref<T>( v ), 0 );
}

What happens if we pass a reference to function ?

Note: addressof cannot be used with a pointer to function

In C++ if void func(); is declared, then func is a reference to a function taking no argument and return no result. This reference to a function can be trivially converted into a pointer to function -- from @Konstantin: According to 13.3.3.2 both T & and T * are indistinguishable for functions. The 1st one is an Identity conversion and the 2nd one is Function-to-Pointer conversion both having "Exact Match" rank (13.3.3.1.1 table 9).

The reference to function pass through addr_impl_ref, there is an ambiguity in the overload resolution for the choice of f, which is solved thanks to the dummy argument 0, which is an int first and could be promoted to a long (Integral Promotion).

Thus we simply returns the pointer.

What happens if we pass a type with a conversion operator ?

If the conversion operator yields a T* then we have an ambiguity: for f(T&,long) an Integral Promotion is required for the second argument while for f(T*,int) the conversion operator is called on the first (thanks to @litb)

That's when addr_impl_ref kicks in. The C++ Standard mandates that a conversion sequence may contain at most one user-defined conversion. By wrapping the type in addr_impl_ref and forcing the use of a conversion sequence already, we "disable" any conversion operator that the type comes with.

Thus the f(T&,long) overload is selected (and the Integral Promotion performed).

What happens for any other type ?

Thus the f(T&,long) overload is selected, because there the type does not match the T* parameter.

Note: from the remarks in the file regarding Borland compatibility, arrays do not decay to pointers, but are passed by reference.

What happens in this overload ?

We want to avoid applying operator& to the type, as it may have been overloaded.

The Standard guarantees that reinterpret_cast may be used for this work (see @Matteo Italia's answer: 5.2.10/10).

Boost adds some niceties with const and volatile qualifiers to avoid compiler warnings (and properly use a const_cast to remove them).

  • Cast T& to char const volatile&
  • Stip the const and volatile
  • Apply the & operator to take the address
  • Cast back to a T*

The const/volatile juggling is a bit of black magic, but it does simplify the work (rather than providing 4 overloads). Note that since T is unqualified, if we pass a ghost const&, then T* is ghost const*, thus the qualifiers have not really been lost.

EDIT: the pointer overload is used for pointer to functions, I amended the above explanation somewhat. I still do not understand why it is necessary though.

The following ideone output sums this up, somewhat.

Weird C++ Syntax

52 votes

I came across this weird C++ program.

#include <iostream>
using namespace std;
int main()
{
  int a = ({int x; cin >> x; x;});
  cout << a;
}

Can anyone explain what is going on? What is this construct called?

Assigns user input value to a and prints it out. it is done by using a Statement Expression.

Statement Expressions:
A compound statement is a sequence of statements enclosed by braces. In GNU C, a compound statement inside parentheses may appear as an expression in what is called a Statement expression.

         .--------------.
         V              |
>>-(--{----statement--;-+--}--)--------------------------------><

The value of a statement expression is the value of the last simple expression to appear in the entire construct. If the last statement is not an expression, then the construct is of type void and has no value.

Note: This excerpt is taken from IBM XL C/C++ v7.0 doccumentation.

This construct is an orthogonal extension of GNU C and is supported only in IBM C++.

Is main() overloaded in C++ ?

39 votes

There are 2 valid versions of main() exist in C++:

int main()  // version 1
int main(int argc, char **argv)  // version 2

(Even old gcc allows replacing char **argv with int **argv !!)

But both the versions cannot co-exist at the same time ! (use case can be like: while running the binary from command prompt, if you pass no argument then 1st version should be called else the 2nd version).

Is there a special compiler check to just allow only one version in one binary ?

§3.6.1/2 (C++03) says

An implementation shall not predefine the main function. This function shall not be overloaded. It shall have a return type of type int, but otherwise its type is implementation-defined. All implementations shall allow both of the following definitions of main:

   int main() { /* ... */ }
   int main(int argc, char* argv[]) { /* ... */ }

You can use either of them. Both are standard compliant.

Also, since char *argv[] is equivalent to char **argv, replacing char *argv[] with char **argv doesn't make any difference.


But both the versions cannot co-exist at the same time ! (use case can be like: while running the binary from command prompt, if you pass no argument then 1st version should be called else the 2nd version).

No. Both versions cannot co-exist at the same time. One program can have exactly one main function. Which one, depends on your choice. If you want to process command-line argument, then you've to choose the second version, or else first version is enough. Also note that if you use second version, and don't pass any command line argument, then there is no harm in it. It will not cause any error. You just have to interpret argc and argv accordingly, and based on their value, you've to write the logic and flow of your program.

Does const-correctness give the compiler more room for optimization?

33 votes

I know that it improves readability and makes the program less error-prone, but how much does it improve the performance?

And on a side note, what's the major difference between a reference and a const pointer? I would assume they're stored in the memory differently, but how so?

Thanks.

[Edit: OK so this question is more subtle than I thought at first.]

Declaring a pointer-to-const or reference-of-const never helps any compiler to optimize anything. (Although see the Update at the bottom of this answer.)

The const declaration only indicates how an identifier will be used within the scope of its declaration; it does not say that the underlying object can not change.

Example:

int foo(const int *p) {
    int x = *p;
    bar(x);
    x = *p;
    return x;
}

The compiler cannot assume that *p is not modified by the call to bar(), because p could be (e.g.) a pointer to a global int and bar() might modify it.

If the compiler knows enough about the caller of foo() and the contents of bar() that it can prove bar() does not modify *p, then it can also perform that proof without the const declaration.

But this is true in general. Because const only has an effect within the scope of the declaration, the compiler can already see how you are treating the pointer or reference within that scope; it already knows that you are not modifying the underlying object.

So in short, all const does in this context is prevent you from making mistakes. It does not tell the compiler anything it does not already know, and therefore it is irrelevant for optimization.

What about functions that call foo()? Like:

int x = 37;
foo(&x);
printf("%d\n", x);

Can the compiler prove that this prints 37, since foo() takes a const int *?

No. Even though foo() takes a pointer-to-const, it might cast the const-ness away and modify the int. (This is not undefined behavior.) Here again, the compiler cannot make any assumptions in general; and if it knows enough about foo() to make such an optimization, it will know that even without the const.

The only time const might allow optimizations is cases like this:

const int x = 37;
foo(&x);
printf("%d\n", x);

Here, to modify x through any mechanism whatsoever (e.g., by taking a pointer to it and casting away the const) is to invoke Undefined Behavior. So the compiler is free to assume you do not do that, and it can propagate the constant 37 into the printf(). This sort of optimization is legal for any object you declare const. (In practice, a local variable to which you never take a reference will not benefit, because the compiler can already see whether you modify it within its scope.)

To answer your "side note" question, (a) a const pointer is a pointer; and (b) a const pointer can equal NULL. You are correct that the internal representation (i.e. an address) is most likely the same.

[update]

As Christoph points out in the comments, my answer is incomplete because it does not mention restrict.

Section 6.7.3.1 (4) of the C99 standard says:

During each execution of B, let L be any lvalue that has &L based on P. If L is used to access the value of the object X that it designates, and X is also modified (by any means), then the following requirements apply: T shall not be const-qualified. ...

(Here B is a basic block over which P, a restrict-pointer-to-T, is in scope.)

So if a C function foo() is declared like this:

foo(const int * restrict p)

...then the compiler may assume that no modifications to *p occur during the lifetime of p -- i.e., during the execution of foo() -- because otherwise the Behavior would be Undefined.

So in principle, combining restrict with a pointer-to-const could enable both of the optimizations that are dismissed above. Do any compilers actually implement such an optimization, I wonder? (GCC 4.5.2, at least, does not.)

Note that restrict only exists in C, not C++ (not even C++0x), except as a compiler-specific extension.

How much is too much with C++0x auto keyword

33 votes

I've been using the new auto keyword available in the C++0x standard for complicated templated types which is what I believe it was designed for. But I'm also using it for things like:

auto foo = std::make_shared<Foo>();

And more skeptically for:

auto foo = bla(); // where bla() return a shared_ptr<Foo>

I haven't seen much discussion on this topic. It seems that auto could be overused, since a type is often a form of documentation and sanity checks. Where do you draw the line in using auto and what are the recommended use cases for this new feature?

To clarify: I'm not asking for a philosophical opinion; I'm asking for the intended use of this keyword by the standard committee, possibly with comments on how that intended use is realized in practice.

Side note: This question was moved to SE.Programmers and then back to Stack Overflow. Discussion about this can be found in this meta question.

I think that one should use the auto keyword whenever it's hard to say how to write the type at first sight, but the type of the right hand side of an expression is obvious. For example, using:

my_multi_type::nth_index<2>::type::key_type::composite_key_type::key_extractor_tuple::tail_type::head_type::result_type

to get the composite key type in boost::multi_index, even though you know that it is int. You can't just write int because it could be changed in the future. I would write auto in this case.

So if the auto keyword improves readability in a particular case then use it. You can write auto when it is obvious to the reader what type auto represents.

Here are some examples:

auto foo = std::make_shared<Foo>(); // obvious
auto foo = bla(); // unclear. don't know which type `foo` has

const size_t max_size = 100;
for ( auto x = max_size; x > 0; --x ) // unclear. could lead to the errors
                                      // since max_size is unsigned

std::vector<some_class> v;
for ( auto it = v.begin(); it != v.end(); ++it ) // ok, since I know
// that `it` has an iterator type (don't really care which one in this context)

Optimizations for pow() with const non-integer exponent?

32 votes

I have hot spots in my code where I'm doing pow() taking up around 10-20% of my execution time.

My input to pow(x,y) is very specific, so I'm wondering if there's a way to roll two pow() approximations (one for each exponent) with higher performance:

  • I have two constant exponents: 2.4 and 1/2.4.
  • When the exponent is 2.4, x will be in the range (0.090473935, 1.0].
  • When the exponent is 1/2.4, x will be in the range (0.0031308, 1.0].
  • I'm using SSE/AVX float vectors. If platform specifics can be taken advantage of, right on!

A maximum error rate around 0.01% is ideal, though I'm interested in full precision (for float) algorithms as well.

I'm already using a fast pow() approximation, but it doesn't take these constraints into account. Is it possible to do better?

Another answer because this is very different from my previous answer, and this is blazing fast. Relative error is 3e-8. Want more accuracy? Add a couple more Chebychev terms. It's best to keep the order odd as this makes for a small discontinuity between 2^n-epsilon and 2^n+epsilon.

#include <stdlib.h>
#include <math.h>

// Returns x^(5/12) for x in [1,2), to within 3e-8 (relative error).
// Want more precision? Add more Chebychev polynomial coefs.
double pow512norm (
   double x)
{
   static const int N = 8;

   // Chebychev polynomial terms.
   // Non-zero terms calculated via
   //   integrate (2/pi)*ChebyshevT[n,u]/sqrt(1-u^2)*((u+3)/2)^(5/12)
   //   from -1 to 1
   // Zeroth term is similar except it uses 1/pi rather than 2/pi.
   static const double Cn[N] = { 
       1.1758200232996901923,
       0.16665763094889061230,
      -0.0083154894939042125035,
       0.00075187976780420279038,
      // Wolfram alpha doesn't want to compute the remaining terms
      // to more precision (it times out).
      -0.0000832402,
       0.0000102292,
      -1.3401e-6,
       1.83334e-7};

   double Tn[N];

   double u = 2.0*x - 3.0;

   Tn[0] = 1.0;
   Tn[1] = u;
   for (int ii = 2; ii < N; ++ii) {
      Tn[ii] = 2*u*Tn[ii-1] - Tn[ii-2];
   }   

   double y = 0.0;
   for (int ii = N-1; ii >= 0; --ii) {
      y += Cn[ii]*Tn[ii];
   }   

   return y;
}


// Returns x^(5/12) to within 3e-8 (relative error).
double pow512 (
   double x)
{
   static const double pow2_512[12] = {
      1.0,
      pow(2.0, 5.0/12.0),
      pow(4.0, 5.0/12.0),
      pow(8.0, 5.0/12.0),
      pow(16.0, 5.0/12.0),
      pow(32.0, 5.0/12.0),
      pow(64.0, 5.0/12.0),
      pow(128.0, 5.0/12.0),
      pow(256.0, 5.0/12.0),
      pow(512.0, 5.0/12.0),
      pow(1024.0, 5.0/12.0),
      pow(2048.0, 5.0/12.0)
   };

   double s;
   int iexp;

   s = frexp (x, &iexp);
   s *= 2.0;
   iexp -= 1;

   div_t qr = div (iexp, 12);
   if (qr.rem < 0) {
      qr.quot -= 1;
      qr.rem += 12;
   }

   return ldexp (pow512norm(s)*pow2_512[qr.rem], 5*qr.quot);
}

Addendum: What's going on here?
Per request, the following explains how the above code works.

Overview
The above code defines two functions, double pow512norm (double x) and double pow512 (double x). The latter is the entry point to the suite; this is the function that user code should call to calculate x^(5/12). The function pow512norm(x) uses Chebyshev polynomials to approximate x^(5/12), but only for x in the range [1,2]. (Use pow512norm(x) for values of x outside that range and the result will be garbage.)

The function pow512(x) splits the incoming x into a pair (double s, int n) such that x = s * 2^n and such that 1≤s<2. A further partitioning of n into (int q, unsigned int r) such that n = 12*q + r and r is less than 12 lets me split the problem of finding x^(5/12) into parts:

  1. x^(5/12)=(s^(5/12))*((2^n)^(5/12)) via (u*v)^a=(u^a)*(v^a) for positive u,v and real a.
  2. s^(5/12) is calculated via pow512norm(s).
  3. (2^n)^(5/12)=(2^(12*q+r))^(5/12) via substitution.
  4. 2^(12*q+r)=(2^(12*q))*(2^r) via u^(a+b)=(u^a)*(u^b) for positive u, real a,b.
  5. (2^(12*q+r))^(5/12)=(2^(5*q))*((2^r)^(5/12)) via some more manipulations.
  6. (2^r)^(5/12) is calculated by the lookup table pow2_512.
  7. Calculate pow512norm(s)*pow2_512[qr.rem] and we're almost there. Here qr.rem is the r value calculated in step 3 above. All that is needed is to multiply this by 2^(5*q) to yield the desired result.
  8. That is exactly what the math library function ldexp does.

Function Approximation
The goal here is to come up with an easily computable approximation of f(x)=x^(5/12) that is 'good enough' for the problem at hand. Our approximation should be close to f(x) in some sense. Rhetorical question: What does 'close to' mean? Two competing interpretations are minimizing the mean square error versus minimizing the maximum absolute error.

I'll use a stock market analogy to describe the difference between these. Suppose you want to save for your eventual retirement. If you are in your twenties, the best thing to do is to invest in stocks or stock market funds. This is because over a long enough span of time, the stock market on average beats any other investment scheme. However, we've all seen times when putting money into stocks is a very bad thing to do. If you are in your fifties or sixties (or forties if you want to retire young) you need to invest a bit more conservatively. Those downswings can wreak have on your retirement portfolio.

Back to function approximation: As the consumer of some approximation, you are typically worried about the worst-case error rather than the performance "on average". Use some approximation constructed to give the best performance "on average" (e.g. least squares) and Murphy's law dictates that your program will spend a whole lot of time using the approximation exactly where the performance is far worse than average. What you want is a minimax approximation, something that minimizes the maximum absolute error over some domain. A good math library will take a minimax approach rather than a least squares approach because this lets the authors of the math library give some guaranteed performance of their library.

Math libraries typically use a polynomial or a rational polynomial to approximate some function f(x) over some domain a≤x≤b. Suppose the function f(x) is analytic over this domain and you want to approximate the function by some polynomial p(x) of degree N. For a given degree N there exists some magical, unique polynomial p(x) such that p(x)-f(x) has N+2 extrema over [a,b] and such that the absolute values of these N+2 extrema are all equal to one another. Finding this magical polynomial p(x) is the holy grail of function approximators.

I did not find that holy grail for you. I instead used a Chebyshev approximation. The Chebyshev polynomials of the first kind are an orthogonal (but not orthonormal) set of polynomials with some very nice features when it comes to function approximation. The Chebyshev approximation oftentimes is very close to that magical polynomial p(x). (In fact, the Remez exchange algorithm that does find that holy grail polynomial typically starts with a Chebyshev approximation.)

pow512norm(x)
This function uses Chebyshev approximation to find some polynomial p*(x) that approximates x^(5/12). Here I'm using p*(x) to distinguish this Chebyshev approximation from the magical polynomial p(x) described above. The Chebyshev approximation p*(x) is easy to find; finding p(x) is a bear. The Chebyshev approximation p*(x) is sum_i Cn[i]*Tn(i,x), where the Cn[i] are the Chebyshev coefficients and Tn(i,x) are the Chebyshev polynomials evaluated at x.

I used Wolfram alpha to find the Chebyshev coefficients Cn for me. For example, this calculates Cn[1]. The first box after the input box has the desired answer, 0.166658 in this case. That's not as many digits as I would like. Click on 'more digits' and voila, you get a whole lot more digits. Wolfram alpha is free; there is a limit on how much computation it will do. It hits that limit on higher order terms. (If you buy or have access to mathematica you will be able to calculate those high-order coefficients to a high degree of precision.)

The Chebyshev polynomials Tn(x) are calculated in the array Tn. Beyond giving something very close to magical polynomial p(x), another reason for using Chebyshev approximation is that the values of those Chebyshev polynomials are easily calculated: Start with Tn[0]=1 and Tn[1]=x, and then iteratively calculate Tn[i]=2*x*Tn[i-1] - Tn[i-2]. (I used 'ii' as the index variable rather than 'i' in my code. I never use 'i' as a variable name. How many words in the English language have an 'i' in the word? How many have two consecutive 'i's?)

pow512(x)
pow512 is the function that user code should be calling. I already described the basics of this function above. A few more details: The math library function frexp(x) returns the significand s and exponent iexp for the input x. (Minor issue: I want s between 1 and 2 for use with pow512norm but frexp returns a value between 0.5 and 1.) The math library function div returns the quotient and remainder for integer division in one swell foop. Finally, I use the math library function ldexp to put the three parts together to form the final answer.

Using "double" as counter variables in loops

31 votes

In a book I am currently reading, there is this excerpt:

You can also use a floating-point value as a loop counter. Here's an example of a for loop with this kind of counter:

double a(0.3), b(2.5);
for(double x = 0.0; x <= 2.0; x += 0.25)
    cout << "\n\tx = " << x << "\ta*x + b = " << a*x + b;

This code fragment calculates the value of a*x+b for values of x from 0.0 to 2.0, in steps of 0.25; however, you need to take care when using a floating-point counter in a loop. Many decimal values cannot be represented exactly in binary floating-point form, so discrepancies can build up with cumulative values. This means that you should not code a for loop such that ending the loop depends on a floating-point loop counter reaching a precise value. For example, the following poorly-designed loop never ends:

for(double x = 0.0 ; x != 1.0 ; x += 0.2)
    cout << x;

The intention with this loop is to output the value of x as it varies from 0.0 to 1.0; however, 0.2 has no exact representation as a binary floating-point value, so the value of x is never exactly 1. Thus, the second loop control expression is always false, and the loop continues indefinitely.

Can someone please explain how the first code block runs while the second doesn't?

The first one will eventually terminate, even if x doesn't reach exactly 2.0... because it'll end up being greater than 2.0, and thus break out.

The second one would have to make x hit exactly 1.0 in order to break.

It's unfortunate that the first example uses a step of 0.25, which is exactly representable in binary floating point - it would have been smarter to make both examples use 0.2 as the step size. (0.2 isn't exactly representable in binary floating point.)

Is memory allocation a system call?

27 votes

Is memory allocation a system call? For example, malloc and new. Is the heap shared by different processes and managed by the OS. What about private heap? If memory allocation in the heap is managed by the OS, how expensive is this?

I would also like to have some link to places where I can read more about this topic.

In general, malloc and new do not perform a system call at each invocation. However, they use a lower-level mechanism to allocate large pages of memory. On Windows, the lower mechanism is VirtualAlloc(). I believe on POSIX systems, this is somewhat equivalent to mmap(). Both of these perform a system call to allocate memory to the process at the OS level. Subsequent allocations will use smaller parts of those large pages without incurring a system call.

The heap is normally inner-process and is not shared between processes. If you need this, most OSes have an API for allocating shared memory. A portable wrapper for these APIs is available in the Boost.Interprocess library.

If you would like to learn more about memory allocation and relationship with the OS, you should take a look at a good book on operating systems. I always suggest Modern Operating Systems by Andrew S. Tanenbaum as it is very easy to read.

"xor eax, ebp" being used

26 votes

I just tried compiling a couple of C++ snippets on VS2010 and analyzed the executables on IDA Pro. Something I noticed is that there most of them have something like the following at the start(shortly after a call to __security_check_cookie)

xor eax, ebp

and something like

xor ecx, ebp

at the bottom. Why does this happen? The compiler optimization was turned off.

These are buffer overrun protection methods, and have nothing to do with compiler optimisation. MSVC will (if you specify the /GS switch) push a security cookie onto the stack near the return address so that it can detect a common case of stack corruption.

Stack corruption can either be caused by bad code along the lines of:

char buff[5];
strcpy (buff, "Man, this string is waaay too long!!");

or by malicious users taking advantage of bad coding practices, like the use of scanf ("%s", myBuff) for user input. Carefully crafted attacks like that can suborn your program to do things you probably don't want it to.

By placing a cookie close to the return address, a large number of bugs (and attack vectors) can be prevented, simply due to the fact that the memory corruptions tend to be sequential in nature. In other words, if you've overwritten the return address, it's probably because you started writing on one side of the cookie and corrupted memory all the way up to the return address on the other side of the cookie (hence the cookie will be overwritten as well).

It doesn't catch all bugs since you may have some code like:

char buff[5];
buff[87] = 'x';

which could potentially corrupt the return address without touching the cookie. But it will catch all those malicious ones which rely on entering a longer string than expected, which corrupt up to the return address (including cookie).

The sequence you're probably seeing in the code is something like:

mov  eax, dword ptr ds:___sec_cookie   ; fixed value.
xor  eax, ebp                          ; adjust based on base pointer.
mov  [ebp+SOMETHING], eax              ; store adjusted value.

which is customising the cookie, depending on the current base pointer.

This will change what is actually put on the stack at each stack level (and also depending on parameter count and sizes as well) and is probably an attempt to further secure the code from malicious intent, by ensuring a variable signature is written to the stack rather than a fixed value (otherwise the attacker could enter characters including a valid cookie).

And the sequence at the end will run something like this:

mov  ecx, [ebp+SOMETHING]              ; get the adjusted cookie.
xor  ecx, ebp                          ; un-adjust it, since
                                       ;   ((N xor X) xor X) == N.
call @__sec_check_cookie               ; check the cookie.

It's basically just the reverse process of that described above. The @__sec_check_cookie call will only return if ecx is set to the correct cookie value. Otherwise it will raise a fault, as confirmed here:

The __security_check_cookie() routine is straightforward: if the cookie was unchanged, it executes the RET instruction and ends the function call. If the cookie fails to match, the routine calls report_failure().

The report_failure() function then calls __security_error_handler(). Both functions are defined in the seccook.c file of the C run-time (CRT) source files.

CRT support is needed to make these security checks work. When a security check failure occurs, control of the program is passed to __security_error_handler(), which is summarized here:

void __cdecl __security_error_handler(int code, void *data)
{
    if (user_handler != NULL) {
      __try {
        user_handler(code, data);
      } __except (EXCEPTION_EXECUTE_HANDLER) {}
    } else {
      //...prepare outmsg...

      __crtMessageBoxA(
          outmsg,
          "Microsoft Visual C++ Runtime Library",
          MB_OK|MB_ICONHAND|MB_SETFOREGROUND|MB_TASKMODAL);
    }
    _exit(3);
}

By default, an application that fails a security check displays a dialog that states "Buffer overrun detected!". When the dialog is dismissed, the application terminates.

Are (bool)(i & 1) and i % 2 == 1 same?

26 votes

Are (bool)(i & 1) and i % 2 == 1 always same where i is int?

Note: saying always I mean for all platforms (even when a byte is 16 bit) and for all standards of C and C++.

Edit:

For all standards of C and C++ where bool exist.

No.

1s' complement representation of int, the representation of -1 is 1 ... 10, so they differ.

Anyway, i % 2 can be negative for negative i (indeed it's required to be in C99 when it's not 0), and hence not equal to 1 for negative odd numbers.

sizeof taking two arguments

25 votes

In C.1.3 of the C++ IS (2003. It's in the C++11 IS, too), the standard points out a difference between ISO C and C++; namely, for

char arr[100];

sizeof(0, arr) returns sizeof(char*) in C, but 100 in C++.

I can find no documentation for sizeof taking two arguments. The obvious fallback is the comma operator, but I don't think so: sizeof(arr) in C is 100; sizeof(0, arr) is sizeof(char*). Both sizeof(0, arr) and sizeof(arr) are 100 in C++.

I may be missing the whole point of the IS in this context. Can anyone help? This is similar to a question discussed back in '09, but no one referred to the IS, and I don't think the correct answer was given.


Edit: Actually, the IS is talking about the comma operator. So, for some reason (0, arr) returns a char* in C, but a char[100] in C++. Why?

In C then the array is decaying to a pointer, because of the different specification of the comma operator with relation to rvalues and lvalues (not the only place such a difference can be found). In C++ then the array stays an array, yielding the correct result.

What does the tilde (~) in macros mean?

25 votes

Seen on this site, the code shows macro invocations using a tilde in parentheses:

HAS_COMMA(_TRIGGER_PARENTHESIS_ __VA_ARGS__ (~))
//                                          ^^^

What does it mean / do? I suspect it to just be an empty argument, but I'm not sure. Is it maybe specific to C(99) like the __VA_ARGS__ is specific to C99 and existent in C++?

On the introduction page of Boost.Preprocessor, an example is given in A.4.1.1 Horizontal Repetition

#define TINY_print(z, n, data) data

#define TINY_size(z, n, unused)                                 \
  template <BOOST_PP_ENUM_PARAMS(n, class T)>                   \
  struct tiny_size<                                             \
      BOOST_PP_ENUM_PARAMS(n,T)                                 \
      BOOST_PP_COMMA_IF(n)                                      \
      BOOST_PP_ENUM(                                            \
          BOOST_PP_SUB(TINY_MAX_SIZE,n), TINY_print, none)      \
  >                                                             \
    : mpl::int_<n> {};

BOOST_PP_REPEAT(TINY_MAX_SIZE, TINY_size, ~) // Oh! a tilde!

#undef TINY_size
#undef TINY_print

An explanation is provided below:

The code generation process is kicked off by calling BOOST_PP_REPEAT, a higher-order macro that repeatedly invokes the macro named by its second argument (TINY_size). The first argument specifies the number of repeated invocations, and the third one can be any data; it is passed on unchanged to the macro being invoked. In this case, TINY_size doesn't use that data, so the choice to pass ~ was arbitrary. [5]

(emphasis mine)

And there is the note:

[5] ~ is not an entirely arbitrary choice. Both @ and $ might have been good choices, except that they are technically not part of the basic character set that C++ implementations are required to support. An identifier like ignored might be subject to macro expansion, leading to unexpected results.

The tilde, therefore, is simply a place holder because an argument is required, but none is necessary. Since any user-defined identifier wannabe could be expanded, you need to use something else.

It turns out that ~ is pretty much unused (binary negation is not that often called) in comparison to + or - for example, so there is little chance of confusion. Once you've settled on this, using it consistently gives it a new meaning to the tilde; like using operator<< and operator>> for streaming data has become a C++ idiom.

Is memory encrypted?

24 votes

I want to store some data in a variable (and I know variables are stored in memory). Does that data in memory get encrypted? Also, is it possible for software to be able to read the variable names stored in memory and be able to actually extract the data from it?

Memory is not encrypted on any platform I know about. It would be of limited value anyway, because the processor must, in general, operate on plaintext data, so the data must be in plaintext on the machine somewhere.

Instead, modern operating systems (and most historical ones) use memory protection to allow only certain processes access to certain memory pages. Every memory page comes with read, write, and (sometimes) execute permissions. The operating system kernel is in charge of handling those permissions on context switch to grant or deny access to memory pages per-process as needed.

Saltzer and Schroeder's 1975 paper The Protection of Information in Computer Systems describe a mechanism using segments, rather than pages, but the principle has remained unchanged for decades.

Typically, any process-owned memory page is readable by a process with high-enough privilege; the OS kernel certainly can modify any page of memory, and it can choose to delegate that privilege to user processes too. The ptrace(2) system call on Linux provides a debugger-backdoor that can be used to implement read-only memory inspection systems such as strace(1) or ltrace(1) or gdb(1), or memory-modification systems such as gdb(1) and ptrace-based sandbox environments.

Or, a core file can be dumped, under certain situations (see core(5) and setrlimit(2) manpages), containing the contents of the process's memory. This is one reason why it is important to clear memory of important data before release.

I was part of a team that worked on encrypting pointers (non-PTO link) in running programs. The overhead was amazing, and the number of corner cases was even more astonishing. Using these techniques for common programs is probably not practical, though I could imagine a restricted environment where encrypted memory or control structures is a feasible approach. (Though probably other techniques would be more appropriate.)

Can different optimization levels lead to functionally different code?

20 votes

I am curious about the liberties that a compiler has when optimizing. Let's limit this question to GCC and C/C++ (any version, any flavour of standard):

Is it possible to write code which behaves differently depending on which optimization level it was compiled with?

The example I have in mind is printing different bits of text in various constructors in C++ and getting a difference depending on whether copies are elided (though I've not been able to make such a thing work).

Counting clock cycles is not permitted. If you have an example for a non-GCC compiler, I'd be curious, too, but I can't check it. Bonus points for an example in C. :-)

Edit: The example code should be standard compliant and not contain undefined behaviour from the outset.

Edit 2: Got some great answers already! Let me up the stakes a bit: The code must constitute a well-formed program and be standards-compliant, and it must compile to correct, deterministic programs in every optimization level. (That excludes things like race-conditions in ill-formed multithreaded code.) Also I appreciate that floating point rounding may be affected, but let's discount that.

I just hit 800 reputation, so I think I shall blow 50 reputation as bounty on the first complete example to conform to (the spirit) of those conditions; 25 if it involves abusing strict aliasing. (Subject to someone showing me how to send bounty to someone else.)

The portion of the C++ standard that applies is §1.9 "Program execution". It reads, in part:

conforming implementations are required to emulate (only) the observable behavior of the abstract machine as explained below. ...

A conforming implementation executing a well-formed program shall produce the same observable behavior as one of the possible execution sequences of the corresponding instance of the abstract machine with the same program and the same input. ...

The observable behavior of the abstract machine is its sequence of reads and writes to volatile data and calls to library I/O functions. ...

So, yes, code may behave differently at different optimization levels, but (assuming that all levels produce a conforming compiler), but they cannot behave observably differently.

EDIT: Allow me to correct my conclusion: Yes, code may behave differently at different optimization levels as long as each behavior is observably identical to one of the behaviors of the standard's abstract machine.