Best c++ questions in June 2012

Why is processing a sorted array faster than an unsorted array?

1088 votes

Here is a piece of code that shows some very peculiar performance. For some strange reason, sorting the data miraculously speeds up the code by almost 6x:

#include <algorithm>
#include <ctime>
#include <iostream>

int main()
{
    // generate data
    const unsigned arraySize = 32768;
    int data[arraySize];

    for (unsigned c = 0; c < arraySize; ++c)
        data[c] = std::rand() % 256;


    // !!! with this, the next loop runs faster
    std::sort(data, data + arraySize);


    // test
    clock_t start = clock();
    long long sum = 0;

    for (unsigned i = 0; i < 100000; ++i)
    {
        // primary loop
        for (unsigned c = 0; c < arraySize; ++c)
        {
            if (data[c] >= 128)
                sum += data[c];
        }
    }

    double elapsedTime = static_cast<double>(clock() - start) / CLOCKS_PER_SEC;

    std::cout << elapsedTime << std::endl;
    std::cout << "sum = " << sum << std::endl;
}
  • Without std::sort(data, data + arraySize);, the code runs in 11.54 seconds.
  • With the sorted data, the code runs in 1.93 seconds.

Initially I thought this might be just a language or compiler anomaly. So I tried it Java...

import java.util.Arrays;
import java.util.Random;

public class Main
{
    public static void main(String[] args)
    {
        // generate data
        int arraySize = 32768;
        int data[] = new int[arraySize];

        Random rnd = new Random(0);
        for (int c = 0; c < arraySize; ++c)
            data[c] = rnd.nextInt() % 256;


        // !!! with this, the next loop runs faster
        Arrays.sort(data);


        // test
        long start = System.nanoTime();
        long sum = 0;

        for (int i = 0; i < 100000; ++i)
        {
            // primary loop
            for (int c = 0; c < arraySize; ++c)
            {
                if (data[c] >= 128)
                    sum += data[c];
            }
        }

        System.out.println((System.nanoTime() - start) / 1000000000.0);
        System.out.println("sum = " + sum);
    }
}

with a similar but less extreme result.


My first thought was that sorting brings the data into cache, but my next thought was how silly that is because the array was just generated.

What is going on? Why is a sorted array faster than an unsorted array? The code is summing up some independent terms, the order should not matter.

You are the victim of branch prediction fail.


What is Branch Prediction?

Consider a railroad junction:

enter image description here Image by Mecanismo, from Wikimedia Commons: http://commons.wikimedia.org/wiki/File:Entroncamento_do_Transpraia.JPG

Now for the sake of argument, suppose this is back in the 1800s - before long distance or radio communication.

You are the operator of a junction and you hear a train coming. You have no idea which way it will go. You stop the train to ask the captain which direction he wants. And then you set the switch appropriately.

Trains are heavy and have a lot of momentum. So they take forever to start up and slow down.

Is there a better way? You guess which direction the train will go!

  • If you guessed right, it continues on.
  • If you guessed wrong, the captain will stop, back up, and yell at you to flip the switch. Then it can restart down the other path.

If you guess right every time, the train will never have to stop.
If you guess wrong too often, the train will spend a lot of time stopping, backing up, and restarting.


Consider an if-statement: At the processor level, it is a branch instruction:

enter image description here

You are a processor and you see a branch. You have no idea which way it will go. What do you do? You halt execution and wait until the previous instructions are complete. Then you continue down the correct path.

Modern processors are complicated and have long pipelines. So they take forever to "warm up" and "slow down".

Is there a better way? You guess which direction the branch will go!

  • If you guessed right, you continue executing.
  • If you guessed wrong, you need to flush the pipeline and roll back to the branch. Then you can restart down the other path.

If you guess right every time, the execution will never have to stop.
If you guess wrong too often, you spend a lot of time stalling, rolling back, and restarting.


This is branch prediction. I admit it's not the best analogy since the train could just signal the direction with a flag. But in computers, the processor doesn't know which direction a branch will go until the last moment.

So how would you strategically guess to minimize the number of times that the train must back up and go down the other path? You look at the past history! If the train goes left 99% of the time, then you guess left. If it alternates, then you alternate your guesses. If it goes one way every 3 times, you guess the same...

In other words, you try to identify a pattern and follow it. This is more or less how branch predictors work.

Most applications have well-behaved branches. So modern branch predictors will typically achieve >90% hit rates. But when faced with unpredictable branches with no recognizable patterns, branch predictors are virtually useless.

Further Reading: http://en.wikipedia.org/wiki/Branch_predictor


As hinted from above, the culprit is this if-statement:

if (data[c] >= 128)
    sum += data[c];

Notice that the data is evenly distributed between 0 and 255. When the data is sorted, roughly the first half of the iterations will not enter the if-statement. After that, they will all enter the if-statement.

This is very friendly to the branch predictor since the branch consecutively goes the same direction many times. Even a simple saturating counter will correctly predict the branch except for the few iterations after it switches direction.

Quick visualization:

T = branch taken
N = branch not taken

data[] = 0, 1, 2, 3, 4, ... 126, 127, 128, 129, 130, ... 250, 251, 252, ...
branch = N  N  N  N  N  ...   N    N    T    T    T  ...   T    T    T  ...

       = NNNNNNNNNNNN ... NNNNNNNTTTTTTTTT ... TTTTTTTTTT  (easy to predict)

However, when the data is completely random, the branch predictor is rendered useless because it can't predict random data. Thus there will probably be around 50% misprediction. (no better than random guessing)

data[] = 226, 185, 125, 158, 198, 144, 217, 79, 202, 118,  14, 150, 177, 182, 133, ...
branch =   T,   T,   N,   T,   T,   T,   T,  N,   T,   N,   N,   T,   T,   T,   N  ...

       = TTNTTTTNTNNTTTN ...   (completely random - hard to predict)

So what can be done?

If the compiler isn't able to optimize the branch into a conditional move, you can try some hacks if you are willing to sacrifice readability for performance.

Replace:

if (data[c] >= 128)
    sum += data[c];

with:

int t = (data[c] - 128) >> 31;
sum += ~t & data[c];

This eliminates the branch and replaces it with some bitwise operations.

(Note that this hack is not strictly equivalent to the original if-statement. But in this case, it's valid for all the input values of data[].)

Benchmarks: Core i7 920 @ 3.5 GHz

C++ - Visual Studio 2010 - x64 Release

//  Branch - Random
seconds = 11.777

//  Branch - Sorted
seconds = 2.352

//  Branchless - Random
seconds = 2.564

//  Branchless - Sorted
seconds = 2.587

Java - Netbeans 7.1.1 JDK 7 - x64

//  Branch - Random
seconds = 10.93293813

//  Branch - Sorted
seconds = 5.643797077

//  Branchless - Random
seconds = 3.113581453

//  Branchless - Sorted
seconds = 3.186068823

Observations:

  • With the Branch: There is a huge difference between the sorted and unsorted data.
  • With the Hack: There is no difference between sorted and unsorted data.
  • In the C++ case, the hack is actually a tad slower than with the branch when the data is sorted.

A general rule of thumb is to avoid data-dependent branching in critical loops. (such as in this example)


Update :

  • GCC 4.6.1 with -O3 or -ftree-vectorize on x64 is able to generate a conditional move. So there is no difference between the sorted and unsorted data - both are fast.

  • VC++ 2010 is unable to generate conditional moves for this branch even under /Ox.

  • Intel Compiler 11 does something miraculous. It interchanges the two loops, thereby hoisting the unpredictable branch to the outer loop. So not only is it immune the mispredictions, it is also twice as fast as whatever VC++ and GCC can generate! In other words, ICC took advantage of the test-loop to defeat the benchmark...

  • If you give the Intel Compiler the branchless code, it just out-right vectorizes it... and is just as fast as with the branch (with the loop interchange).

This goes to show the even the mature modern compilers can vary wildly in their ability to optimize code...

fork() branches more than expected?

94 votes

Consider the following piece of code:

int main()
{
    int i;
    for(i = 0; i < 2; i++)
    {
        fork();
        printf(".");
    }
}

This program output is 8 dots. How can that be possible? Should not there be 6 dots instead?

The fork() primitive often stretches the imagination. Until you get a feel for it, you should trace out on paper what each operation is and account for the number of processes. Don't forget that fork() creates a near-perfect copy of the current process. The most significant difference (for most purposes) is that fork()'s return value differs between parent and child. (Since this code ignores the return value, it makes no difference.)

So, at first, there is one process. That creates a second process, both of which print a dot and loop. On their second iteration, each creates another copy, so there are four processes print a dot, and then exit. So we can easily account for six dots, like you expect.

However, what printf() really does is buffer its output. So the first dot from when there were only two processes does not appear when written. Those dots remain in the buffer—which is duplicated at fork(). It is not until the process is about to exit that the buffered dot appears. Four processes printing a buffered dot, plus the new one gives 8 dots.

If you wanted to avoid that behavior, call fflush(stdout); after printf().

Why do people say there is modulo bias when using a random number generator?

78 votes

I have seen this question asked a lot but never seen a true concrete answer to it. So I am going to post one here which will hopefully help people understand why exactly there is "modulo bias" when using a random number generator, like rand() in C++.

So rand() is a pseudo-random number generator which chooses a natural number between 0 and RAND_MAX, which is a constant defined in cstdlib (see this article for a general overview on rand()).

Now what happens if you want to generate a random number between say 0 and 2. For the sake of explanation, lets say RAND_MAX was 10 and I decide that the best way to generate a random number between 0 and 2 is to do rand()%3. Assuming rand() does generate each number between 0 and 10 with equal probability, (this is arguable but for this post I will assume it does), why would rand()%3 not produce the numbers between 0 and 2 with equal probability? When rand() returns 0, 3, 6, or 9, rand()%3 == 0. When rand() returns 1, 4, 7, or 10, rand()%3 == 1. When rand() returns 2, 5, or 8, rand()%3 == 2. Now if we analyze this statistically, we very quickly see that the probability of getting a 0 is 4/11, 1 is 4/11 but 2 is 3/11. This does not generate the numbers between 0 and 2 with equal probability. Of course for small ranges this might not be the biggest issue but for a larger range this could skew the distribution, biasing the smaller numbers.

So when does rand()%n return a range of numbers from 0 to n-1 with equal probability? When RAND_MAX%n == n - 1. In this case, along with our earlier assumption rand() does return a number between 0 and RAND_MAX with equal probability, the modulo classes of n would also be equally distributed.

So how do we solve this problem? One way is to keep generating random numbers till you get a number in your desired range:

int x; 
do
{
    x = rand();
} while (x >= n);

Hope that helps everyone!

Works cited and further reading:

Including header files in C/C++ just once

37 votes

Is it ever useful to include a header file more than once in C or C++?

If the mechanism is never used, why would the compiler ever worry about including a file twice; if it really were useless, wouldn't it be more convenient if newer compilers made sure every header is included only once?

Edit:

I understand that there are standard ways of doing things like include guards and pragma once, but why should you have to specify even that? Shouldn't it be the default behavior of the compiler to include files only once?

Yes, it's useful when generating code with the preprocessor, or doing tricks like Boost.PP does.

For an example, see X Macros. The basic idea is that the file contains the body of the macro and you #define the arguments and then #include it. Here's a contrived example:

macro.xpp

std::cout << MESSAGE;
#undef MESSAGE

file.cpp:

int main() {
# define MESSAGE "hello world"
# include "macro.xpp"
}

This also allows you to use #if and friends on the arguments, something that normal macros can't do.

How can I specialize a C++ template for a range of values?

32 votes

Is there a way to have a template specialization based on a range of values instead of just one? I know the following code is not valid C++ code but it shows what I would like to do. I'm writing code for a 8-bit machine, so there is a difference in speed for using ints and chars.

template<unsigned SIZE>
class circular_buffer {
   unsigned char buffer[SIZE];
   unsigned int head; // index
   unsigned int tail; // index
};

template<unsigned SIZE <= 256>
class circular_buffer {
   unsigned char buffer[SIZE];
   unsigned char head; // index
   unsigned char tail; // index
};

Try std::conditional:

#include <type_traits>

template<unsigned SIZE>
class circular_buffer {

    typedef typename
        std::conditional< SIZE < 256,
                          unsigned char,
                          unsigned int
                        >::type
        index_type;

    unsigned char buffer[SIZE];
    index_type head;
    index_type tail;
};

If your compiler doesn't yet support this part of C++11, there's equivalent in boost libraries.

Then again, it's easy to roll your own (credit goes to KerrekSB):

template <bool, typename T, typename F>
struct conditional {
    typedef T type;
};

template <typename T, typename F>  // partial specialization on first argument
struct conditional<false, T, F> {
    typedef F type;
}; 

Does moving a vector invalidate iterators?

31 votes

If I have an iterator to a vector, then I move-construct or move-assign another vector from that vector, does that iterator still point to a valid element in the new vector? Here's a simple example:

#include <vector>
#include <iostream>

int main(int argc, char *argv[])
{
    std::vector<int>::iterator a_iter;
    std::vector<int> b;
    {
        std::vector<int> a{1, 2, 3, 4, 5};
        a_iter = a.begin() + 2;
        b = std::move(a);
    }
    std::cout << *a_iter << std::endl;
    return 0;
}

Is a_iter still valid since a has been moved into b, or is the iterator invalidated by the move? For reference, std::vector::swap does not invalidate iterators.

While it might be reasonable to assume that iterators are still valid after a move, I don't think the Standard actually guarantees this. Therefore, the iterators are in an undefined state after the move.


There is no reference I can find in the Standard which specifically states that iterators that existed before a move are still valid after the move.

On the surface, it would seem to be perfectly reasonable to assume that an iterator is typically implemented as pointers in to the controlled sequence. If that's the case, then the iterators would still be valid after the move.

But the implementation of an iterator is implementation-defined. Meaning, so long as the iterator on a particular platform meets the requirements set forth by the Standard, it can be implemented in any way whatsoever. It could, in theory, be implemented as a combination of a pointer back to the vector class along with an index. If that's the case, then the iterators would become invalid after the move.

Whether or not an iterator is actually implemented this way is irrelevant. It could be implemented this way, so without a specific guarantee from the Standard that post-move iterators are still valid, you cannot assume that they are. Bear in mind also that there is such a guarantee for iterators after a swap. This was specifically clarified from the previous Standard. Perhaps it was simply an oversight of the Std comittee to not make a similar clarification for iterators after a move, but in any case there is no such guarantee.

Therefore, the long and the short of it is you can't assume your iterators are still good after a move.

EDIT:

23.2.1/11 in Draft n3242 states that:

Unless otherwise specified (either explicitly or by defining a function in terms of other functions), invoking a container member function or passing a container as an argument to a library function shall not invalidate iterators to, or change the values of, objects within that container.

This might lead one to conclude that the iterators are valid after a move, but I disagree. In your example code, a_iter was an iterator in to the vector a. After the move, that container, a has certainly been changed. My conclusion is the above clause does not apply in this case.

What's "wrong" with C++ wchar_t and wstrings? What are some alternatives to wide characters?

28 votes

I have seen a lot of people in the C++ community(particularly ##c++ on freenode) resent the use of wstrings and wchar_t, and their use in the windows api. What is exactly "wrong" with wchar_t and wstring, and if I want to support internationalization, what are some alternatives to wide characters?

What is wchar_t?

wchar_t is defined such that any locale's char encoding can be converted to wchar_t where every wchar_t represents exactly one codepoint:

Type wchar_t is a distinct type whose values can represent distinct codes for all members of the largest extended character set specified among the supported locales (22.3.1).

                                                                               — C++ [basic.fundamental] 3.9.1/5

This does not require that wchar_t be large enough to represent any character from all locales simultaneously. That is, the encoding used for wchar_t may differ between locales. Which means that you cannot necessarily convert a string to wchar_t using one locale and then convert back to char using another locale.

Since that seems to be the primary use in practice for wchar_t you might wonder what it's good for if not that.

The original intent and purpose of wchar_t was to make text processing simple by defining it such that it requires a one-to-one mapping from a string's code-units to the text's characters, thus allowing the use of the same simple algorithms as are used with ascii strings to work with other languages.

Unfortunately the requirements on wchar_t assume a one-to-one mapping between characters and codepoints to achieve this. Unicode breaks that assumption, so you can't safely use wchar_t for simple text algorithms either.

This means that portable software cannot use wchar_t either as a common representation for text between locales, or to enable the use of simple text algorithms.

What use is wchar_t today?

Not much, for portable code anyway. If __STDC_ISO_10646__ is defined then values of wchar_t directly represent Unicode codepoints with the same values in all locales. That makes it safe to do the inter-locale conversions mentioned earlier. However you can't rely only on it to decide that you can use wchar_t this way because, while most unix platforms define it, Windows does not even though Windows uses the same wchar_t locale in all locales.

The reason Windows doesn't define __STDC_ISO_10646__ I think is because Windows use UTF-16 as its wchar_t encoding, and because UTF-16 uses surrogate pairs to represent codepoints greater than U+FFFF, which means that UTF-16 doesn't satisfy the requirements for __STDC_ISO_10646__.

For platform specific code wchar_t may be more useful. It's essentially required on Windows (e.g., some files simply cannot be opened without using wchar_t filenames), though Windows is the only platform where this is true as far as I know (so maybe we can think of wchar_t as 'Windows_char_t').

In hindsight wchar_t is clearly not useful for simplifying text handling, or as storage for locale independent text. Portable code should not attempt to use it for these purposes. Non-portable code may find it useful simply because some API requires it.

Alternatives

The alternative I like is to use UTF-8 encoded C strings, even on platforms not particularly friendly with UTF-8.

This way one can write portable code using a common text representation across platforms, use standard datatypes for their intended purpose, get the language's support for those types (e.g. string literals, though some tricks are necessary to make it work for some compilers), some standard library support, debugger support (more tricks may be necessary), etc. With wide characters it's generally harder or impossible to get all of this, and you may get different pieces on different platforms.

Many platforms use UTF-8 as their native char encoding, and so writing an internationalized program on those platforms is little different from writing code without considering internationalization. Writing more widely portable code, or writing on other platforms requires inserting conversions at the boundaries of APIs that use something else.

Another alternative used by some software is to choose a cross-platform representation, such as unsigned short arrays holding UTF-16 data, and then to supply all the library support and simply live with the costs in language support, etc.

C++11 adds new kinds of wide characters as alternatives to wchar_t, char16_t and char32_t with attendant language/library features. These aren't actually guaranteed to be UTF-16 and UTF-32, but I don't imagine any major implementation will use anything else. C++11 also improves UTF-8 support, for example with UTF-8 string literals so it won't be necessary to trick VC++ into producing UTF-8 encoded strings (although I may continue to do so rather than use the u8 prefix).

Alternatives to avoid

TCHAR: TCHAR is for migrating ancient Windows programs that assume legacy encodings from char to wchar_t, and is best forgotten unless your program was written in some previous millennium. It's not portable and is inherently unspecific about its encoding and even its data type, making it unusable with any non-TCHAR based API. Since its purpose is migration to wchar_t, which we've seen above isn't a good idea, there is no value whatsoever in using TCHAR.

Why does C++ mandate that complex only be instantiated for float, double, or long double?

25 votes

According to the C++ ISO spec, §26.2/2:

The effect of instantiating the template complex for any type other than float, double or long double is unspecified.

Why would the standard authors explicitly add this restriction? This makes it unspecified, for example, what happens if you make complex<int> or a complex<MyCustomFixedPointType> and seems like an artificial restriction.

Is there a reason for this limitation? Is there a workaround if you want to instantiate complex with your own custom type?

I'm primarily asking this question because of this earlier question, in which the OP was confused as to why abs was giving bizarre outputs for complex<int>. That said, this still doesn't quite make sense given that we also might want to make complex numbers out of fixed-points types, higher-precision real numbers, etc.

Thanks!

You can't properly implement many of the std::complex operations on integers. E.g.,

template <class T>
T abs(const complex<T> &z);

for a complex<long> cannot have T = long return value when complex numbers are represented as (real,imag) pairs, since it returns the value of sqrt(pow(z.real(), 2) + pow(z.imag(), 2)). Only a few of the operations would make sense.

Even worse, the polar named constructor cannot be made reliable without breaking the default constructor and vice versa. The standard would have to specify that "complex integers" are Gaussian integers for them to be of any use, and that one of the constructors is severely broken.

Finally, how would you like your "complex integer division" served, and would you like a "complex remainder" with that? :)

Summarizing, I think it would be more sensible to specify a separate gaussian_int<T> type with just a few operations than graft support for integral T onto std::complex.

Is const a lie? (since const can be cast away)

22 votes

Possible Duplicate:
Sell me on const correctness

What is the usefulness of keyword const in C or C++ since it's allowed such a thing?

void const_is_a_lie(const int* n)
{ 
    *((int*) n) = 0;
}

int main()
{
    int n = 1;
    const_is_a_lie(&n);
    printf("%d", n);
    return 0;
}

Output: 0

It is clear that const cannot guarante the non-modifiability of the argument.

const is a promise you make to the compiler, not something it guarantees you.

See http://ideone.com/Ejogb

Because of the const, the compiler is allowed to assume that the value won't change, and therefore it can skip rereading it, if that would make the program faster.

In this case, since const_is_a_lie() violates its contract, weird things happen. Don't violate the contract. And be glad that the compiler gives you help keeping the contract. Casts are evil.

Java and c++ encrypted results not matching

21 votes

I have a existing c++ code which will encrypt a string. Now I did the same encryption in . Some of the encrypted strings are matching . Some are mismatching in one or two characters.

I am unable to figure out why it is happening. I ran both the codes in debug mode until they call their libraries both have the same key, salt, iv string to be encrypted.

I know that even if a single byte padding change will modify encrypted string drastically. But here I am just seeing a one or two characters change. Here is a sample (Bold characters in between stars is the part mis matching)

java:

U2FsdGVkX18xMjM0NTY3OGEL9nxFlHrWvodMqar82NT53krNkqat0rrgeV5FAJFs1vBsZIJPZ08DJVrQ*Pw*yV15HEoyECBeAZ6MTeN+ZYHRitKanY5jiRU2J0KP0Fzola

C++:

U2FsdGVkX18xMjM0NTY3OGEL9nxFlHrWvodMqar82NT53krNkqat0rrgeV5FAJFs1vBsZIJPZ08DJVrQ*jQ*yV15HEoyECBeAZ6MTeN+ZYHRitKanY5jiRU2J0KP0Fzola

I am using AES encryption. provider is SunJCE version 1.6. I tried changing provider to Bouncy Castle. Even then result is same.

Added One More sample:

C++:

U2FsdGVkX18xMjM0NTY3O*I*/BMu11HkHgnkx+dLPDU1lbfRwb+aCRrwkk7e9dy++MK+/94dKLPXaZDDlWlA3gdUNyh/Fxv*oF*STgl3QgpS0XU=

java:

U2FsdGVkX18xMjM0NTY3O*D*/BMu11HkHgnkx+dLPDU1lbfRwb+aCRrwkk7e9dy++MK+/94dKLPXaZDDlWlA3gdUNyh/Fxv*j9*STgl3QgpS0XU=

UPDATE:

As per the comments I feel base 64 encryption is the culprit. I am using Latin-1 char set in both places. Anything else that I can check

Sigh!!

The problem almost certainly is that after you encrypt the data and receive the encrypted data as a byte string, you are doing some sort of character conversion on the data before sending it through Base-64 conversion.

Note that if you encrypt the strings "ABCDEFG" and "ABCGEFG", the encrypted output will be completely different starting with the 4th character, and continuing to the end. In other words, the Base-64 outputs would be something like (using made-up values):

U2FsdGVkX18xMj

and

U2FsdGXt91mJpz

The fact that, in the above examples, only two isolated Base-64 characters (one byte) are messed up in each case pretty much proves that the corruption occurs AFTER encryption.

The output of an encryption process is a byte sequence, not a character sequence. The corruption observed is consistent with erroneously interpreting the bytes as characters and attempting to perform a code page conversion on them, prior to feeding them into the Base-64 converter. The output from the encryptor should be fed directly into the Base-64 converter without any conversions.

You say you are using the "Latin-1 char set in both places", a clear sign that you are doing some conversion you should not be doing -- there should be no need to muck with char sets.

C++ vector that *doesn't* initialize its members?

20 votes

I'm making a C++ wrapper for a piece of C code that returns a large array, and so I've tried to return the data in a vector<unsigned char>.

Now the problem is, the data is on the order of megabytes, and vector unnecessarily initializes its storage, which essentially turns out to cut down my speed by half.

How do I prevent this?

Or, if it's not possible -- is there some other STL container that would avoid such needless work? Or must I end up making my own container?

(Pre-C++11)

Note:

I'm passing the vector as my output buffer. I'm not copying the data from elsewhere.
It's something like:

vector<unsigned char> buf(size);   // Why initialize??
GetMyDataFromC(&buf[0], buf.size());

For default and value initialization structs with user-provided default constructors that don't initialized anything, no initialization is performed on unsigned char members:

struct uninitialized_char {
    unsigned char m;
    uninitialized_char() {}
};

// just to be safe
static_assert(1 == sizeof(uninitialized_char), "");

std::vector<uninitialized_char> v(4 * (1<<20));

GetMyDataFromC(reinterpret_cast<unsigned char*>(&v[0]), v.size());

I think this is even legal under the strict aliasing rules.

When I compared the construction time for v vs. a vector<unsigned char> I got ~8 µs vs ~12 ms. More than 1000x faster. Compiler was clang 3.2 with libc++ and flags: -std=c++11 -Os -fcatch-undefined-behavior -ftrapv -pedantic -Weverything -Wno-c++98-compat -Wno-c++98-compat-pedantic -Wno-missing-prototypes

C++11 has a helper for uninitialized storage, std::aligned_storage. Though it requires a compile time size.

Call a function before main

16 votes

Possible Duplicate:
Is main() really start of a C++ program?

Is possible to call my function before program's startup? How can i do this work in C++ or C?

You can have a global variable or a static class member.

1) static class member

//BeforeMain.h
class BeforeMain
{
    static bool foo;
};

//BeforeMain.cpp
#include "BeforeMain.h"
bool BeforeMain::foo = foo();

2) global variable

bool b = foo();
int main()
{
}

Note this link - http://www.parashift.com/c++-faq-lite/ctors.html#faq-10.14 - posted by Lundin.

printing float, preserving precision

14 votes

I am writing a program that prints floating point literals to be used inside another program.

How many digits do I need to print in order to preserve the precision of the original float?

Since a float has 24 * (log(2) / log(10)) = 7.2247199 decimal digits of precision, my initial thought was that printing 8 digits should be enough. But if I'm unlucky, those 0.2247199 get distributed to the left and to the right of the 7 significant digits, so I should probably print 9 decimal digits.

Is my analysis correct? Is 9 decimal digits enough for all cases? Like printf("%.9g", x);?

Is there a standard function that converts a float to a string with the minimum number of decimal digits required for that value, in the cases where 7 or 8 are enough, so I don't print unnecessary digits?

Note: I cannot use hexadecimal floating point literals, because standard C++ does not support them.

In order to guarantee that a binary->decimal->binary roundtrip recovers the original binary value, IEEE 754 requires


The original binary value will be preserved by converting to decimal and back again using:[10]

    5 decimal digits for binary16
    9 decimal digits for binary32
    17 decimal digits for binary64
    36 decimal digits for binary128

For other binary formats the required number of decimal digits is

    1 + ceiling(p*log10(2)) 

where p is the number of significant bits in the binary format, e.g. 24 bits for binary32.

In C, the functions you can use for these conversions are snprintf() and strtof/strtod/strtold().

Of course, in some cases even more digits can be useful (no, they are not always "noise", depending on the implementation of the decimal conversion routines such as snprintf() ). Consider e.g. printing dyadic fractions.

The named loop idiom : dangerous?

13 votes

I've read an article about the "Named Loop Idiom" in C++ : http://en.wikibooks.org/wiki/More_C%2B%2B_Idioms/Named_Loop

This idiom allows us to write things like that :

named(outer) 
for(int i = 0 ; i < rows ; ++i) {

   named(inner) 
   for(int j = 0 ; j < cols ; ++j) {

        if(some_condition)
            break(outer);   // exit the 'outer' loop 

   }
}

Such constructs already exists as core feature in many languages, like Java for instance.

According to the article, it can be implemented in C++ by defining two evil macros :

#define named(blockname) goto blockname; \
                         blockname##_skip: if (0) \
                         blockname:

#define break(blockname) goto blockname##_skip;

I know that many people would like to banish the use of goto. I personally found it helpful in very rare cases, especially when I wanted to break a bunch of nested loops. This idiom appears to me as a cleaner solution for that, but is it ok to use it in real code ?

On the discussion page of the article, one can read :

"Do not do this. You'll end up in hell"

So my questions are : What are the drawbacks of using the named loop idiom ? Is it dangerous ? If yes, why ?

Bonus question : is it possible to implement named continue similarly ? (I think it's not possible using the named(...) for(...;...;...) {} syntax, but who knows ?)

EDIT : I agree with you, redefining a keyword is nasty. What about using #define breakLoop() instead?

As covered in the comments, #defining break is problematic. Let's assume you use something else.

I'd still argue that this is dangerous. It's an extremely unusual idiom (to C++ programmers), so they're less likely to understand, and thus they might make breaking changes. Given that there are less-surprising--and therefore less-dangerous--ways to accomplish the same thing, I would advise against it.

Consider putting the loops in a function or a lambda. Then you can return to break out of the outer loop. As a benefit, you can return information about the premature exit, which may be useful to the outer code.

How to log or replay lines or instructions executed immediately before a crash

9 votes

Often I have to debug crashing C++ programs on Windows where I can reproduce the crash, but it is hard to determine what sequence of instructions in the code caused the crash (e.g. another thread overwriting memory of the crashing thread). Even a call stack does not help in that case. Usually I resort to narrowing down the crash cause by commenting out sections of the source code, but this is very tedious.

Does anyone know a tool for Windows that can report or replay the last few source code lines or machine code instructions executed in all threads immediately before a crash? I.e. something like the reverse debugging capability of gdb or something like Mutek's BugTrapper (which no longer is available). I am looking for a released and stable tool (I am aware of SoftwareVerify's 'Bug Validator' and Hexray's IDA Pro 6.3 Trace Replayer, both of which still are in closed beta programs).

What I already tried were the WinDbg trace commands wt and ta @$ra, but both commands have the disadvantage that they stop automatically after a few seconds. I require trace commands that run until the crash happens, and that trace all threads of the running program.

NOTE: I am not looking for a debug tool designed to fix a particular problem, like gflags, pageheap, Memory Validator, Purify, etc. I am looking for released and stable tool to trace or replay at the instruction level.

I found a solution: "replay debugging" using VMware Workstation and Visual Studio 2010. Setting it up takes a lot of time, but you are rewarded with a Visual Studio C++ debugger that can debug backwards in time. Here is a video that demonstrates how replay debugging works: http://blogs.vmware.com/workstation/2010/01/replay-debugging-try-it-today.html.

A drawback of the solution is that VMware seemingly has discontinued replay debugging in the latest VMware versions. Furthermore, only certain processor types seem to support replaying. I have not found any comprehensive list of supported processors; I tested the replay features on three of my PCs: replaying did not work on a Core i7 200; replaying worked on a Core2 6700 and on a Core2 Q9650.

I really hope that VMware reconsiders and introduces replay debugging again in future VMware Workstation versions, because this really adds a new dimension to debugging.

For those of you who are interested, here is a description how you can set up an environment for replay debugging:

In the description below, "local debugging" means that Visual Studio and VMware are installed on the same PC. "Remote debugging" means that Visual Studio and VMware are installed on different PCs.

  • Install Visual Studio 2010 with SP1 on the host system.

  • Make sure Visual Studio has been configured to use Microsoft's symbol servers. (Under "Tools | Options | Debugging | Symbols").

  • On the host system, install "Debugging Tools for Windows".

  • Install VMware Workstation 7.1. (Version 8.0 no longer contains the replay debugging feature). This will also install a plug-in into Visual Studio.

  • Install a virtual machine (VM) on VMware with Windows XP SP3.

  • If the application under test is a debug build, install the Visual Studio debug DLLs on the VM. (See http://msdn.microsoft.com/en-us/library/dd293568.aspx for instructions how to do that, but use a "Debug" configuration instead of "Release").

  • Copy "gflags.exe" from the host's "Debugging Tools for Windows" directory to the VM, run gflags.exe on the VM, select "Disable paging of kernel stacks" under the "System Registry tab" and press OK. Reboot the VM.

  • Copy all EXE and DLL files of the application under test to the VM and make sure that you can start the application and reproduce the problem.

  • Shutdown the VM and create a snapshot (via context menu item "Take Snapshot" in VMware Workstation).

  • (Only for remote debugging:) Start the following command on the Visual Studio PC and enter an arbitrary passcode:

    C:\Program Files\VMware\VMware Workstation\Visual Studio Integrated Debugger\dclProxy.exe hostname

    Replace hostname by the name of the PC.

  • (Only for remote debugging:) Create a recording manually for the VM. I.e. log in to the VM's operating system, start the recording (via context menu "Record"), run the application under test and perform the actions necessary to reproduce the problem. Then stop and save the recording.

  • Start Visual Studio and go to "VMware | Options | Replay Debugging in VM | General", and set the following values:

    • "Local or Remote" must be set to "Local" for local debugging or to "Remote" for remote debugging.
    • "Virtual Machine" must be set to the path to the VM's .vmx file .
    • "Remote Machione Passcode" must be set to be passcode you used above (only for remote debugging).
    • "Recording to Replay" must be set to a recording name that you previously created with VMware.
    • "Host Executable Search Path" must be set to a directory in which you save DLLs which are required by the application under test and which are needed by Visual Studio to display correct stack traces.

    Press "Apply".

  • Go to "VMware | Options | Replay Debugging in VM | Pre-Record Event", and set the following values:

    • "Base Snapshot for Recording": name of snapshot created previously.

    Press "OK".

  • (For local debugging:) In Visual Studio, select "VMware | Create Recording for Replay"; this restarts the VM. Login to the VM, run the application under test and perform the actions necessary to reproduce the problem. Then stop and save the recording.

  • Select "VMware | Start Replay Debugging". VMware now automatically restarts the VM and the application under test and replays the recorded actions. Wait until the application crashes; the Visual Studio debugger then automatically becomes active.

  • In the Visual Studio debugger, set a breakpoint to a location where you think the application has been before the crash. Then, select "VMware | Reverse Continue". The debugger now runs backwards to the breakpoint. This operation can take some time because the VM will be restarted and replayed until your breakpoint is reached. (You can speed up this operation by adding a snapshot a few seconds before the crash happens when you record the scenario. You can add additional snapshots during replay debugging.)

  • Once VMware has replayed the VM to your breakpoint, you can use "Step Over" and "Step Into" to step forward from your breakpoint, i.e. you replay the recorded history of events, until you reach a point where you can identify the reason why your application crashed.

Further information: http://www.replaydebugging.com/

C++ regex not understanding

8 votes

The following outputs ">Hut" where I expect it to output "Hut". I know that .* is greedy but > must be matched and it is outside of the capture group so why is it in my submatch?

#include <string>
#include <regex>
#include <iostream>

using namespace std;

int main() {
        regex my_r(".*>(.*)");
        string temp(R"~(cols="64">Hut)~");
        smatch m;
        if (regex_match(temp, m, my_r)) {
                cout << m[1] << endl;
        }
}

This is a bug in libstdc++'s implementation. Watch these:

#include <string>
#include <regex>
#include <boost/regex.hpp>
#include <iostream>

int main() {
    {
        using namespace std;
        regex my_r("(.*)(6)(.*)");
        smatch m;
        if (regex_match(std::string{"123456789"}, m, my_r)) {
            std::cout << m.length(1) << ", "
                      << m.length(2) << ", "
                      << m.length(3) << std::endl;
        }
    }

    {
        using namespace boost;
        regex my_r("(.*)(6)(.*)");
        smatch m;
        if (regex_match(std::string{"123456789"}, m, my_r)) {
            std::cout << m.length(1) << ", "
                      << m.length(2) << ", "
                      << m.length(3) << std::endl;

        }
    }

    return 0;
}

If you compile with gcc, the first one (libstdc++) returns the totally wrong result 9, -2, 4 and the second one (boost's implementation) returns 5, 1, 3 as expected.

If you compile with clang + libc++, your code works fine.

(Note that libstdc++'s regex implementation is only "partially supported", as described in http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52719.)

What is the closest C++ analogue to Ruby's Rack?

6 votes

I'm a big fan of Rack, and I've used it to build several lightweight web apps over the past few years. I've been curious for a while if something similar exists for C++. I've spent quite a bit of time searching Google and come up empty-handed. It doesn't help that I find Rack hard to describe. Its tagline is "A Ruby Webserver Interface". Searching for {c++ "webserver interface"}, I've found things that do much more than I want, like wt, and I've found suggestions to use FastCGI directly. I feel like Rack fits squarely between these two options.

I'm not sure if I'm having trouble finding a C++ analogue to Rack because no such thing exists or because I'm just using poor search terms.

Is there a close C++ analogue to Rack? If not, is there a library or small set of libraries that can do most of the lower-level, error-prone stuff for me, but still leave me with the level of control that Rack does?

Here are the best options I've found so far:

  • cpp-net-lib (Thanks @Managu) - This appears to be close to what I had in mind.
  • fastcgi++ - This appears to offers lots of niceties over straight FastCGI without turning into a full framework -- so also close to what I had in mind.
  • Mongrel2 - According to Zed, "Mongrel2's protocol also tends to remove the need for any 'middleware' like WSGI or Rack since its protocol is already similar to what those do." This comes from a very different angle, but also looks like it satisfies my general criteria.

how to detect if a thread or process is getting starved due to OS scheduling

6 votes

This is on Linux OS. App is written in C++ with ACE library.

I am suspecting that one of the thread in the process is getting blocked for unusually long time(5 to 40 seconds) sometimes. The app runs fine most of the times except couple times a day it has this issue. There are other similar 5 apps running on the box which are also I/O bound due to heavy socket incoming data.

I would like to know if there is any thing I can do programatically to see if the thread/process are getting their time slice.

If a process is being starved out, self monitoring for that process would not be that productive. But, if you just want that process to notice it hasn't been run in a while, it can call times periodically and compare the relative difference in elapsed time with the relative difference in scheduled user time (you would sum the tms_utime and tms_cutime fields if you want to count waiting for children as productive time, and you would sum in the tms_stime and tms_cstime fields if you count kernel time spent on your behalf to be productive time). For thread times, the only way I know of is to consult the /proc filesystem.

A high priority external process or high priority thread could externally monitor processes (and threads) of interest by reading the appropriate /proc/<pid>/stat entries for the process (and /proc/<pid>/task/<tid>/stat for the threads). The user times are found in the 14th and 16th fields of the stat file. The system times are found in the 15th and 17th fields. (The field positions are accurate for my Linux 2.6 kernel.)

Between two time points, you determine the amount of elapsed time that has passed (a monitor process or thread would usually wake up at regular intervals). Then the difference between the cumulative processing times at each of those time points represents how much time the thread of interest got to run during that time. The ratio of processing time to elapsed time would represent the time slice.

One last bit of info: On Linux, I use the following to obtain the tid of the current thread for examining the right task in the /proc/<pid>/task/ directory:

tid = syscall(__NR_gettid);

I do this, because I could not find the gettid system call actually exported by any library on my system, even though it was documented. But, it might be available on yours.

Can I get the C++ preprocessor to send output during compilation?

6 votes

I have been debugging a particularly insidious bug which I now believe to be caused by unexpected changes which stem from different behavior when different headers are included (or not).

This is not exactly the structure of my code but let's just take a look at this scenario:

#include "Newly_created_header_which_accidentally_undefines_SOME_DEFINE.h"

// ...

#ifdef SOME_DEFINE
    code_which_i_believe_i_am_always_running();
#else 
    code_which_fails_which_i_have_forgotten_about(); // runtime error stack traces back here, but I don't know this... or maybe it's some strange linker error
#endif

I search through my git commits and narrow down the cause of the bug, compiling and running my code countless times, only to find after several hours that the only difference required for causing the bug is the inclusion of what appears to be a completely benign and unrelated header.

Perhaps this is a great argument for why the preprocessor basically just sucks.

But I like it. The preprocessor is cool because it lets us make shortcuts. It's only that some of these shortcuts, when not used carefully, bite us in the butt pretty hard.

So at this juncture it would have helped if I could use a directive like #echo "Running old crashy code" where I'll be able to see this during compilation so I could be tipped off immediately to start investigating why SOME_DEFINE was not defined.

As far as I know the straightforward way of determining if SOME_DEFINE is defined is to do something like

#ifndef SOME_DEFINE
    printf("SOME_DEFINE not defined!!\n");

This will surely get the job done but there is no good reason for this task to be performed at runtime because it is entirely determined at compile-time. This is simply something I'd like to see at compile-time.

That being said, in this situation, using the print (or log or even throwing an exception) may be an acceptable thing to do because I won't really care about slowing down or cluttering up the questionable code. But that doesn't apply if I have for instance two code paths both of which are important, and I just want to know at compile-time which one is being activated. I'd have to worry about running the code that does the preprocessor-conditioned print at the beginning of the program.

This is really just a long-winded way of asking the question, "Can I echo a string to the output during compilation by using a preprocessor directive?"

Answer more in line with what I was looking for is here: http://stackoverflow.com/a/3826876/340947

Sorry @sarnold

How to create a NSAutoreleasePool without Objective-C?

6 votes

I have multiplatform game written in C++. In the mac version, even though I do not have any obj-c code, one of the libraries I use seems to be auto-releasing stuff, and I get memory leaks for that, since I did not create a NSAutoreleasePool.

What I want is to be able to create (and destroy) a NSAutoreleasePool without using obj-c code, so I don't need to create a .m file, and change my build scripts just for that. Is that possible? How can that be done?

OBS: Tagged C and C++, because a solution in any of those languages will do.

You can't avoid instantiating the Objective-C runtime—but apparently you've already got one of those.

If you want to interact with the runtime from C, you can us the Objective-C runtime APIs, as documented in Objective-C Runtime Programming Guide and Objective-C Runtime Reference.

The idea is something like this (untested):

#include <objc/runtime.h>
#include <objc/objc-runtime.h>
id allocAndInitAutoreleasePool() {
  Class NSAutoreleasePoolClass = objc_getClass("NSAutoreleasePool");
  id pool = class_createInstance(NSAutoreleasePoolClass, 0);
  return objc_msgSend(pool, "init");
}
void drainAutoreleasePool(id pool) {
  (void)objc_msgSend(pool, "drain");
}

If you want to call these functions from another file, of course you'll have to include objc/runtime.h there as well. Or, alternatively, you can cast the id to void* in the return from the allocAndInit function, and take a void* and cast back to id in the drain function. (You could also forward-declare struct objc_object and typedef struct objc_object *id, but I believe that's not actually guaranteed to be the right definition.)

You shouldn't have to pass -lobjc in your link command.

Needless to say, it's probably less work to just make your build scripts handle .m files.