Best optimization questions in April 2012

Why does GCC generate such radically different assembly for nearly the same C code?

89 votes

While writing an optimized ftol function I found some very odd behaviour in GCC 4.6.1. Let me show you the code first (for clarity I marked the differences):

fast_trunc_one, C:

int fast_trunc_one(int i) {
    int mantissa, exponent, sign, r;

    mantissa = (i & 0x07fffff) | 0x800000;
    exponent = 150 - ((i >> 23) & 0xff);
    sign = i & 0x80000000;

    if (exponent < 0) {
        r = mantissa << -exponent;                       /* diff */
    } else {
        r = mantissa >> exponent;                        /* diff */
    }

    return (r ^ -sign) + sign;                           /* diff */
}

fast_trunc_two, C:

int fast_trunc_two(int i) {
    int mantissa, exponent, sign, r;

    mantissa = (i & 0x07fffff) | 0x800000;
    exponent = 150 - ((i >> 23) & 0xff);
    sign = i & 0x80000000;

    if (exponent < 0) {
        r = (mantissa << -exponent) ^ -sign;             /* diff */
    } else {
        r = (mantissa >> exponent) ^ -sign;              /* diff */
    }

    return r + sign;                                     /* diff */
}

Seems the same right? Well GCC disagrees. After compiling with gcc -O3 -S -Wall test.s test.c this is the assembly output:

fast_trunc_one, generated:

_fast_trunc_one:
LFB0:
    .cfi_startproc
    movl    4(%esp), %eax
    movl    $150, %ecx
    movl    %eax, %edx
    andl    $8388607, %edx
    sarl    $23, %eax
    orl $8388608, %edx
    andl    $255, %eax
    subl    %eax, %ecx
    movl    %edx, %eax
    sarl    %cl, %eax
    testl   %ecx, %ecx
    js  L5
    rep
    ret
    .p2align 4,,7
L5:
    negl    %ecx
    movl    %edx, %eax
    sall    %cl, %eax
    ret
    .cfi_endproc

fast_trunc_two, generated:

_fast_trunc_two:
LFB1:
    .cfi_startproc
    pushl   %ebx
    .cfi_def_cfa_offset 8
    .cfi_offset 3, -8
    movl    8(%esp), %eax
    movl    $150, %ecx
    movl    %eax, %ebx
    movl    %eax, %edx
    sarl    $23, %ebx
    andl    $8388607, %edx
    andl    $255, %ebx
    orl $8388608, %edx
    andl    $-2147483648, %eax
    subl    %ebx, %ecx
    js  L9
    sarl    %cl, %edx
    movl    %eax, %ecx
    negl    %ecx
    xorl    %ecx, %edx
    addl    %edx, %eax
    popl    %ebx
    .cfi_remember_state
    .cfi_def_cfa_offset 4
    .cfi_restore 3
    ret
    .p2align 4,,7
L9:
    .cfi_restore_state
    negl    %ecx
    sall    %cl, %edx
    movl    %eax, %ecx
    negl    %ecx
    xorl    %ecx, %edx
    addl    %edx, %eax
    popl    %ebx
    .cfi_restore 3
    .cfi_def_cfa_offset 4
    ret
    .cfi_endproc

That's an extreme difference. This actually shows up on the profile too, fast_trunc_one is around 30% faster than fast_trunc_two. Now my question: what is causing this?

Updated to sync with the OP's edit

By tinkering with the code, I've managed to see how GCC optimizes the first case.

Before we can understand why they are so different, first we must understand how GCC optimizes fast_trunc_one().

Believe it or not, fast_trunc_one() is being optimized to this:

int fast_trunc_one(int i) {
    int mantissa, exponent;

    mantissa = (i & 0x07fffff) | 0x800000;
    exponent = 150 - ((i >> 23) & 0xff);

    if (exponent < 0) {
        return (mantissa << -exponent);             /* diff */
    } else {
        return (mantissa >> exponent);              /* diff */
    }
}

This produces the exact same assembly as the original fast_trunc_one() - register names and everything.

Notice that there are no xors in the assembly for fast_trunc_one(). That's what gave it away for me.


How so?


Step 1: sign = -sign

First, let's take a look at the sign variable. Since sign = i & 0x80000000;, there are only two possible values that sign can take:

  • sign = 0
  • sign = 0x80000000

Now recognize that in both cases, sign == -sign. Therefore, when I change the original code to this:

int fast_trunc_one(int i) {
    int mantissa, exponent, sign, r;

    mantissa = (i & 0x07fffff) | 0x800000;
    exponent = 150 - ((i >> 23) & 0xff);
    sign = i & 0x80000000;

    if (exponent < 0) {
        r = mantissa << -exponent;
    } else {
        r = mantissa >> exponent;
    }

    return (r ^ sign) + sign;
}

It produces the exact same assembly as the original fast_trunc_one(). I'll spare you the assembly, but it is identical - register names and all.


Step 2: Mathematical reduction: x + (y ^ x) = y

sign can only take one of two values, 0 or 0x80000000.

  • When x = 0, then x + (y ^ x) = y then trivial holds.
  • Adding and xoring by 0x80000000 is the same. It flips the sign bit. Therefore x + (y ^ x) = y also holds when x = 0x80000000.

Therefore, x + (y ^ x) reduces to y. And the code simplifies to this:

int fast_trunc_one(int i) {
    int mantissa, exponent, sign, r;

    mantissa = (i & 0x07fffff) | 0x800000;
    exponent = 150 - ((i >> 23) & 0xff);
    sign = i & 0x80000000;

    if (exponent < 0) {
        r = (mantissa << -exponent);
    } else {
        r = (mantissa >> exponent);
    }

    return r;
}

Again, this compiles to the exact same assembly - register names and all.


This above version finally reduces to this:

int fast_trunc_one(int i) {
    int mantissa, exponent;

    mantissa = (i & 0x07fffff) | 0x800000;
    exponent = 150 - ((i >> 23) & 0xff);

    if (exponent < 0) {
        return (mantissa << -exponent);             /* diff */
    } else {
        return (mantissa >> exponent);              /* diff */
    }
}

which is pretty much exactly what GCC generates in the assembly.


So why doesn't the compiler optimize fast_trunc_two() to the same thing?

The key part in fast_trunc_one() is the x + (y ^ x) = y optimization. In fast_trunc_two() the x + (y ^ x) expression is being split across the branch.

I suspect that might be enough to confuse GCC to not make this optimization. (It would need to hoist the ^ -sign out of the branch and merge it into the r + sign at the end.)

For example, this produces the same assembly as fast_trunc_one():

int fast_trunc_two(int i) {
    int mantissa, exponent, sign, r;

    mantissa = (i & 0x07fffff) | 0x800000;
    exponent = 150 - ((i >> 23) & 0xff);
    sign = i & 0x80000000;

    if (exponent < 0) {
        r = ((mantissa << -exponent) ^ -sign) + sign;             /* diff */
    } else {
        r = ((mantissa >> exponent) ^ -sign) + sign;              /* diff */
    }

    return r;                                     /* diff */
}

c++: how to optimize IO?

17 votes

I am working on a mathematical problem that has the advantage of being able to "pre-compute" about half of the problem, save this information to file, and then reuse it many times to compute various 'instances' of my problem. The difficulty is that uploading all of this information in order to solve the actual problem is a major bottleneck.

More specifically: I can pre-compute a huge amount of information - tons of probabilities (long double), a ton of std::map<int,int>, and much more - and save all this stuff to disk (several Gb).

The second half of my program accepts an input argument D. For each D, I need to perform a great many computations that involve a combination of the pre-computed data (from file), and some other data that are specific to D (so that the problem is different for each D).

Sometimes I will need to pick out certain pieces of pre-computed information from the files. Other times, I will need to upload every piece of data from a (large) file.

Are there any strategies for making the IO faster?

I already have the program parallelized (MPI, via boost::mpi) for other reasons, but regardless, accessing files on the disk is making my compute time unbearable.

Any strategies or optimizations?

Currently I am doing everything with cstdio, i.e. no iostream. Will that make a big difference?

The stuff that isn't in a map is easy. You put everything in one contiguous chunk of memory that you know (like a big array, or a struct/class with no pointers), and then use write() to write it out. Later use read() to read it in, in a single operation. If the size might vary, then use one operation to read a single int with the size, allocate the memory, and then use a single read() to pull it in.

The map part is a bit harder, since you can't do it all in one operation. Here you need to come up with a convention for serializing it. To make the i/o as fast as possible, your best bet is to convert it from the map to an in-memory form that is all in one place and you can convert back to the map easily and quickly. If, for example your keys are ints, and your values are of constant size then you could make an array of keys, and an array of values, copy your keys into the one array and values into the other, and then write() the two arrays, possibly writing out their size as well. Again, you read things in with only two or three calls to read().

Note that nothing ever got translated to ASCII, and there are a minimum number of system calls. The file will not be human readable, but it will be compact, and fast to read in. Three things make i/o slow: 1) system calls, if you use small reads/writes; 2) translation to/from ASCII (printf, scanf); 3) disk speed. Hard to do much about 3) (other than an SSD). You can do the read in a background thread, but you might need to block waiting for the data to be in.

How do you count cardinality of very large datasets efficiently in Python?

12 votes

I have been playing at work with some very very large sets of data, typically several billions of elements, that are all maintained in a memcached cloud and periodically dumped into files, and for one of my tasks I'm trying to count the cardinality of this set.

For some context, each item contains an IP and some other attributes identifying a person and is encoded in base64, item size is 20 bytes. Reducing the size of an item by removing some fields is not a possibility.

Here is something that emulates my dataset as an in-memory version (thanks to this post for string generation):

import base64, os

dataset_size = 10000000000 # that's 10 billion, be careful if you run it !
big_dataset = [base64.b64encode(os.urandom(10)) for i in range(dataset_size)]

My first approach was to use a hashset like this:

uniques = set(big_dataset)
print "Cardinality: %d" % len(uniques)

While this in theory works fine on a small dataset, as you can guess there is a hiccup:

  • I can't make any assumption on the uniqueness of my data. I could have 50% of my dataset that is unique, or I could have 100% just as well. This is generated dynamically at regular time intervals and varies depending on a lot of factors (time of day for example)
  • Dataset size in 10 billion. Each item encoded in base 64 is 20 bytes, times 10 billion is a few hundredids gigabytes in average. Unfortunately, I don't have access to a machine with that much RAM !

I've done my homework and found at best some research papers, or some obscure libraries, but part of the goal of this is to understand what approach works and why.

So I'm calling to you Python users, do you know of any algorithm that would help me estimate cardinality efficiently? By complexity I mean I don't care that much about running time complexity, but I'm more focused about space complexity. I don't mind sacrificing a bit of accuracy if it boosts performance tremendously (so I don't necessarily need to know the exact number of uniques, even if that would be ideal, but probably not a viable approach). I would say up to 5% would be acceptable. I'm looking for something specifically in Python for this project.

Thanks for any help you can provide !

As some people noted, I could use Hadoop/MR, but for this specific projects we don't want to go the MR way, and would like to explore algorithms to do this on a single machine efficiently, as this could be applied to a few other different projects.

I would recommend the usage of Hash Sketches, namely (Super)Log Log sketches or Hyper Log Sketches.

You can check and perhaps use and improve the simple python implementation that I made: https://github.com/goncalvesnelson/Log-Log-Sketch

How to prepare Django for a possible slashdotting?

4 votes

I would like to prepare my website for a possible influx in traffic. This is my first time using Django as a framework, so I'm unsure of the modifications that should be made to assure that I'm ready and won't go down. What are some of the common things one can do to prepare a Django website for production-level traffic?

I'm also wondering what to expect in terms of traffic numbers. I'm currently hosted at Webfaction with 600GB/month of traffic. Will this quickly run out? Are there statistics on how big 'slashdotted' events are?

  1. Use memcache and caching middleware.
  2. Be sure to offload serving statics.
  3. Use CDN for statics. This doesn't directly affect Django, but will reduce your network traffic.

Anything beyond that — read up what others are using: