Best performance questions in June 2012

Why is processing a sorted array faster than an unsorted array?

1088 votes

Here is a piece of code that shows some very peculiar performance. For some strange reason, sorting the data miraculously speeds up the code by almost 6x:

#include <algorithm>
#include <ctime>
#include <iostream>

int main()
{
    // generate data
    const unsigned arraySize = 32768;
    int data[arraySize];

    for (unsigned c = 0; c < arraySize; ++c)
        data[c] = std::rand() % 256;


    // !!! with this, the next loop runs faster
    std::sort(data, data + arraySize);


    // test
    clock_t start = clock();
    long long sum = 0;

    for (unsigned i = 0; i < 100000; ++i)
    {
        // primary loop
        for (unsigned c = 0; c < arraySize; ++c)
        {
            if (data[c] >= 128)
                sum += data[c];
        }
    }

    double elapsedTime = static_cast<double>(clock() - start) / CLOCKS_PER_SEC;

    std::cout << elapsedTime << std::endl;
    std::cout << "sum = " << sum << std::endl;
}
  • Without std::sort(data, data + arraySize);, the code runs in 11.54 seconds.
  • With the sorted data, the code runs in 1.93 seconds.

Initially I thought this might be just a language or compiler anomaly. So I tried it Java...

import java.util.Arrays;
import java.util.Random;

public class Main
{
    public static void main(String[] args)
    {
        // generate data
        int arraySize = 32768;
        int data[] = new int[arraySize];

        Random rnd = new Random(0);
        for (int c = 0; c < arraySize; ++c)
            data[c] = rnd.nextInt() % 256;


        // !!! with this, the next loop runs faster
        Arrays.sort(data);


        // test
        long start = System.nanoTime();
        long sum = 0;

        for (int i = 0; i < 100000; ++i)
        {
            // primary loop
            for (int c = 0; c < arraySize; ++c)
            {
                if (data[c] >= 128)
                    sum += data[c];
            }
        }

        System.out.println((System.nanoTime() - start) / 1000000000.0);
        System.out.println("sum = " + sum);
    }
}

with a similar but less extreme result.


My first thought was that sorting brings the data into cache, but my next thought was how silly that is because the array was just generated.

What is going on? Why is a sorted array faster than an unsorted array? The code is summing up some independent terms, the order should not matter.

You are the victim of branch prediction fail.


What is Branch Prediction?

Consider a railroad junction:

enter image description here Image by Mecanismo, from Wikimedia Commons: http://commons.wikimedia.org/wiki/File:Entroncamento_do_Transpraia.JPG

Now for the sake of argument, suppose this is back in the 1800s - before long distance or radio communication.

You are the operator of a junction and you hear a train coming. You have no idea which way it will go. You stop the train to ask the captain which direction he wants. And then you set the switch appropriately.

Trains are heavy and have a lot of momentum. So they take forever to start up and slow down.

Is there a better way? You guess which direction the train will go!

  • If you guessed right, it continues on.
  • If you guessed wrong, the captain will stop, back up, and yell at you to flip the switch. Then it can restart down the other path.

If you guess right every time, the train will never have to stop.
If you guess wrong too often, the train will spend a lot of time stopping, backing up, and restarting.


Consider an if-statement: At the processor level, it is a branch instruction:

enter image description here

You are a processor and you see a branch. You have no idea which way it will go. What do you do? You halt execution and wait until the previous instructions are complete. Then you continue down the correct path.

Modern processors are complicated and have long pipelines. So they take forever to "warm up" and "slow down".

Is there a better way? You guess which direction the branch will go!

  • If you guessed right, you continue executing.
  • If you guessed wrong, you need to flush the pipeline and roll back to the branch. Then you can restart down the other path.

If you guess right every time, the execution will never have to stop.
If you guess wrong too often, you spend a lot of time stalling, rolling back, and restarting.


This is branch prediction. I admit it's not the best analogy since the train could just signal the direction with a flag. But in computers, the processor doesn't know which direction a branch will go until the last moment.

So how would you strategically guess to minimize the number of times that the train must back up and go down the other path? You look at the past history! If the train goes left 99% of the time, then you guess left. If it alternates, then you alternate your guesses. If it goes one way every 3 times, you guess the same...

In other words, you try to identify a pattern and follow it. This is more or less how branch predictors work.

Most applications have well-behaved branches. So modern branch predictors will typically achieve >90% hit rates. But when faced with unpredictable branches with no recognizable patterns, branch predictors are virtually useless.

Further Reading: http://en.wikipedia.org/wiki/Branch_predictor


As hinted from above, the culprit is this if-statement:

if (data[c] >= 128)
    sum += data[c];

Notice that the data is evenly distributed between 0 and 255. When the data is sorted, roughly the first half of the iterations will not enter the if-statement. After that, they will all enter the if-statement.

This is very friendly to the branch predictor since the branch consecutively goes the same direction many times. Even a simple saturating counter will correctly predict the branch except for the few iterations after it switches direction.

Quick visualization:

T = branch taken
N = branch not taken

data[] = 0, 1, 2, 3, 4, ... 126, 127, 128, 129, 130, ... 250, 251, 252, ...
branch = N  N  N  N  N  ...   N    N    T    T    T  ...   T    T    T  ...

       = NNNNNNNNNNNN ... NNNNNNNTTTTTTTTT ... TTTTTTTTTT  (easy to predict)

However, when the data is completely random, the branch predictor is rendered useless because it can't predict random data. Thus there will probably be around 50% misprediction. (no better than random guessing)

data[] = 226, 185, 125, 158, 198, 144, 217, 79, 202, 118,  14, 150, 177, 182, 133, ...
branch =   T,   T,   N,   T,   T,   T,   T,  N,   T,   N,   N,   T,   T,   T,   N  ...

       = TTNTTTTNTNNTTTN ...   (completely random - hard to predict)

So what can be done?

If the compiler isn't able to optimize the branch into a conditional move, you can try some hacks if you are willing to sacrifice readability for performance.

Replace:

if (data[c] >= 128)
    sum += data[c];

with:

int t = (data[c] - 128) >> 31;
sum += ~t & data[c];

This eliminates the branch and replaces it with some bitwise operations.

(Note that this hack is not strictly equivalent to the original if-statement. But in this case, it's valid for all the input values of data[].)

Benchmarks: Core i7 920 @ 3.5 GHz

C++ - Visual Studio 2010 - x64 Release

//  Branch - Random
seconds = 11.777

//  Branch - Sorted
seconds = 2.352

//  Branchless - Random
seconds = 2.564

//  Branchless - Sorted
seconds = 2.587

Java - Netbeans 7.1.1 JDK 7 - x64

//  Branch - Random
seconds = 10.93293813

//  Branch - Sorted
seconds = 5.643797077

//  Branchless - Random
seconds = 3.113581453

//  Branchless - Sorted
seconds = 3.186068823

Observations:

  • With the Branch: There is a huge difference between the sorted and unsorted data.
  • With the Hack: There is no difference between sorted and unsorted data.
  • In the C++ case, the hack is actually a tad slower than with the branch when the data is sorted.

A general rule of thumb is to avoid data-dependent branching in critical loops. (such as in this example)


Update :

  • GCC 4.6.1 with -O3 or -ftree-vectorize on x64 is able to generate a conditional move. So there is no difference between the sorted and unsorted data - both are fast.

  • VC++ 2010 is unable to generate conditional moves for this branch even under /Ox.

  • Intel Compiler 11 does something miraculous. It interchanges the two loops, thereby hoisting the unpredictable branch to the outer loop. So not only is it immune the mispredictions, it is also twice as fast as whatever VC++ and GCC can generate! In other words, ICC took advantage of the test-loop to defeat the benchmark...

  • If you give the Intel Compiler the branchless code, it just out-right vectorizes it... and is just as fast as with the branch (with the loop interchange).

This goes to show the even the mature modern compilers can vary wildly in their ability to optimize code...

Performance difference: condition placed at INNER JOIN vs WHERE clause

10 votes

Say I have a table order as

id | clientid | type | amount | itemid | date
---|----------|------|--------|--------|-----------
23 | 258      | B    | 150    | 14     | 2012-04-03
24 | 258      | S    | 69     | 14     | 2012-04-03
25 | 301      | S    | 10     | 20     | 2012-04-03
26 | 327      | B    | 54     | 156    | 2012-04-04
  • clientid is a foreign-key back to the client table
  • itemid is a foreign key back to an item table
  • type is only B or S
  • amount is an integer

and a table processed as

id | orderid | processed | date
---|---------|-----------|---------
41 | 23      | true      | 2012-04-03
42 | 24      | true      | 2012-04-03
43 | 25      | false     | <NULL>
44 | 26      | true      | 2012-04-05     

I need to get all the rows from order that for the same clientid on the same date have opposing type values. Keep in mind type can only have one of two values - B or S. In the example above this would be rows 23 and 24.

The other constraint is that the corresponding row in processed must be true for the orderid.

My query so far

SELECT c1.clientid,
       c1.date,
       c1.type,
       c1.itemid,
       c1.amount,
       c2.date,
       c2.type,
       c2.itemid,
       c2.amount

FROM   order c1
INNER JOIN order c2 ON c1.itemid    =  c2.itemid AND
                       c1.date      =  c2.date   AND
                       c1.clientid  =  c2.clientid AND
                       c1.type     <>  c2.type AND
                       c1.id        <  c2.id

INNER JOIN processed p1 ON p1.orderid   =  c1.id AND
                         p1.processed =  true
INNER JOIN processed p2 ON p2.orderid   =  c2.id AND
                         p2.processed =  true

QUESTION: Keeping the processed = true as part of the join clause is slowing the query down. If I move it to the WHERE clause then the performance is much better. This has piqued my interest and I'd like to know why.

The primary keys and respective foreign key columns are indexed while the value columns (value, processed etc) aren't.

Disclaimer: I have inherited this DB structure and the performance difference is roughly 6 seconds.

The reason that you're seeing a difference is due to the execution plan that the planner is putting together, this is obviously different depending on the query (arguably, it should be optimising the 2 queries to be the same and this may be a bug). This means that the planner thinks it has to work in a particular way to get to the result in each statement.

When you do it within the JOIN, the planner will probably have to select from the table, filter by the "True" part, then join the result sets. I would imagine this is a large table, and therefore a lot of data to look through, and it can't use the indexes as efficiently.

I suspect that if you do it in a WHERE clause, the planner is choosing a route that is more efficient (ie. either index based, or pre filtered dataset).

You could probably make the join work as fast (if not faster) by adding an index on the two columns (not sure if included columns and multiple column indexes are supported on Postgres yet).

In short, the planner is the problem it is choosing 2 different routes to get to the result sets, and one of those is not as efficient as the other. It's impossible for us to know what the reasons are without the full table information and the EXPLAIN ANALYZE information.

If you want specifics on why your specific query is doing this, you'll need to provide more information. However the reason is the planner choosing different routes.

Additional Reading Material:

http://www.postgresql.org/docs/current/static/explicit-joins.html

Just skimmed, seems that the postgres planner doesn't re-order joins to optimise it. try changing the order of the joins in your statement to see if you then get the same performance... just a thought.

Consider the following two EXPLAINs:

EXPLAIN SELECT * FROM sales WHERE title != 'The'

id  select_type table   type    possible_keys   key key_len ref rows    Extra
1   SIMPLE      sales   ALL      title        NULL  NULL    NULL    41707   Using where

And -

EXPLAIN SELECT * FROM sales WHERE title = 'The'
id  select_type table   type    possible_keys   key key_len ref rows    Extra
1   SIMPLE      sales   ref      title         title    767 const   1   Using where 

Why does the != query have a NULL key? Why doesn't it use title? What causes a = statement to be able to utilize an index but not a !=?

There is no point on using the index unless title is exactly 'The' very frequently.

Since almost every row needs to be selected you don't gain anything from using an index. It can actually be costly to use an index, which is probably what your MySQL engine is determining, so it is opting not to use the index.

Compare the amount of work done in these two situations:

Using the index:

1) Read the entire index tree into memory.
2) Search the index tree for the value 'The' and filter out those entries.
3) Read every row except for the few exceptions (which probably are in the same blocks on the disk as rows that do need to be read, so really the whole table is likely to be read in) from the table into memory.

Without the index:

1) Read every row into memory and while reading them filter out any where title = 'The' from the result set

How To Slow Down A SQL Query?

7 votes

As strange as it sounds I need to slow down a SQL query. Currently I'm using Microsoft SQL Server 2008 R2 on an in-house development server with the AdventureWorks database. I'm in the process of testing some code and the queries that I'm running are too fast no matter what I try!

Basically I'm testing a cut-off feature and need a sufficiently long query to be able to cut it off before it completes.

Unfortunately as it is a local installation there isn't a single query or large enough table in the AdventureWorks database to actually give me good data to work with. I've tried

WAITFOR DELAY '01:00'

Which worked great to just test to make sure it was working, but now I need to test to see if I can cut the data stream off mid-read. The WAITFOR statement doesn't do me justice in that respect because I need it to actively be retrieving data back from the server. My first intuition was to use convoluted calculations to slow it down, however even having SQL server multiply all the numerical values in the query by themselves 37 times only slowed down the query by milliseconds. The second thing I tried was embedding the WAITFOR statement in a sub-query but it appears you can't do that. Finally, the only thing I haven't tried is to execute multiple stored procedures and WAITFOR in between them, but I don't think that would work for what I need.

I have to say, I'm impressed at how hard it is to make an absolutely terrible query when you're this close to the server.

Is there any way I can slow down a query easily?

Thank you!

Just do a load of cross joins.

SELECT T1.*
FROM SomeTable T1,  
     SomeTable T2,  
     SomeTable T3,  
     SomeTable T4

For a 1,000 row table that will generate 1,000 billion rows which should be plenty slow enough.

Main table with hundreds vs few smaller

5 votes

I was wondering which approach is better for designing databases?

I have currently one big table (97 columns per row) with references to lookup tables where I could.

Wouldn't it be better for performance to group some columns into smaller tables and add them key columns for referencing one whole row?

If you split up your table into several parts, you'll need additional joins to get all your columns for a single row - that will cost you time.

97 columns isn't much, really - I've seen way beyond 100.

It all depends on how your data is being used - if your row just has 97 columns, all the time, and needs to 97 columns - then it really hardly ever makes sense to split those up into various tables.

It might make sense if:

  • you can move some "large" columns (like XML, VARCHAR(MAX) etc.) into a separate table, if you don't need those all the time -> in that case, your "basic" row becomes smaller and your basic table will perform better - as long as you don't need those extra large column

  • you can move away some columns to a separate table that aren't always present, e.g. columns that might be "optional" and only present for e.g. 20% of the rows - in that case, you might save yourself some processing for the remaining 80% of the cases where those columns aren't needed.

Python : How do you find the CPU consumption for a piece of code?

4 votes

Background:

I have a django application, it works and responds pretty well on low load, but on high load like 100 users/sec, it consumes 100% CPU and then due to lack of CPU slows down.

Problem :

  • Profiling the application gives me time taken by functions.
  • This time increases on high load.
  • Time consumed may be due to complex calculation or for waiting for CPU.

so, how to find the CPU cycles consumed by a piece of code ?

Since, reducing the CPU consumption will increase the response time.

  • I might have written extremely efficient code and need to add more CPU power

OR

  • I might have some stupid code taking the CPU and causing the slow down ?

Any help is appreciated !

Update:

  • I am using Jmeter to profile my webapp, it gives me a throughput of 2 requests/sec. [ 100 users]
  • I get a average time of 36 seconds on 100 request vs 1.25 sec time on 1 request.

More Info

  • Configuration Nginx + Uwsgi with 4 workers
  • No database used, using a responses from a REST API
  • On 1st hit the response of REST API gets cached, therefore doesn't makes a difference.
  • Using ujson for json parsing.

Curious to Know:

  • Python-Django is used by so many orgs for so many big sites, then there must be some high end Debug / Memory-CPU analysis tools.
  • All those I found were casual snippets of code that perform profiling.

You could try configuring your test to ramp up slowly, slow enough so that you can see the CPU gradually increase and then run the profiler before you hit high CPU. There's no point trying to profile code when the CPU is maxed out because at this point everything will be slow. In fact, you really only need a relatively light load to get useful data from a profiler.

Also, by gradually increasing the load you will be better able to see if there is a gradual increase in CPU (suggesting a CPU bottleneck) or if there is a sudden jump in CPU (suggesting perhaps another type of problem, one that would not necessarily be addressed by more CPU).

Try using something like a Cosntant Throughput Timer to pace the requests, this will prevent JMeter getting carried away and over-loading the system.