Best performance questions in April 2012

32 votes

I have a multiply-add kernel inside my application and I want to increase its performance.

I use an Intel Core i7-960 (3.2 GHz clock) and have already manually implemented the kernel using SSE intrinsics as follows:

 for(int i=0; i<iterations; i+=4) {
    y1 = _mm_set_ss(output[i]);
    y2 = _mm_set_ss(output[i+1]);
    y3 = _mm_set_ss(output[i+2]);
    y4 = _mm_set_ss(output[i+3]);

    for(k=0; k<ksize; k++){
        for(l=0; l<ksize; l++){
            w  = _mm_set_ss(weight[i+k+l]);

            x1 = _mm_set_ss(input[i+k+l]);
            y1 = _mm_add_ss(y1,_mm_mul_ss(w,x1));
            …
            x4 = _mm_set_ss(input[i+k+l+3]);
            y4 = _mm_add_ss(y4,_mm_mul_ss(w,x4));
        }
    }
    _mm_store_ss(&output[i],y1);
    _mm_store_ss(&output[i+1],y2);
    _mm_store_ss(&output[i+2],y3);
    _mm_store_ss(&output[i+3],y4);
 }

I know I can use packed fp vectors to increase the performance and I already did so succesfully, but I want to know why the single scalar code isn't able to meet the processor's peak performance.

The performance of this kernel on my machine is ~1.6 FP operations per cycle, while the maximum would be 2 FP operations per cycle (since FP add + FP mul can be executed in parallel).

If I'm correct from studying the generated assembly code, the ideal schedule would look like follows, where the mov instruction takes 3 cycles, the switch latency from the load domain to the FP domain for the dependent instructions takes 2 cycles, the FP multiply takes 4 cycles and the FP add takes 3 cycles. (Note that the dependence from the multiply -> add doesn't incur any switch latency because the operations belong to the same domain).

schedule

According to the measured performance (~80% of the maximum theoretical performance) there is an overhead of ~3 instructions per 8 cycles.

I am trying to either:

  • get rid of this overhead, or
  • explain where it comes from

Of course there is the problem with cache misses & data misalignment which can increase the latency of the move instructions, but are there any other factors that could play a role here? Like register read stalls or something?

I hope my problem is clear, thanks in advance for your responses!


Update: The assembly of the inner-loop looks as follows:

...
Block 21: 
  movssl  (%rsi,%rdi,4), %xmm4 
  movssl  (%rcx,%rdi,4), %xmm0 
  movssl  0x4(%rcx,%rdi,4), %xmm1 
  movssl  0x8(%rcx,%rdi,4), %xmm2 
  movssl  0xc(%rcx,%rdi,4), %xmm3 
  inc %rdi 
  mulss %xmm4, %xmm0 
  cmp $0x32, %rdi 
  mulss %xmm4, %xmm1 
  mulss %xmm4, %xmm2 
  mulss %xmm3, %xmm4 
  addss %xmm0, %xmm5 
  addss %xmm1, %xmm6 
  addss %xmm2, %xmm7 
  addss %xmm4, %xmm8 
  jl 0x401b52 <Block 21> 
...

I noticed in the comments that:

  • The loop takes 5 cycles to execute.
  • It's "supposed" to take 4 cycles. (since there's 4 adds and 4 mulitplies)

However, your assembly shows 5 SSE movssl instructions. According to Agner Fog's tables all floating-point SSE move instructions are at least 1 inst/cycle reciprocal throughput for Nehalem.

Since you have 5 of them, you can't do better than 5 cycles/iteration.


So in order to get to peak performance, you need to reduce the # of loads that you have. How you can do that I can't see immediately this particular case - but it might be possible.

One common approach is to use tiling. Where you add nesting levels to improve locality. Although it's used mostly for improving cache access, it can also be used in registers to reduce the # of load/stores that are needed.

Ultimately, your goal is to reduce the number of loads to be less than the numbers of add/muls. So this might be the way to go.

Fast calculation of min, max, and average of incoming numbers

24 votes

Program is receiving approximately 50,000 numbers every second.

At ANY given moment, I need to calculate minimum, maximum and average of the values (numbers) that arrived in the last second (regarding to given moment).

Is there a way to do this without using array or list (buffer) to store arriving numbers and to calculate results?

If I need to use buffer, what would be the efficient way to achieve this?

(Note that numbers from buffer also must be efficiently removed from time to time)

Here is an algorithm that will somewhat work to save efficiency in certain cases:

  1. As events come in, buffer them completely, and calculate a running sum, count, min, max (trivial).

  2. When a request for average, min, or max is made, loop through from the back of the buffer and start removing values older than one second. Subtract from sum and count as you go.

    • If the values are all above min you can keep your min. If the values are below max, you can keep your max. In this scenario, you have average, min, and max updated efficiently.

    • If the values are below min or above max, you'll need to loop through the rest of the array and recalculate it.

  3. Do step two once a second or so also so that the buffer doesn't get too full. This code could be performed on every buffer insert also, or wherever made sense.

Best structure for this kind of work is a circular buffer, to avoid memory allocations and GC getting in the way. It should be large enough to cover the worst case scenario for message size per second.

Updates

@ja72 provided a great idea to save on finding the min and max values if they are invalidated:

Instead of keeping the min/max values x_min, x_max keep instead the index of where they are located in the x[i] array with i_min and i_max. Then finding them can be trivial sometimes, but when the last value considered holds the min and max, the entire list needs to be scanned to establish the new limits.


Sam Holder had another good idea in the comments - keep a parallel array that is always sorted, this lets you lop numbers off the top or bottom to find new minimums and maximums easier. However, insert speed here is compromised a little (needs to remain in order).


Ultimately, the right choice will depend on the usage characteristics of the program. How often will values be read vs how often they are inserted?

Why is creating a class in Python so much slower than instantiating a class?

17 votes

I found that creation of a class is way slower than instantiation of a class.

>>> from timeit import Timer as T
>>> def calc(n):
...     return T("class Haha(object): pass").timeit(n)

<<After several these 'calc' things, at least one of them have a big number, eg. 100000>>

>>> calc(9000)
15.947055101394653
>>> calc(9000)
17.39099097251892
>>> calc(9000)
18.824054956436157
>>> calc(9000)
20.33335590362549

Yeah, create 9000 classes took 16 secs, and becomes even slower in the subsequent calls.

And this:

>>> T("type('Haha', b, d)", "b = (object, ); d = {}").timeit(9000)

gives similar results.

But instantiation don't suffer:

>>> T("Haha()", "class Haha(object): pass").timeit(5000000)
0.8786070346832275

5000000 instances in less than one sec.

What makes the creation this expensive?

And why the creation process become slower?

EDIT:

How to reproduce:

start a fresh python process, the initial several "calc(10000)"s give a number of 0.5 on my machine. And try some bigger values, calc(100000), it can't end in even 10secs, interrupt it, and calc(10000), gives a 15sec.

EDIT:

Additional fact:

If you gc.collect() after 'calc' becomes slow, you can get the 'normal' speed at beginning, but the timing will increasing in subsequent calls

>>> from a import calc
>>> calc(10000)
0.4673938751220703
>>> calc(10000)
0.4300072193145752
>>> calc(10000)
0.4270968437194824
>>> calc(10000)
0.42754602432250977
>>> calc(10000)
0.4344758987426758
>>> calc(100000)
^CTraceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "a.py", line 3, in calc
    return T("class Haha(object): pass").timeit(n)
  File "/usr/lib/python2.7/timeit.py", line 194, in timeit
    timing = self.inner(it, self.timer)
  File "<timeit-src>", line 6, in inner
KeyboardInterrupt
>>> import gc
>>> gc.collect()
234204
>>> calc(10000)
0.4237039089202881
>>> calc(10000)
1.5998330116271973
>>> calc(10000)
4.136359930038452
>>> calc(10000)
6.625348806381226

This might give you the intuition:

>>> class Haha(object): pass
...
>>> sys.getsizeof(Haha)
904
>>> sys.getsizeof(Haha())
64

Class object is much more complex and expensive structure that an instance of that class.

Can some explain the performance behavior of the following memory allocating C program?

14 votes

On my machine Time A and Time B swap depending on whether A is defined or not (which changes the order in which the two callocs are called).

I initially attributed this to the paging system. Weirdly, when mmap is used instead of calloc, the situation is even more bizzare -- both the loops take the same amount of time, as expected. As can be seen with strace, the callocs ultimately result in two mmaps, so there is no return-already-allocated-memory magic going on.

I'm running Debian testing on an Intel i7.

#include <stdlib.h>
#include <stdio.h>
#include <sys/mman.h>

#include <time.h>

#define SIZE 500002816

#ifndef USE_MMAP
#define ALLOC calloc
#else
#define ALLOC(a, b) (mmap(NULL, a * b, PROT_READ | PROT_WRITE,  \
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0))
#endif

int main() {
  clock_t start, finish;
#ifdef A
  int *arr1 = ALLOC(sizeof(int), SIZE);
  int *arr2 = ALLOC(sizeof(int), SIZE);
#else
  int *arr2 = ALLOC(sizeof(int), SIZE);
  int *arr1 = ALLOC(sizeof(int), SIZE);
#endif
  int i;

  start = clock();
  {
    for (i = 0; i < SIZE; i++)
      arr1[i] = (i + 13) * 5;
  }
  finish = clock();

  printf("Time A: %.2f\n", ((double)(finish - start))/CLOCKS_PER_SEC);

  start = clock();
  {
    for (i = 0; i < SIZE; i++)
      arr2[i] = (i + 13) * 5;
  }
  finish = clock();

  printf("Time B: %.2f\n", ((double)(finish - start))/CLOCKS_PER_SEC);

  return 0;
}

The output I get:

 ~/directory $ cc -Wall -O3 bench-loop.c -o bench-loop
 ~/directory $ ./bench-loop 
Time A: 0.94
Time B: 0.34
 ~/directory $ cc -DA -Wall -O3 bench-loop.c -o bench-loop
 ~/directory $ ./bench-loop                               
Time A: 0.34
Time B: 0.90
 ~/directory $ cc -DUSE_MMAP -DA -Wall -O3 bench-loop.c -o bench-loop
 ~/directory $ ./bench-loop                                          
Time A: 0.89
Time B: 0.90
 ~/directory $ cc -DUSE_MMAP -Wall -O3 bench-loop.c -o bench-loop 
 ~/directory $ ./bench-loop                                      
Time A: 0.91
Time B: 0.92

Short Answer

The first time that calloc is called it is explicitly zeroing out the memory. While the next time that it is called it assumed that the memory returned from mmap is already zeroed out.

Details

Here's some of the things that I checked to come to this conclusion that you could try yourself if you wanted:

  1. Insert a calloc call before your first ALLOC call. You will see that after this the Time for Time A and Time B are the same.

  2. Use the clock() function to check how long each of the ALLOC calls take. In the case where they are both using calloc you will see that the first call takes much longer than the second one.

  3. Use time to time the execution time of the calloc version and the USE_MMAP version. When I did this I saw that the execution time for USE_MMAP was consistently slightly less.

  4. I ran with strace -tt -T which shows both the time of when the system call was made and how long it took. Here is part of the output:

Strace output:

21:29:06.127536 mmap(NULL, 2000015360, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fff806fd000 <0.000014>
21:29:07.778442 mmap(NULL, 2000015360, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fff093a0000 <0.000021>
21:29:07.778563 times({tms_utime=63, tms_stime=102, tms_cutime=0, tms_cstime=0}) = 4324241005 <0.000011>

You can see that the first mmap call took 0.000014 seconds, but that about 1.5 seconds elapsed before the next system call. Then the second mmap call took 0.000021 seconds, and was followed by the times call a few hundred microsecond later.

I also stepped through part of the application execution with gdb and saw that the first call to calloc resulted in numerous calls to memset while the second call to calloc did not make any calls to memset. You can see the source code for calloc here (look for __libc_calloc) if you are interested. As for why calloc is doing the memset on the first call but not subsequent ones I don't know. But I feel fairly confident that this explains the behavior you have asked about.

As for why the array that was zeroed memset has improved performance my guess is that it is because of values being loaded into the TLB rather than the cache since it is a very large array. Regardless the specific reason for the performance difference that you asked about is that the two calloc calls behave differently when they are executed.

Mysql help needed to optimize group by sub query

8 votes

Hi I seem to be a little stuck. Its quite a straight forward query.

If I run the quieries seperately it is not that slow but when I combine them its very slow.

I'm not sure how to optimise it. Any help would be much appreciated. Im basically only wanting to show multiple refunds. So where faultid exists more than once.

SELECT 
  r.* 
FROM faultrefunds_v2 r
WHERE r.id IN 
( 
    SELECT
    r1.id 
    FROM 
    faultrefunds_v2 r1 
    GROUP BY faultid
    HAVING count(r1.faultid) > 1
);

The results from explain are have been attached as an image

enter image description here

I guess, this qualifies rather as a re-writing than as an optimisation, but this is what I would try instead, anyway:

SELECT 
  r.* 
FROM faultrefunds_v2 r
WHERE EXISTS (
  SELECT *
  FROM faultrefunds_v2 r1 
  WHERE r1.faultid = r.faultid
    AND r1.id <> r.id
);

PhoneGap 1.4 wrapping Sencha Touch 2.X - What about performance?

8 votes

I'm building a multiplatform tablet app wrapping it with Phonegap 1.4 using just its webview, then I work my magic with the Sencha Touch 2 framework. By multiplatform I mean iOS 5.X+ and Android 3.0+ (for now).

This app is working great so far, all its features work on both systems but... On the Android tablet (Samsung GalaxyTab) its really slow. What's happening? Can I do something about it, or its just android's limit?

Thanks


*EDIT* (I'm trying to make this post somewhat useful to the comunity)

Sencha Touch, like many other Javascript Frameworks are not the best example of performance due to javascript itself.

Then Why use Senha Touch?

  • In my case: Multiplatform (iOS, Android, Windows Phone, Blackberry, Windows, Mac OSX, Linux. Sharing 80-90% of the code)

Mitigating performance issues due to lack of visual pre-process in Android systems:

  1. CSS3 heavy visual process:

    • Avoid Gradients
    • Avoid Shadows
    • Avoid Transformations and animations
  2. Good MVC practices:

    • Don't use more views and you actually showing
    • Pre-render / Pre-datafetch when possible to avoid render and data process simultaneously

Same here. I've tested many of my Sencha Touch 2 applications on Samsung GalaxyTab and the performance is really terrible. There's a fact (which maybe a part of actual reason) that, iOS does many pre-process and calculation before rendering to make it seems smoother to user's look and feel, while Android tends to render & process simultaneously on the go.

In general, it could be say that, to every cross-platform mobile apps built on Javascript, like Sencha Touch, iOS performance is significantly better than Android. However, Sencha Touch dev team is trying their best to improve this, hopefully it would be better in next releases. You could see this article about iOS & Android devices performance comparison.

http://www.sencha.com/blog/sencha-touch-2-developer-preview/

PS: While it's much relevant to the OS's limit, you can also optimize your app to make it perform better on Android devices. To my experience, the best practice is:

  • Do NOT use CSS3 too much.
  • Keep your DOM as minimal as possible.

Hope it helps.

Why does jQuery not provide a .firstChild method?

8 votes

I have seen plenty of discussion regarding the fastest way to select a first child element using jQuery. As can be expected, the native DOM firstChild property is much faster than using a jQuery selector or combination of selectors -- see http://jsperf.com/jquery-first-child-selection-performance/6. This is not usually a problem -- either it's being used in a place where performance is not a big deal, or it's easy enough to just access the DOM element and use its .firstChild property. However, there are a couple problems with this:

  • firstChild could return a text or comment node, rather than an element, as a jQuery selector would return
  • If I need to select the first child of multiple elements, I must either use a slow selector, or go to a lot of extra work iterating over DOM elements, adding them to a collection, then putting them back into a jQuery object.

It seems to me that the cost of adding a firstChild method to the core jQuery library would be far smaller than the benefits. I took my own shot at creating such a method for my own use:

$.fn.firstChild = function() {
    var ret = [];

    this.each(function(){
        var el = this.firstChild;

        //the DOM firstChild property could return a text node or comment instead of an element
        while (el && el.nodeType != 1)
            el = el.nextSibling;

            if (el) ret.push(el);
        });

        //maintain jQuery chaining and end() functionality
        return this.pushStack(ret);
    };
}

In the tests i created at http://jsperf.com/jquery-multiple-first-child-selection, this function performs more than five times faster than any other option. The tests are based on the tests mentioned above, but are selecting the first children of multiple elements, rather than a single element.

Is there something I am missing? A technique that I should be using? Or is this an issue than one should never worry about? Is there a reason to not include a function like this in jQuery?

"Why does jQuery not provide a .firstChild method?"

Feature creep, most likely.

It can be accomplished with other methods as you stated, and if performance is a concern, you can extend jQuery for that specific need as you've done.


You can improve performance in your code a little more...

$.fn.firstChild = function () {
    var ret = [];
    // use a for loop
    for (var i = 0, len = this.length; i < len; i++) {
        var this_el = this[i],
            el = this_el.firstElementChild; // try firstElementChild first
        if (!el) {
            el = this_el.firstChild;
            while (el && el.nodeType != 1)
                el = el.nextSibling;
        }
        if (el) ret.push(el);
    }
    //maintain jQuery chaining and end() functionality
    return this.pushStack(ret);
};

Execution Time Slower with each Iteration of the same SPROC

6 votes

Running the same Stored Procedure from C# .Net application over a network gets progressively slower with each subsequent execution. It appears to take twice the amount of time as the previous execution (up to a max value; read on). The execution time becomes progressively slower until 1 of 2 scenarios happens, at which point the first execution of the SPROC is "fast" again.

  1. If an SqlConnection is opened and remains open during all testing, the SPROC gets progressively slower until any other SPROC or query is run.
  2. If an SqlConnection is opened and closed around each execution, the SPROC gets progressively slower until at least 8 minutes has passed.

This only happens with a few Stored Procedures. One is a simple SELECT query with 2 JOINs, (SPROC 1) another is a massive 1600+ line SPROC (SPROC 2).

The execution times appear to never go beyond exactly 60 seconds for SPROC 1 and 67 seconds for SPROC 2. SPROC 1 takes less than a second to execute initially, and SPROC 2 takes 7 seconds initially.

This also only happens if the SPROC is run using the same SqlConnection in the application. As soon as 2 separate SqlConnection objects are used, they behave the same as stated above, but are independent. Running the SPROC multiple times on SqlConnection1 gets progressively slower, but the first time the same SPROC is run on SqlConnection2, it's "fast". It will then also get slower when run multiple times on SqlConnection2.

This does not happen if the application is run on the same computer with SQL Server 2008 R2 installed (running Windows Server 2008). The execution time is always consistent.

Running the Stored Procedure from within Management Studio also does not get slower with each execution; it is always consistent.

Clearing the execution plan cache (in SQL Server) has no effect on the observed behavior.

It has taken quite a few days to narrow down this issue originally observed in a much larger application, in order to create a test app to easily test/reproduce it.

From what I've read here, there is a timeout of between 4 and 8 minutes (after SqlConnection.Close() is called in code) at which point it closes the database connection to the data source. This appears to be in line with the scenario 2 I mentioned above.

This leads me to believe it is related to the SqlConnection used (and the underlying database connection to the data source) since connection pooling is enabled in my case, but why am I observing this behavior, and how do I fix it?

We are using the .Net 2.0 Framework, if that makes any difference.

There are many fine details listed above, so please let me know if I need to clarify anything.

The only Stack Overflow question with any similarities is this, but was unrelated to my issue.

Edit: The following code is executed in my WinForms test app on startup:

SqlConnectionStringBuilder connectionStringBuilder = new SqlConnectionStringBuilder();

connectionStringBuilder.DataSource = m_DataSource;
connectionStringBuilder.InitialCatalog = m_InitialCatalog;
connectionStringBuilder.UserID = m_UserID;
connectionStringBuilder.Password = m_Password;
connectionStringBuilder.IntegratedSecurity = false;
connectionString = connectionStringBuilder.ConnectionString;

m_DatabaseConnection = new SqlConnection(connectionString);

I then have 2 buttons; one of which calls SPROC 1 mentioned above, and the other calls a different SPROC which does not have the same slowdown issue. The following code is executed on either button click (only difference being the SPROC name):

m_DatabaseConnection.Open();
m_DatabaseCommand = new SqlCommand("GetCompanies", m_DatabaseConnection);
m_DatabaseCommand.Parameters.AddWithValue("@StatusID", StatusID);
m_DatabaseCommand.CommandType = CommandType.StoredProcedure;
m_DatabaseCommand.CommandTimeout = 0;

SqlDataAdapter databaseDataAdapter = new SqlDataAdapter(m_DatabaseCommand);
DataSet databaseDataSet = new DataSet();
databaseDataAdapter.Fill(databaseDataSet);
m_DatabaseConnection.Close();

Here are my ideas to debug this problem:

  • Try calling SqlConnection.ClearAllPools() after Disposing the connection. If this fixes the problem, the problem is tied to a particular connection for sure.
  • Next, enclose the SPROC in an explicit transaction.
  • Next, call SqlConnection.ClearAllPools() before invoking the SPROC.
  • How much data does the SPROC return?
  • Please post the C# code you are using to open the connection and execute the SPROC.
  • Create a standalone console app that is reproducing the behavior you are seeing. This will (likely) prove that something in your app is the problem because the console app will run just fine.

How to prepare Django for a possible slashdotting?

4 votes

I would like to prepare my website for a possible influx in traffic. This is my first time using Django as a framework, so I'm unsure of the modifications that should be made to assure that I'm ready and won't go down. What are some of the common things one can do to prepare a Django website for production-level traffic?

I'm also wondering what to expect in terms of traffic numbers. I'm currently hosted at Webfaction with 600GB/month of traffic. Will this quickly run out? Are there statistics on how big 'slashdotted' events are?

  1. Use memcache and caching middleware.
  2. Be sure to offload serving statics.
  3. Use CDN for statics. This doesn't directly affect Django, but will reduce your network traffic.

Anything beyond that — read up what others are using: