Best performance questions in December 2011

Why is one loop so much slower than two loops?

381 votes

Suppose a1, b1, c1, and d1 point to heap memory and my numerical code has the following core loop.

const int n=100000

for(int j=0;j<n;j++){
    a1[j] += b1[j];
    c1[j] += d1[j];
}

This loop is executed 10,000 times via another outer for loop. To speed it up, I changed the code to:

for(int j=0;j<n;j++){
    a1[j] += b1[j];
}
for(int j=0;j<n;j++){
    c1[j] += d1[j];
}

Compiled on MS Visual C++ 10.0 with full optimization and SSE2 enabled for 32-bit on a Intel Core 2 Duo (x64), the first example takes 5.5 seconds and the double-loop example takes only 1.9 seconds. My question is: (Please refer to the my rephrased question at the bottom)

PS: I am not sure, if this helps:

Disassembly for the first loop basically looks like this (this block is repeated about five times in the full program):

movsd       xmm0,mmword ptr [edx+18h]
addsd       xmm0,mmword ptr [ecx+20h]
movsd       mmword ptr [ecx+20h],xmm0
movsd       xmm0,mmword ptr [esi+10h]
addsd       xmm0,mmword ptr [eax+30h]
movsd       mmword ptr [eax+30h],xmm0
movsd       xmm0,mmword ptr [edx+20h]
addsd       xmm0,mmword ptr [ecx+28h]
movsd       mmword ptr [ecx+28h],xmm0
movsd       xmm0,mmword ptr [esi+18h]
addsd       xmm0,mmword ptr [eax+38h]

Each loop of the double loop example produces this code (the following block is repeated about three times):

addsd       xmm0,mmword ptr [eax+28h]
movsd       mmword ptr [eax+28h],xmm0
movsd       xmm0,mmword ptr [ecx+20h]
addsd       xmm0,mmword ptr [eax+30h]
movsd       mmword ptr [eax+30h],xmm0
movsd       xmm0,mmword ptr [ecx+28h]
addsd       xmm0,mmword ptr [eax+38h]
movsd       mmword ptr [eax+38h],xmm0
movsd       xmm0,mmword ptr [ecx+30h]
addsd       xmm0,mmword ptr [eax+40h]
movsd       mmword ptr [eax+40h],xmm0

EDIT: The question turned out to be of no relevance, as the behavior severely depends on the sizes of the arrays (n) and the CPU cache. So if there is further interest, I rephrase the question:

Could you provide some solid insight into the details that lead to the different cache behaviors as illustrated by the five regions on the following graph?

It might also be interesting to point out the differences between CPU/cache architectures, by providing a similar graph for these CPUs.

PPS: The full code is at http://pastebin.com/ivzkuTzG. It uses TBB Tick_Count for higher resolution timing, which can be disabled by not defining the TBB_TIMING Macro.

(It shows FLOP/s for different values of n.)

enter image description here

Upon further analysis of this, I believe this is (at least partially) caused by data alignment of the four pointers. This will cause some level of cache bank/way conflicts.

If I've guessed correctly on how you are allocating your arrays, they are likely to be aligned to the page line.

This means that all your accesses in each loop will fall on the same cache way. However, Intel processors have had 8-way L1 cache associativity for a while. But in reality, the performance isn't completely uniform. Accessing 4-ways is still slower than say 2-ways.

EDIT : It does in fact look like you are allocating all the arrays separately. Usually when such large allocations are requested, the allocator will request fresh pages from the OS. Therefore, there is a high chance that large allocations will appear at the same offset from a page-boundary.

Here's the test code:

int main(){
    const int n = 100000;

#ifdef ALLOCATE_SEPERATE
    double *a1 = (double*)malloc(n * sizeof(double));
    double *b1 = (double*)malloc(n * sizeof(double));
    double *c1 = (double*)malloc(n * sizeof(double));
    double *d1 = (double*)malloc(n * sizeof(double));
#else
    double *a1 = (double*)malloc(n * sizeof(double) * 4);
    double *b1 = a1 + n;
    double *c1 = b1 + n;
    double *d1 = c1 + n;
#endif

    //  Zero the data to prevent any chance of denormals.
    memset(a1,0,n * sizeof(double));
    memset(b1,0,n * sizeof(double));
    memset(c1,0,n * sizeof(double));
    memset(d1,0,n * sizeof(double));

    //  Print the addresses
    cout << a1 << endl;
    cout << b1 << endl;
    cout << c1 << endl;
    cout << d1 << endl;

    clock_t start = clock();

    int c = 0;
    while (c++ < 10000){

#if ONE_LOOP
        for(int j=0;j<n;j++){
            a1[j] += b1[j];
            c1[j] += d1[j];
        }
#else
        for(int j=0;j<n;j++){
            a1[j] += b1[j];
        }
        for(int j=0;j<n;j++){
            c1[j] += d1[j];
        }
#endif

    }

    clock_t end = clock();
    cout << "seconds = " << (double)(end - start) / CLOCKS_PER_SEC << endl;

    system("pause");
    return 0;
}

Benchmark Results:

EDIT: Results on an actual Core 2 architecture machine:

2 x Intel Xeon X5482 Harpertown @ 3.2 GHz:

#define ALLOCATE_SEPERATE
#define ONE_LOOP
00600020
006D0020
007A0020
00870020
seconds = 6.206

#define ALLOCATE_SEPERATE
//#define ONE_LOOP
005E0020
006B0020
00780020
00850020
seconds = 2.116

//#define ALLOCATE_SEPERATE
#define ONE_LOOP
00570020
00633520
006F6A20
007B9F20
seconds = 1.894

//#define ALLOCATE_SEPERATE
//#define ONE_LOOP
008C0020
00983520
00A46A20
00B09F20
seconds = 1.993

Observations:

  • 6.206 seconds with one loop and 2.116 seconds with two loops. This reproduces the OP's results exactly.

  • In the first two tests, the arrays are allocated separately. You'll notice that they all have the same alignment relative to the page.

  • In the second two tests, the arrays are packed together to break that alignment. Here you'll notice both loops are faster. Furthermore, the second (double) loop is now the slower one as you would normally expect.

As @Stephen Cannon points out in the comments, there is very likely possibility that this alignment causes false aliasing in the load/store units or the cache. I Googled around for this and found that Intel actually has a hardware counter for partial address aliasing stalls:

http://software.intel.com/sites/products/documentation/hpc/amplifierxe/en-us/lin/ug_docs/reference/pmw_dp/events/partial_address_alias.html


5 Regions - Explanations

Region 1:

This one is easy. The dataset is so small that the performance is dominated by overhead like looping and branching.

Region 2:

Here, as the data sizes increases, the amount of relative overhead goes down and the performance "saturates". Here two loops is slower because it has twice as much loop and branching overhead.

I'm not sure exactly what's going on here... Alignment could still play an effect as Agner Fog mentions cache bank conflicts. (That link is about Sandy Bridge, but the idea should still be applicable to Core 2.)

Region 3:

At this point, the data no longer fits in L1 cache. So performance is capped by the L1 <-> L2 cache bandwidth.

Region 4:

The performance drop in the single-loop is what we are observing. And as mentioned, this is due to the alignment which (most likely) causes false aliasing stalls in the processor load/store units.

However, in order for false aliasing to occur, there must be a large enough stride between the datasets. This is why you don't see this in region 3.

Region 5:

At this point, nothing fits in cache. So you're bound by memory bandwidth.


2 x Intel X5482 Harpertown @ 3.2 GHz Intel Core i7 870 @ 2.8 GHz Intel Core i7 2600K @ 4.4 GHz

How did Firefox optimize this loop?

31 votes

Firefox 9.0.1 surprised me by showing up my Ω(log n) number-padding algorithm with a Ω(n) loop method when n is small. In every other browser I've seen, the loop is slower, even for small values of n. I know all the browsers are working on optimizing JS, but since all the other, modern browsers are showing the loop to be slower, is there any explanation for the behavior in Firefox 9?

// Ω(log n)
function padNumberMath(number, length) {
    var N = Math.pow(10, length);
    return number < N ? ("" + (N + number)).slice(1) : "" + number
}

// Ω(n):
function padNumberLoop(number, length) {
    var my_string = '' + number;
    while (my_string.length < length) {
        my_string = '0' + my_string;
    }
    return my_string;
}

Update: I don't think this is related to the original question, but I just discovered that IE 9 switches behavior when switching from 32- to 64-bit modes. In 32-bit mode, the Math method wins. In 64-bit mode, the Loop method wins. Just thought I should point that out.

Update 2: MAK caught me in his comment below. The math method's not Ω(1), it's probably more like Ω(log n).

Looking at the results its pretty clear that Firefox didn't do anything to achieve a performance gain.

browserscope

These bars can be read as "speeds" (operations/sec) so bigger bars are better. Everything is to scale.

In Firefox 9, its very clear the "Math" method performs abysmally, while there is little change in "Loop" method between versions.

So there was no "optimization" of any sort in Firefox 9. All that happened between Firefox 8 and 9 with regard to these tests is somehow their math library got slower (Math.pow being slow), or their string library got slower (.slice() being slow).

Looking into this further, it's clear somehow these elementary operations got a bit slower in ff9:

ff8 vs ff9

Both concatenation and Math.pow are a bit slower in FF 9 than in FF 8, which may account for the difference you see in your tests.

Interestingly, the new no-op bar is much longer in FF8 than FF9.

Appengine, performance degradation with python27

20 votes

I wanted to test python27 on appengine so I have migrated my app from python25. Performance got more than 2x slower for every request! Then I've returned to python25 and performance is again as it was before. Here is a picture:

enter image description here (milliseconds/request) (cgi handler python 27, then python25)

My app uses Werkzeug, Jinja2, and memcache is used quite alot. What reasons can cause such a dramatic decrease in performance? Or is it just because python2.7 on appengine is still in beta?

Some details about application:

It is quite simple online shop. There are some deferred tasks with pdf generation however these don't affect overall graph much because the front page gets most hits. Nearly everything is memcached. It takes up to ~0.8 sec with empty cache to load a page with python 2.5. Non-cached pages takes long to load mainly because there are many db queries. Cached pages load in 60~100 ms. Average load time is ~150 ms. With python 2.7 performance is terrible. Non-cached pages takes 2+ secs to load. Cached pages load in 200+ ms.

Unfortunately I don't have any profiling data and I can't tell what exactly slows things down in python 2.7.

My numbers for page-load time are collected from live page which serves ~10 req/sec and 1 resident python25 instance easily deals with this load.

I have also tested python 2.7 with wsgi and threadsafe:yes, but performance improved just a little compared to python 2.7 and cgi.

Somewhere on Usenet I read a statement like this from Google "he Python 2.7 runtime is slower than the Python 2.5 runtime in some cases and faster in others. We aren't publicizing the reasons why at this point.". Seems nobody has found so far a scenario where 2.7 is faster than 2.5 thus ...

I read into this

  1. Google is aware of a preformace issue.
  2. They not sure how to handle it.

My profiling indicates that python 2.7 multithreaded applications spend ca. 35 % of their time in {method 'acquire' of 'thread.lock' objects} - and seemingly that happens in googles RPC code. There are also indication that importing has serious locking issues.

Basically you can't do anything about it except waiting for the AppEngine to fix it. See also this comprehensive documentation about the slowdown.

Factors which almost certainly don't play a significant role in this are:

  • uploading pyc files (the GAE infrastructure does that for you)
  • the server provisioning (GAE is such a massive scale and even in closed beta 2.7. was slow)
  • WSGI vs. CGI (wouldn't explain such a huge performance hit)

The ugly thing is that Google did a massive price hike in two steps but told us "Py using multithreaded python 2.7 you can run so much more effective that the new prices don't look so bad". Unfortunately the Python 2.7. runtime is still labeled "experimental" and doesn't deliver production quality performance.

C# & .NET: stackalloc

19 votes

I have a few questions about the functionality of the stackalloc operator.

  1. How does it actually allocate? I thought it does something like:

    void* stackalloc(int sizeInBytes)
    {
        void* p = StackPointer (esp);
        StackPointer += sizeInBytes;
        if(StackPointer exceeds stack size)
            throw new StackOverflowException(...);
        return p;
    }
    

    But I have done a few tests, and I'm not sure that's how it work. We can't know exactly what it does and how it does it, but I want to know the basics.

  2. I thought that stack allocation (Well, I am actually sure about it) is faster than heap allocation. So why does this example:

     class Program
     {
         static void Main(string[] args)
         {
             Stopwatch sw1 = new Stopwatch();
             sw1.Start();
             StackAllocation();
             Console.WriteLine(sw1.ElapsedTicks);
    
             Stopwatch sw2 = new Stopwatch();
             sw2.Start();
             HeapAllocation();
             Console.WriteLine(sw2.ElapsedTicks);
         }
         static unsafe void StackAllocation()
         {
             for (int i = 0; i < 100; i++)
             {
                 int* p = stackalloc int[100];
             }
         }
         static void HeapAllocation()
         {
             for (int i = 0; i < 100; i++)
             {
                 int[] a = new int[100];
             }
         }
     }
    

gives the average results of 280~ ticks for stack allocation, and usually 1-0 ticks for heap allocation? (On my personal computer, Intel Core i7).

On the computer I am using now (Intel Core 2 Duo), the results make more sense that the previous ones (Probably because Optimize code was not checked in VS): 460~ ticks for stack allocation, and about 380 ticks for heap allocation.

But this still doesn't make sense. Why is it so? I guess that the CLR notices that we don't use the array, so maybe it doesn't even allocate it?

A case where stackalloc is faster:

 private static volatile int _dummy; // just to avoid any optimisations
                                         // that have us measuring the wrong
                                         // thing. Especially since the difference
                                         // is more noticable in a release build
                                         // (also more noticable on a multi-core
                                         // machine than single- or dual-core).
 static void Main(string[] args)
 {
     System.Diagnostics.Stopwatch sw1 = new System.Diagnostics.Stopwatch();
     Thread[] threads = new Thread[20];
     sw1.Start();
     for(int t = 0; t != 20; ++t)
     {
        threads[t] = new Thread(DoSA);
        threads[t].Start();
     }
     for(int t = 0; t != 20; ++t)
        threads[t].Join();
     Console.WriteLine(sw1.ElapsedTicks);

     System.Diagnostics.Stopwatch sw2 = new System.Diagnostics.Stopwatch();
     threads = new Thread[20];
     sw2.Start();
     for(int t = 0; t != 20; ++t)
     {
        threads[t] = new Thread(DoHA);
        threads[t].Start();
     }
     for(int t = 0; t != 20; ++t)
        threads[t].Join();
     Console.WriteLine(sw2.ElapsedTicks);
     Console.Read();
 }
 private static void DoSA()
 {
    Random rnd = new Random(1);
    for(int i = 0; i != 100000; ++i)
        StackAllocation(rnd);
 }
 static unsafe void StackAllocation(Random rnd)
 {
    int size = rnd.Next(1024, 131072);
    int* p = stackalloc int[size];
    _dummy = *(p + rnd.Next(0, size));
 }
 private static void DoHA()
 {
    Random rnd = new Random(1);
    for(int i = 0; i != 100000; ++i)
        HeapAllocation(rnd);
 }
 static void HeapAllocation(Random rnd)
 {
    int size = rnd.Next(1024, 131072);
    int[] a = new int[size];
    _dummy = a[rnd.Next(0, size)];
 }

Important differences between this code and that in the question:

  1. We have several threads running. With stack allocation, they are allocating in their own stack. With heap allocation, they are allocating from a heap shared with other threads.

  2. Larger sizes allocated.

  3. Different sizes allocated each time (though I seeded the random generator to make the tests more deterministic). This makes heap fragmentation more likely to happen, making heap allocation less efficient than with identical allocations each time.

As well as this, it's also worth noting that stackalloc would often be used as an alternative to using fixed to pin an array on the heap. Pinning arrays is bad for heap performance (not just for that code, but also for other threads using the same heap), so the performance impact would be even greater then, if the claimed memory would be in use for any reasonable length of time.

While my code demonstrates a case where stackalloc gives a performance benefit, that in the question is probably closer to most cases where someone might eagerly "optimise" by using it. Hopefully the two pieces of code together show that whole stackalloc can give a boost, it can also hurt performance a lot too.

Generally, you shouldn't even consider stackalloc unless you are going to need to use pinned memory for interacting with unmanaged code anyway, and it should be considered an alternative to fixed rather than an alternative to general heap allocation. Use in this case still requires caution, forethought before you start, and profiling after you finish.

Use in other cases could give a benefit, but it should be far down the list of performance improvements you would try.

Edit:

To answer part 1 of the question. Stackalloc is conceptually much as you describe. It obtains a chunk of the stack memory, and then returns a pointer to that chunk. It doesn't check the memory will fit as such, but rather if it attempts to obtain memory into the end of the stack - which is protected by .NET on thread creation - then this will cause the OS to return an exceptioin to the runtime, which it then turns into a .NET managed exception. Much the same happens if you just allocate a single byte in a method with infinite recursion - unless the call got optimised to avoid that stack allocation (sometimes possible), then a single byte will eventually add up to enough to trigger the stack overflow exception.

Why is LINQ .Where(predicate).First() faster than .First(predicate)?

18 votes

I am doing some performance tests and noticed that a LINQ expression like

result = list.First(f => f.Id == i).Property

is slower than

result = list.Where(f => f.Id == i).First().Property

This seems counter intuitive. I would have thought that the first expression would be faster because it can stop iterating over the list as soon as the predicate is satisfied, whereas I would have thought that the .Where() expression might iterate over the whole list before calling .First() on the resulting subset. Even if the latter does short circuit it should not be faster than using First directly, but it is.

Below are two really simple unit tests that illustrate this. When compiled with optimisation on TestWhereAndFirst is about 30% faster than TestFirstOnly on .Net and Silverlight 4. I have tried making the predicate return more results but the performance difference is the same.

Can any one explain why .First(fn) is slower than .Where(fn).First()? I see a similar counter intuitive result with .Count(fn) compared to .Where(fn).Count().

private const int Range = 50000;

private class Simple
{
   public int Id { get; set; }
   public int Value { get; set; }
}

[TestMethod()]
public void TestFirstOnly()
{
   List<Simple> list = new List<Simple>(Range);
   for (int i = Range - 1; i >= 0; --i)
   {
      list.Add(new Simple { Id = i, Value = 10 });
   }

   int result = 0;
   for (int i = 0; i < Range; ++i)
   {
      result += list.First(f => f.Id == i).Value;
   }

   Assert.IsTrue(result > 0);
}

[TestMethod()]
public void TestWhereAndFirst()
{
   List<Simple> list = new List<Simple>(Range);
   for (int i = Range - 1; i >= 0; --i)
   {
      list.Add(new Simple { Id = i, Value = 10 });
   }

   int result = 0;
   for (int i = 0; i < Range; ++i)
   {
      result += list.Where(f => f.Id == i).First().Value;
   }

   Assert.IsTrue(result > 0);
}

I got the same results: where+first was quicker than first.

As Jon noted, Linq uses lazy evaluation so the performance should be (and is) broadly similar for both methods.

Looking in Reflector, First uses a simple foreach loop to iterate through the collection but Where has a variety of iterators specialised for different collection types (arrays, lists, etc.). Presumably this is what gives Where the small advantage.

Is Java's System.arraycopy() efficient for small arrays?

16 votes

Is Java's System.arraycopy() efficient for small arrays, or does the fact that it's a native method make it likely to be substantially less efficient than a simple loop and a function call?

Do native methods incur additional performance overhead for crossing some kind of Java-system bridge?

Expanding a little on what Sid has written, it's very likely that System.arraycopy is just a JIT intrinsic; meaning that when code calls System.arraycopy, it will most probably be calling a JIT-specific implementation (once the JIT tags System.arraycopy as being "hot") that is not executed through the JNI interface, so it doesn't incur the normal overhead of native methods.

In general, executing native methods does have some overhead (going through the JNI interface, also some internal JVM operations cannot happen when native methods are being executed). But it's not because a method is marked as "native" that you're actually executing it using JNI. The JIT can do some crazy things.

Easiest way to check is, as has been suggested, writing a small benchmark, being careful with the normal caveats of Java microbenchmarks (warm up the code first, avoid code with no side-effects since the JIT just optimizes it as a no-op, etc).

Multiple INSERT statements vs. single INSERT with multiple VALUES

14 votes

I'm running a performance comparison between using 1000 INSERT statements:

INSERT INTO T_TESTS (TestId, FirstName, LastName, Age) 
   VALUES ('6f3f7257-a3d8-4a78-b2e1-c9b767cfe1c1', 'First 0', 'Last 0', 0)
INSERT INTO T_TESTS (TestId, FirstName, LastName, Age) 
   VALUES ('32023304-2e55-4768-8e52-1ba589b82c8b', 'First 1', 'Last 1', 1)
...
INSERT INTO T_TESTS (TestId, FirstName, LastName, Age) 
   VALUES ('f34d95a7-90b1-4558-be10-6ceacd53e4c4', 'First 999', 'Last 999', 999)

..versus using single INSERT statement with 1000 values:

INSERT INTO T_TESTS (TestId, FirstName, LastName, Age) 
VALUES 
('db72b358-e9b5-4101-8d11-7d7ea3a0ae7d', 'First 0', 'Last 0', 0),
('6a4874ab-b6a3-4aa4-8ed4-a167ab21dd3d', 'First 1', 'Last 1', 1),
...
('9d7f2a58-7e57-4ed4-ba54-5e9e335fb56c', 'First 999', 'Last 999', 999)

To my big surprise, the results are the opposite of what I thought:

  • 1000 INSERT statements: 290 msec.
  • 1 INSERT statement with 1000 VALUES: 2800 msec.

The test is executed directly in MSSQL Management Studio with SQL Server Profiler used for measurement (and I've got similar results running it from C# code using SqlClient, which is even more suprising considering all the DAL layers roundtrips)

Can this be reasonable or somehow explained? How come, a supposedly faster method results in 10 times (!) worse performance?

Thank you.

EDIT: Attaching execution plans for both: Exec Plans

Your plan shows the single inserts are using parameterised procedures (possibly auto parameterised) so parse/compile time for these should be minimal.

I thought I'd look into this a bit more though so set up a loop (script) and tried adjusting the number of VALUES clauses and recording the compile time.

I then divided the compile time by the number of rows to get the average compile time per clause. The results are below

Graph

Up until 250 VALUES clauses present the compile time / number of clauses has a slight upward trend but nothing to get too excited about.

Graph

But then there is a sudden change.

That section of the data is shown below.

+------+----------------+-------------+---------------+---------------+
| Rows | CachedPlanSize | CompileTime | CompileMemory | Duration/Rows |
+------+----------------+-------------+---------------+---------------+
|  245 |            528 |          41 |          2400 | 0.167346939   |
|  246 |            528 |          40 |          2416 | 0.162601626   |
|  247 |            528 |          38 |          2416 | 0.153846154   |
|  248 |            528 |          39 |          2432 | 0.157258065   |
|  249 |            528 |          39 |          2432 | 0.156626506   |
|  250 |            528 |          40 |          2448 | 0.16          |
|  251 |            400 |         273 |          3488 | 1.087649402   |
|  252 |            400 |         274 |          3496 | 1.087301587   |
|  253 |            400 |         282 |          3520 | 1.114624506   |
|  254 |            408 |         279 |          3544 | 1.098425197   |
|  255 |            408 |         290 |          3552 | 1.137254902   |
+------+----------------+-------------+---------------+---------------+

The cached plan size which had been growing linearly suddenly drops but CompileTime increases 7 fold and CompileMemory shoots up. As @Mikael Eriksson points out this is also the moment the plan changes from an auto parametrized one (with 1,0000 parameters) to a non parametrized one. Thereafter it seems to get linearly less efficient (in terms of number of value clauses processed in a given time) so maybe try restricting your batches to 250 and see how that compares.

Not sure why this should be. Presumably when it is compiling a plan for specific literal values it must perform some activity that does not scale linearly (such as sorting) when constructing the internal table of constants.

I tried to look at this in a debugger but the public symbols for my version of SQL Server 2008 don't seem to be available so instead I had to look at the equivalent UNION ALL construction in SQL Server 2005.

A typical stack trace is below

sqlservr.exe!FastDBCSToUnicode()  + 0xac bytes  
sqlservr.exe!nls_sqlhilo()  + 0x35 bytes    
sqlservr.exe!CXVariant::CmpCompareStr()  + 0x2b bytes   
sqlservr.exe!CXVariantPerformCompare<167,167>::Compare()  + 0x18 bytes  
sqlservr.exe!CXVariant::CmpCompare()  + 0x11f67d bytes  
sqlservr.exe!CConstraintItvl::PcnstrItvlUnion()  + 0xe2 bytes   
sqlservr.exe!CConstraintProp::PcnstrUnion()  + 0x35e bytes  
sqlservr.exe!CLogOp_BaseSetOp::PcnstrDerive()  + 0x11a bytes    
sqlservr.exe!CLogOpArg::PcnstrDeriveHandler()  + 0x18f bytes    
sqlservr.exe!CLogOpArg::DeriveGroupProperties()  + 0xa9 bytes   
sqlservr.exe!COpArg::DeriveNormalizedGroupProperties()  + 0x40 bytes    
sqlservr.exe!COptExpr::DeriveGroupProperties()  + 0x18a bytes   
sqlservr.exe!COptExpr::DeriveGroupProperties()  + 0x146 bytes   
sqlservr.exe!COptExpr::DeriveGroupProperties()  + 0x146 bytes   
sqlservr.exe!COptExpr::DeriveGroupProperties()  + 0x146 bytes   
sqlservr.exe!CQuery::PqoBuild()  + 0x3cb bytes  
sqlservr.exe!CStmtQuery::InitQuery()  + 0x167 bytes 
sqlservr.exe!CStmtDML::InitNormal()  + 0xf0 bytes   
sqlservr.exe!CStmtDML::Init()  + 0x1b bytes 
sqlservr.exe!CCompPlan::FCompileStep()  + 0x176 bytes   
sqlservr.exe!CSQLSource::FCompile()  + 0x741 bytes  
sqlservr.exe!CSQLSource::FCompWrapper()  + 0x922be bytes    
sqlservr.exe!CSQLSource::Transform()  + 0x120431 bytes  
sqlservr.exe!CSQLSource::Compile()  + 0x2ff bytes   

So it seems to spend a lot of time comparing (presumably sorting) strings. No idea to what end though.

It doesn't seem to affect the size of the query plan when I tried a query consisting entirely of duplicate rows and neither affects the order of the output of the table of the constants (and as you are inserting into a heap time spent sorting would be pointless anyway even if it did).

Are repeated backgrounds in html more efficient if they are larger than 1px?

12 votes

If there is a one-dimensional background that is repeated on a certain dimension, is there any mentionable difference in performance if the image is e.g. 1px wide versus 10 or 20 pixels wide?

I assume you mean a two-dimensional background.

I can't imagine that there is any noticeable difference, on modern computers. However, because bandwidth is still at a premium, especially on mobile devices, I think you would be better off conserving bandwidth by repeating a 1px wide image instead of say 2 or 3 px wide.

UPDATE: We just ran a test, unscientifically, but certainly perceptually relevant, in which one page rendered a 10px green square over a 10,000,000px square div, and another page rendered a 1px green square over the same size div. All styles are set with CSS, both pages had no other content. The graphics were loaded locally. There was absolutely no perceptual difference in the rendering in either Safari 5 for Mac, or FireFox 8 for Mac. Still, it's possible that there could be performance issues on certain models of older (crappier) smart phones.

Python multiprocessing - Pipe vs Queue

11 votes

What are the fundamental differences between queues and pipes in Python's multiprocessing package?

In what scenarios should one choose one over the other? When is it advantageous to use Pipe()? When is it advantageous to use Queue()?

  • A Pipe() can only have two endpoints.

  • A Queue() can have multiple producers and consumers.

When to use them

If you need more than two points to communicate, use a Queue().

If you need absolute performance, a Pipe() is orders of magnitude faster because Queue() is built on top of Pipe().

Performance Benchmarking

Let's assume you want to spawn two processes and send messages between them as quickly as possible. These are the timing results of a drag race between similar tests using Pipe() and Queue()... This is on a ThinkpadT61 running Ubuntu 11.10, and Python 2.7.2.

FYI, I threw in results for JoinableQueue() as a bonus; JoinableQueue() accounts for tasks when queue.task_done() is called (it doesn't even know about the specific task, it just counts unfinished tasks in the queue), so that queue.join() knows the work is finished.

The code for each at bottom of this answer...

mpenning@mpenning-T61:~$ python multi_pipe.py 
Sending 10000 numbers to Pipe() took 0.0369849205017 seconds
Sending 100000 numbers to Pipe() took 0.328398942947 seconds
Sending 1000000 numbers to Pipe() took 3.17266988754 seconds
mpenning@mpenning-T61:~$ python multi_queue.py 
Sending 10000 numbers to Queue() took 0.105256080627 seconds
Sending 100000 numbers to Queue() took 0.980564117432 seconds
Sending 1000000 numbers to Queue() took 10.1611330509 seconds
mpnening@mpenning-T61:~$ python multi_joinablequeue.py 
Sending 10000 numbers to JoinableQueue() took 0.172781944275 seconds
Sending 100000 numbers to JoinableQueue() took 1.5714070797 seconds
Sending 1000000 numbers to JoinableQueue() took 15.8527247906 seconds
mpenning@mpenning-T61:~$

In summary Pipe() is about three times faster than a Queue(). Don't even think about the JoinableQueue() unless you really must have the benefits.


"""
multi_pipe.py
"""
from multiprocessing import Process, Pipe
import time

def reader(pipe):
    output_p, input_p = pipe
    input_p.close()    # We are only reading
    while True:
        try:
            msg = output_p.recv()    # Read msg off the pipe and do nothing
        except EOFError:
            break

def writer(count, input_p):
    for ii in xrange(0, count):
        input_p.send(ii)

if __name__=='__main__':
    for count in [10**4, 10**5, 10**6]:
        output_p, input_p = Pipe()
        reader_p = Process(target=reader, args=((output_p, input_p),))
        reader_p.start()     # Launch the reader process

        output_p.close()       # We no longer need this part of the Pipe()
        _start = time.time()
        writer(count, input_p) # Send a lot of stuff to reader()
        input_p.close()        # Ask the reader to stop when it reads EOF
        reader_p.join()
        print "Sending %s numbers to Pipe() took %s seconds" % (count, 
            (time.time() - _start))

"""
multi_queue.py
"""
from multiprocessing import Process, Queue
import time

def reader(queue):
    while True:
        msg = queue.get()
        if (msg == 'DONE'):
            break

def writer(count, queue):
    for ii in xrange(0, count):
        queue.put(ii)
    queue.put('DONE')

if __name__=='__main__':
    for count in [10**4, 10**5, 10**6]:
        queue = Queue()
        reader_p = Process(target=reader, args=((queue),))
        reader_p.daemon = True
        reader_p.start()     # Launch the reader process

        _start = time.time()
        writer(count, queue)    # Send a lot of stuff to reader()
        reader_p.join()         # Wait for the reader to finish
        print "Sending %s numbers to Queue() took %s seconds" % (count, 
            (time.time() - _start))

"""
multi_joinablequeue.py
"""
from multiprocessing import Process, JoinableQueue
import time

def reader(queue):
    while True:
        msg = queue.get()
        queue.task_done()

def writer(count, queue):
    for ii in xrange(0, count):
        queue.put(ii)

if __name__=='__main__':
    for count in [10**4, 10**5, 10**6]:
        queue = JoinableQueue()
        reader_p = Process(target=reader, args=((queue),))
        reader_p.daemon = True
        reader_p.start()     # Launch the reader process

        _start = time.time()
        writer(count, queue) # Send a lot of stuff to reader()
        queue.join()         # Wait for the reader to finish
        print "Sending %s numbers to JoinableQueue() took %s seconds" % (count, 
            (time.time() - _start))

Comparing the performance of $("#foo .bar") and $(".bar", "#foo")

10 votes

Scroll down for the getById.getByClassName vs. qSA comparison!


If we wanted to select all elements of class "bar" which are inside the element with the ID "foo", we could write this:

$( '#foo .bar' )

or this:

$( '.bar', '#foo' )

There are of course other methods to achieve this, but for the sake of this question, let's compare only these two methods.

So, which of the above methods performs better? (Which needs less time to execute?)

I have written this performance test:

(function() {
    var i;

    console.time('test1');
    for( i = 0; i < 100; i++ ) {
        $('#question-mini-list .tags');
    }
    console.timeEnd('test1');

    console.time('test2');
    for( i = 0; i < 100; i++ ) {
        $('.tags', '#question-mini-list');
    }
    console.timeEnd('test2');
})();

You have to execute it from within the console on the Stack Overflow start-page. My results are:

Firefox:
test1: ~90ms
test2: ~18ms

Chrome:
test1: ~65ms
test2: ~30ms

Opera:
test1: ~50ms
test2: ~100ms

So in Firefox and Chrome, the second method is multiple times faster - just as I expected. However, in Opera the situation is reversed. I wonder what's going on here.

Could you please run the test on your machine and explain why Opera performs differently?


Update

I've written this test, in order to investigate whether Opera's qSA really is super-fast. As it turns out, it is.

(function() {
    var i, limit = 5000, test1 = 'test1', test2 = 'test2';

    console.time( test1 );
    for( i = 0; i < limit; i += 1 ) {
        document.getElementById( 'question-mini-list' ).getElementsByClassName( 'tags' );
    }
    console.timeEnd( test1 );

    console.time( test2 );
    for( i = 0; i < limit; i += 1 ) {
        document.querySelectorAll( '#question-mini-list .tags' );
    }
    console.timeEnd( test2 );
})();

Again, you have to run this code from within the console on the Stack Overflow start-page. I used the Firebug Lite bookmarklet for IE9 (since that browser doesn't implement console.time).

So, I compared this method:

document.getelementById( 'A' ).getElementsByClassName( 'B' );

to this method:

document.querySelectorAll( '#A .B' );

I've executed the above script five consecutive times in each browser. The arithmetic means are:

enter image description here

(All numbers are in milliseconds.)

So, the performance of the first method is pretty much the same in the tested browsers (16-36ms). However, while qSA is much slower compared to the first method, in Opera it actually is faster!

So, qSA optimization is possible, I wonder what the other browsers are waiting for...

jQuery/Sizzle will avoid using the JavaScript based Sizzle engine if the browser supports querySelectorAll, and if you pass a valid selector (no custom, non-CSS selectors).

This means that you're ultimately comparing implementations of querySelectorAll, assuming you're testing browsers that support it.

There are other optimizations that jQuery or Sizzle uses, so it's tricky when comparing different types of DOM selection in different browsers.

The reason for Opera's performance result seems to be that they have a very highly optimized querySelectorAll implementation. qSA, being a relatively new method, hasn't been quite as optimized in some browsers compared to older methods like getElementsByTagName.

Hibernate, JDBC and Java performance on medium and big result set

7 votes

Issue

We are trying to optimize our dataserver application. It stores stocks and quotes over a mysql database. And we are not satisfied with the fetching performances.

Context

- database
    - table stock : around 500 lines
    - table quote : 3 000 000 to 10 000 000 lines
    - one-to-many association : one stock owns n quotes
    - fetching around 1000 quotes per request
    - there is an index on (stockId,date) in the quote table
    - no cache, because in production, querys are always different
- Hibernate 3
- mysql 5.5
- Java 6
- JDBC mysql Connector 5.1.13
- c3p0 pooling

Tests and results

Protocol

  • Execution times on mysql server are obtained with running the generated sql queries in mysql command line bin.
  • The server is in a test context : no other DB readings, no DB writings
  • We fetch 857 quotes for the AAPL stock

Case 1 : Hibernate with association

This fills up our stock object with 857 quotes object (everything correctly mapped in hibernate.xml)

session.enableFilter("after").setParameter("after", 1322910573000L);
Stock stock = (Stock) session.createCriteria(Stock.class).
add(Restrictions.eq("stockId", stockId)).
setFetchMode("quotes", FetchMode.JOIN).uniqueResult();

SQL generated :

SELECT this_.stockId AS stockId1_1_,
       this_.symbol AS symbol1_1_,
       this_.name AS name1_1_,
       quotes2_.stockId AS stockId1_3_,
       quotes2_.quoteId AS quoteId3_,
       quotes2_.quoteId AS quoteId0_0_,
       quotes2_.value AS value0_0_,
       quotes2_.stockId AS stockId0_0_,
       quotes2_.volume AS volume0_0_,
       quotes2_.quality AS quality0_0_,
       quotes2_.date AS date0_0_,
       quotes2_.createdDate AS createdD7_0_0_,
       quotes2_.fetcher AS fetcher0_0_
FROM stock this_
LEFT OUTER JOIN quote quotes2_ ON this_.stockId=quotes2_.stockId
AND quotes2_.date > 1322910573000
WHERE this_.stockId='AAPL'
ORDER BY quotes2_.date ASC

Results :

  • Execution time on mysql server : ~10 ms
  • Execution time in Java : ~400ms

Case 2 : Hibernate without association without HQL

Thinking to increase performance, we've used that code that fetch only the quotes objects and we manually add them to a stock (so we don't fetch repeated infos about the stock for every line). We used createSQLQuery to minimize effects of aliases and HQL mess.

String filter = " AND q.date>1322910573000";
filter += " ORDER BY q.date DESC";
Stock stock = new Stock(stockId);
stock.addQuotes((ArrayList<Quote>) session.createSQLQuery("select * from quote q where stockId='" + stockId + "' " + filter).addEntity(Quote.class).list());

SQL generated :

SELECT *
FROM quote q
WHERE stockId='AAPL'
  AND q.date>1322910573000
ORDER BY q.date ASC

Results :

  • Execution time on mysql server : ~10 ms
  • Execution time in Java : ~370ms

Case 3 : JDBC without Hibernate

String filter = " AND q.date>1322910573000";
filter += " ORDER BY q.date DESC";
Stock stock = new Stock(stockId);
Connection conn = SimpleJDBC.getConnection();
Statement stmt = conn.createStatement();
ResultSet rs = stmt.executeQuery("select * from quote q where stockId='" + stockId + "' " + filter);
while(rs.next())
{
    stock.addQuote(new Quote(rs.getInt("volume"), rs.getLong("date"), rs.getFloat("value"), rs.getByte("fetcher")));
}
stmt.close();
conn.close();

Results :

  • Execution time on mysql server : ~10 ms
  • Execution time in Java : ~100ms

Our understandings

  • The JDBC driver is common to all the cases
  • There is a fundamental time cost in JDBC driving
  • With similar sql queries, Hibernate spends more time than pure JDBC code in converting result sets in objects
  • Hibernate createCriteria, createSQLQuery or createQuery are similar in time cost
  • In production, where we have lots of writing concurrently, pure JDBC solution seemed to be slower than the hibernate one (maybe because our JDBC solutions was not pooled)
  • Mysql wise, the server seems to behave very well, and the time cost is very acceptable

Our questions

  • Is there a way to optimize the performance of JDBC driver ?
  • And will Hibernate benefit this optimization ?
  • Is there a way to optimize Hibernate performance when converting result sets ?
  • Are we facing something not tunable because of Java fundamental object and memory management ?
  • Are we missing a point, are we stupid and all of this is vain ?
  • Are we french ? Yes.

Your help is very welcome.

Can you do a smoke test with the simples query possible like:

SELECT current_timestamp()

or

SELECT 1 + 1

This will tell you what is the actual JDBC driver overhead. Also it is not clear whether both tests are performed from the same machine.

Is there a way to optimize the performance of JDBC driver ?

Run the same query several thousand times in Java. JVM needs some time to warm-up (class-loading, JIT). Also I assume SimpleJDBC.getConnection() uses C3P0 connection pooling - the cost of establishing a connection is pretty high so first few execution could be slow.

Also prefer named queries to ad-hoc querying or criteria query.

And will Hibernate benefit this optimization ?

Hibernate is a very complex framework. As you can see it consumes 75% of the overall execution time compared to raw JDBC. If you need raw ORM (no lazy-loading, dirty checking, advanced caching), consider mybatis. Or maybe even JdbcTemplate with RowMapper abstraction.

Is there a way to optimize Hibernate performance when converting result sets ?

Not really. Check out the Chapter 19. Improving performance in Hibernate documentation. There is a lot of reflection happening out there + class generation. Once again, Hibernate might not be a best solution when you want to squeeze every millisecond from your database.

However it is a good choice when you want to increase the overall user experience due to extensive caching support. Check out the performance doc again. It mostly talks about caching. There is a first level cache, second level cache, query cache... This is the place where Hibernate might actually outperform simple JDBC - it can cache a lot in a ways you could not even imagine. On the other hand - poor cache configuration would lead to even slower setup.

Check out: Caching with Hibernate + Spring - some Questions!

Are we facing something not tunable because of Java fundamental object and memory management ?

JVM (especially in server configuration) is quite fast. Object creation on the heap is as fast as on the stack in e.g. C, garbage collection has been greatly optimized. I don't think the Java version running plain JDBC would be much slower compared to more native connection. That's why I suggested few improvements in your benchmark.

Are we missing a point, are we stupid and all of this is vain ?

I believe that JDBC is a good choice if performance is your biggest issue. Java has been used successfully in a lot of database-heavy applications.

MVC Render Speedup

6 votes

I just hooked up the mvc-mini-profiler (thanks SO!) on my site and was looking around to see how well I've done up to this point (it's my first major bout with linq to entities and mvc). So far everything is looking good, however I'm always looking for ways to improve response times. At this point it looks like the only major boost I could get would be from reducing the time it takes to render the individual views on each of my pages.

profiler screeny

You can see from my screeny that the rendering of the Blog view is the longest running task. I know that 30ms is already really quick, but I'm betting there are still some tricks I can pull to get these numbers even lower.

So the question is this: How can I reduce view render times? I know that caching of dynamic views into something like the HttpRuntime.Cache can help, but I'm even seeing several ms durations for static view rendering. What techniques do you use to lower the render times of your views?

I suggest 2 things (if you don't have it done yet)...

  1. Remove unused ViewEngines. So if your project uses only the razor view engine, do this in the global.asax on Application_Start();

    ViewEngines.Engines.Clear();
    ViewEngines.Engines.Add(new RazorViewEngine());
    

    or

    ViewEngines.Engines.Add(new WebFormViewEngine());
    

    if you use the WebFormsViewEngine only

  2. The biggest improvment is to use the OutputCacheAttribute to cache the html. I dont think your Blog changes on every Request ;)

    public class BlogController : Controller
    {
        [OutputCache]
        public ActionResult Index()
        {
           // do something here
           return View();
        }
    }
    

You can set the cache-duration and more. Check out: MSDN - OutputCacheAttribute.

Speeding up perl DBI fetchrow_hashref

5 votes

I have something that looks like this:

my $report = new ReportGenerator; #custom object
my $dbh = $dbc->prepare('SELECT * FROM some_table WHERE some_condition'); #DBI handle
$dbh->execute();
while(my $href = $dbh->fetchrow_hashref){
    $report->process_record($href);
}
$dbh->finish();
print $report->printReport();

My problem is that each iteration of the loop is very slow. The problem is the MySQL. I was wondering if it was possible to put some kind of wrapper in the while loop to make it fetch more than one record at a time, at the same time, fetching all records into memory is not practical either. I am not worried about the efficiency of the code(hashref vs arrayref,etc..). Rather, I am interested in fetching lets say 10000 records at a time.

The database has ~5 Million records. I can not change/upgrade the server.

Thanks

You can use the fetchall_arrayref function which accepts a 'maxrows' argument:

while (my $data = $dbc->fetchall_arrayref(undef, 10000)) {
  for my $row( @{$data} ) {
    $report->process_record($row);
  }
}

You could also look at the RowCacheSize property which attempts to control how many records are returned in a fetch from your driver.