Best performance questions in May 2012

str performance in python

68 votes

While profiling a piece of python code (python 2.6 up to 3.2), I discovered that the str method to convert an object (in my case an integer) to a string is almost an order of magnitude slower than using string formatting.

Here is the benchmark

>>> from timeit import Timer
>>> Timer('str(100000)').timeit()
0.3145311339386332
>>> Timer('"%s"%100000').timeit()
0.03803517023435887

Does anyone know why this is the case? Am I missing something?

'%s' % 100000 is evaluated by the compiler and is equivalent to a constant at run-time.

>>> import dis
>>> dis.dis(lambda: str(100000))
  8           0 LOAD_GLOBAL              0 (str)
              3 LOAD_CONST               1 (100000)
              6 CALL_FUNCTION            1
              9 RETURN_VALUE        
>>> dis.dis(lambda: '%s' % 100000)
  9           0 LOAD_CONST               3 ('100000')
              3 RETURN_VALUE        

% with a run-time expression is not (significantly) faster than str:

>>> Timer('str(x)', 'x=100').timeit()
0.25641703605651855
>>> Timer('"%s" % x', 'x=100').timeit()
0.2169809341430664

Do note that str is still slightly slower, as @DietrichEpp said, this is because str involves lookup and function call operations, while % compiles to a single immediate bytecode:

>>> dis.dis(lambda x: str(x))
  9           0 LOAD_GLOBAL              0 (str)
              3 LOAD_FAST                0 (x)
              6 CALL_FUNCTION            1
              9 RETURN_VALUE        
>>> dis.dis(lambda x: '%s' % x)
 10           0 LOAD_CONST               1 ('%s')
              3 LOAD_FAST                0 (x)
              6 BINARY_MODULO       
              7 RETURN_VALUE        

Of course the above is true for the system I tested on (CPython 2.7); other implementations may differ.

What is the cost of '$(this)'?

35 votes

People here often suggest to cache the jQuery object created from a DOM element, like with this code:

$('#container input').each(function() {
    $(this).addClass('fooClass');
    $(this).attr('data-bar', "bar");
    $(this).css('background-color', 'red');
});
  • Does caching the jQuery object really improve the performance of our code?
  • What happens "behind the scenes" when you pass a DOM element to the jQuery constructor?

Inspired by this question on meta, I'll answer my question.

In the jQuery tag info this warning appears:

The jQuery function $() is expensive. Calling it repeatedly is extremely inefficient.

Well... that is true only for strings selectors, which get parsed with regex to find out what are they:

quickExpr = /^(?:[^#<]*(<[\w\W]+>)[^>]*$|#([\w\-]*)$)/

Then if the string is a selector(other than id) jQuery traverse the DOM to find a match with it's expensive find function:

} else if ( !context || context.jquery ) {
    return ( context || rootjQuery ).find( selector );
}

So yes it's expensive, but that is only true for selectors!
If we pass a DOMElement the only action jQuery does is saving the DOMElement parameter as the context of the new just created jQuery object and setting the length of the context to 1:

// Handle $(DOMElement)
if ( selector.nodeType ) {
    this.context = this[0] = selector; // Selector here is a DOMElement
    this.length = 1;
    return this;
}

I did some tests with jsPerf, and I found that indeed caching the jQuery object has only a little effect:

enter image description here

In Chrome it's only 7% slower. (In IE it's a little bit more significant- 12%)

Why am I observing multiple inheritance to be faster than single?

32 votes

I have the following two files :-

single.cpp :-

#include <iostream>
#include <stdlib.h>

using namespace std;

unsigned long a=0;

class A {
  public:
    virtual int f() __attribute__ ((noinline)) { return a; } 
};

class B : public A {                                                                              
  public:                                                                                                                                                                        
    virtual int f() __attribute__ ((noinline)) { return a; }                                      
    void g() __attribute__ ((noinline)) { return; }                                               
};                                                                                                

int main() {                                                                                      
  cin>>a;                                                                                         
  A* obj;                                                                                         
  if (a>3)                                                                                        
    obj = new B();
  else
    obj = new A();                                                                                

  unsigned long result=0;                                                                         

  for (int i=0; i<65535; i++) {                                                                   
    for (int j=0; j<65535; j++) {                                                                 
      result+=obj->f();                                                                           
    }                                                                                             
  }                                                                                               

  cout<<result<<"\n";                                                                             
}

And

multiple.cpp :-

#include <iostream>
#include <stdlib.h>

using namespace std;

unsigned long a=0;

class A {
  public:
    virtual int f() __attribute__ ((noinline)) { return a; }
};

class dummy {
  public:
    virtual void g() __attribute__ ((noinline)) { return; }
};

class B : public A, public dummy {
  public:
    virtual int f() __attribute__ ((noinline)) { return a; }
    virtual void g() __attribute__ ((noinline)) { return; }
};


int main() {
  cin>>a;
  A* obj;
  if (a>3)
    obj = new B();
  else
    obj = new A();

  unsigned long result=0;

  for (int i=0; i<65535; i++) {
    for (int j=0; j<65535; j++) {
      result+=obj->f();
    }
  }

  cout<<result<<"\n";
}

I am using gcc version 3.4.6 with flags -O2

And this is the timings results I get :-

multiple :-

real    0m8.635s
user    0m8.608s
sys 0m0.003s

single :-

real    0m10.072s
user    0m10.045s
sys 0m0.001s

On the other hand, if in multiple.cpp I invert the order of class derivation thus :-

class B : public dummy, public A {

Then I get the following timings (which is slightly slower than that for single inheritance as one might expect thanks to 'thunk' adjustments to the this pointer that the code would need to do) :-

real    0m11.516s
user    0m11.479s
sys 0m0.002s

Any idea why this may be happening? There doesn't seem to be any difference in the assembly generated for all three cases as far as the loop is concerned. Is there some other place that I need to look at?

Also, I have bound the process to a specific cpu core and I am running it on a real-time priority with SCHED_RR.

EDIT:- This was noticed by Mysticial and reproduced by me. Doing a

cout << "vtable: " << *(void**)obj << endl;

just before the loop in single.cpp leads to single also being as fast as multiple clocking in at 8.4 s just like public A, public dummy.

I think I got at least some further lead on why this may be happening. The assembly for the loops is exactly identical but the object files are not!

For the loop with the cout at first (i.e.)

cout << "vtable: " << *(void**)obj << endl;

for (int i=0; i<65535; i++) {
  for (int j=0; j<65535; j++) {
    result+=obj->f();
  }
}

I get the following in the object file :-

40092d:       bb fe ff 00 00          mov    $0xfffe,%ebx                                       
400932:       48 8b 45 00             mov    0x0(%rbp),%rax                                     
400936:       48 89 ef                mov    %rbp,%rdi                                          
400939:       ff 10                   callq  *(%rax)                                            
40093b:       48 98                   cltq                                                      
40093d:       49 01 c4                add    %rax,%r12                                          
400940:       ff cb                   dec    %ebx                                               
400942:       79 ee                   jns    400932 <main+0x42>                                 
400944:       41 ff c5                inc    %r13d                                              
400947:       41 81 fd fe ff 00 00    cmp    $0xfffe,%r13d                                      
40094e:       7e dd                   jle    40092d <main+0x3d>                                 

However, without the cout, the loops become :- (.cpp first)

for (int i=0; i<65535; i++) {
  for (int j=0; j<65535; j++) {
    result+=obj->f();
  }
}

Now, .obj :-

400a54:       bb fe ff 00 00          mov    $0xfffe,%ebx
400a59:       66                      data16                                                    
400a5a:       66                      data16 
400a5b:       66                      data16                                                    
400a5c:       90                      nop                                                       
400a5d:       66                      data16                                                    
400a5e:       66                      data16                                                    
400a5f:       90                      nop                                                       
400a60:       48 8b 45 00             mov    0x0(%rbp),%rax                                     
400a64:       48 89 ef                mov    %rbp,%rdi                                          
400a67:       ff 10                   callq  *(%rax)
400a69:       48 98                   cltq   
400a6b:       49 01 c4                add    %rax,%r12                                          
400a6e:       ff cb                   dec    %ebx                                               
400a70:       79 ee                   jns    400a60 <main+0x70>                                 
400a72:       41 ff c5                inc    %r13d                                              
400a75:       41 81 fd fe ff 00 00    cmp    $0xfffe,%r13d
400a7c:       7e d6                   jle    400a54 <main+0x64>                          

So I'd have to say it's not really due to false aliasing as Mysticial points out but simply due to these NOPs that the compiler/linker is emitting.

The assembly in both cases is :-

.L30:
        movl    $65534, %ebx
        .p2align 4,,7                   
.L29:
        movq    (%rbp), %rax            
        movq    %rbp, %rdi
        call    *(%rax)
        cltq    
        addq    %rax, %r12                                                                        
        decl    %ebx
        jns     .L29
        incl    %r13d 
        cmpl    $65534, %r13d
        jle     .L30

Now, .p2align 4,,7 will insert data/NOPs until the instruction counter for the next instruction has the last four bits 0's for a maximum of 7 NOPs. Now the address of the instruction just after p2align in the case without cout and before padding would be

0x400a59 = 0b101001011001

And since it takes <=7 NOPs to align the next instruction, it will in fact do so in the object file.

On the other hand, for the case with the cout, the instruction just after .p2align lands up at

0x400932 = 0b100100110010

and it would take > 7 NOPs to pad it to a divisible by 16 boundary. Hence, it doesn't do that.

So the extra time taken is simply due to the NOPs that the compiler pads the code with (for better cache alignment) when compiling with the -O2 flag and not really due to false aliasing.

I think this resolves the issue. I am using http://sourceware.org/binutils/docs/as/P2align.html as my reference for what .p2align actually does.

Penalty to implement Serializable in Java?

24 votes

Is there a penalty to add

implements Serializable

to a Java class? Impact on size of instantiated object or performance?

The cost is close to zero, not worth being concerned about.

Some further details:

  • There is no increase in the size of each object instance
  • There is a small increase in the size of the class itself, but as this is a one-off cost it is trivial when amortised over a large number of instances
  • There may be a slight additional runtime cost for anything that needs to do interface checks at runtime (reflection, instancof lookups, extra pressure on inline caches etc.). Again, this is likely to be negligible for most purposes.
  • Serializable is a marker interface, there are no methods that require to be implemented. Other marker interface examples are: Clonable, SingleThreadModel, Event listener.

Why does JIT order affect performance?

22 votes

Why does the order in which C# methods in .NET 4.0 are just-in-time compiled affect how quickly they execute? For example, consider two equivalent methods:

public static void SingleLineTest()
{
    Stopwatch stopwatch = new Stopwatch();
    stopwatch.Start();
    int count = 0;
    for (uint i = 0; i < 1000000000; ++i) {
        count += i % 16 == 0 ? 1 : 0;
    }
    stopwatch.Stop();
    Console.WriteLine("Single-line test --> Count: {0}, Time: {1}", count, stopwatch.ElapsedMilliseconds);
}

public static void MultiLineTest()
{
    Stopwatch stopwatch = new Stopwatch();
    stopwatch.Start();
    int count = 0;
    for (uint i = 0; i < 1000000000; ++i) {
        var isMultipleOf16 = i % 16 == 0;
        count += isMultipleOf16 ? 1 : 0;
    }
    stopwatch.Stop();
    Console.WriteLine("Multi-line test  --> Count: {0}, Time: {1}", count, stopwatch.ElapsedMilliseconds);
}

The only difference is the introduction of a local variable, which affects the assembly code generated and the loop performance. Why that is the case is a question in its own right.

Possibly even stranger is that on x86 (but not x64), the order that the methods are invoked has around a 20% impact on performance. Invoke the methods like this...

static void Main()
{
    SingleLineTest();
    MultiLineTest();
}

...and SingleLineTest is faster. (Compile using the x86 Release configuration, ensuring that "Optimize code" setting is enabled, and run the test from outside VS2010.) But reverse the order...

static void Main()
{
    MultiLineTest();
    SingleLineTest();
}

...and both methods take the same time (almost, but not quite, as long as MultiLineTest before). (When running this test, it's useful to add some additional calls to SingleLineTest and MultiLineTest to get additional samples. How many and what order doesn't matter, except for which method is called first.)

Finally, to demonstrate that JIT order is important, leave MultiLineTest first, but force SingleLineTest to be JITed first...

static void Main()
{
    RuntimeHelpers.PrepareMethod(typeof(Program).GetMethod("SingleLineTest").MethodHandle);
    MultiLineTest();
    SingleLineTest();
}

Now, SingleLineTest is faster again.

If you turn off "Suppress JIT optimization on module load" in VS2010, you can put a breakpoint in SingleLineTest and see that the assembly code in the loop is the same regardless of JIT order; however, the assembly code at the beginning of the method varies. But how this matters when the bulk of the time is spent in the loop is perplexing.

A sample project demonstrating this behavior is on github.

It's not clear how this behavior affects real-world applications. One concern is that it can make performance tuning volatile, depending on the order methods happen to be first called. Problems of this sort would be difficult to detect with a profiler. Once you found the hotspots and optimized their algorithms, it would be hard to know without a lot of guess and check whether additional speedup is possible by JITing methods early.

Update: See also the Microsoft Connect entry for this issue.

Please note that I do not trust the "Suppress JIT optimization on module load" option, I spawn the process without debugging and attach my debugger after the JIT has run.

In the version where single-line runs faster, this is Main:

        SingleLineTest();
00000000  push        ebp 
00000001  mov         ebp,esp 
00000003  call        dword ptr ds:[0019380Ch] 
            MultiLineTest();
00000009  call        dword ptr ds:[00193818h] 
            SingleLineTest();
0000000f  call        dword ptr ds:[0019380Ch] 
            MultiLineTest();
00000015  call        dword ptr ds:[00193818h] 
            SingleLineTest();
0000001b  call        dword ptr ds:[0019380Ch] 
            MultiLineTest();
00000021  call        dword ptr ds:[00193818h] 
00000027  pop         ebp 
        }
00000028  ret 

Note that MultiLineTest has been placed on an 8 byte boundary, and SingleLineTest on a 4 byte boundary.

Here's Main for the version where both run at the same speed:

            MultiLineTest();
00000000  push        ebp 
00000001  mov         ebp,esp 
00000003  call        dword ptr ds:[00153818h] 

            SingleLineTest();
00000009  call        dword ptr ds:[0015380Ch] 
            MultiLineTest();
0000000f  call        dword ptr ds:[00153818h] 
            SingleLineTest();
00000015  call        dword ptr ds:[0015380Ch] 
            MultiLineTest();
0000001b  call        dword ptr ds:[00153818h] 
            SingleLineTest();
00000021  call        dword ptr ds:[0015380Ch] 
            MultiLineTest();
00000027  call        dword ptr ds:[00153818h] 
0000002d  pop         ebp 
        }
0000002e  ret 

Amazingly, the addresses chosen by the JIT are identical in the last 4 digits, even though it allegedly processed them in the opposite order. Not sure I believe that any more.

More digging is necessary. I think it was mentioned that the code before the loop wasn't exactly the same in both versions? Going to investigate.

Here's the "slow" version of SingleLineTest (and I checked, the last digits of the function address haven't changed).

            Stopwatch stopwatch = new Stopwatch();
00000000  push        ebp 
00000001  mov         ebp,esp 
00000003  push        edi 
00000004  push        esi 
00000005  push        ebx 
00000006  mov         ecx,7A5A2C68h 
0000000b  call        FFF91EA0 
00000010  mov         esi,eax 
00000012  mov         dword ptr [esi+4],0 
00000019  mov         dword ptr [esi+8],0 
00000020  mov         byte ptr [esi+14h],0 
00000024  mov         dword ptr [esi+0Ch],0 
0000002b  mov         dword ptr [esi+10h],0 
            stopwatch.Start();
00000032  cmp         byte ptr [esi+14h],0 
00000036  jne         00000047 
00000038  call        7A22B314 
0000003d  mov         dword ptr [esi+0Ch],eax 
00000040  mov         dword ptr [esi+10h],edx 
00000043  mov         byte ptr [esi+14h],1 
            int count = 0;
00000047  xor         edi,edi 
            for (uint i = 0; i < 1000000000; ++i) {
00000049  xor         edx,edx 
                count += i % 16 == 0 ? 1 : 0;
0000004b  mov         eax,edx 
0000004d  and         eax,0Fh 
00000050  test        eax,eax 
00000052  je          00000058 
00000054  xor         eax,eax 
00000056  jmp         0000005D 
00000058  mov         eax,1 
0000005d  add         edi,eax 
            for (uint i = 0; i < 1000000000; ++i) {
0000005f  inc         edx 
00000060  cmp         edx,3B9ACA00h 
00000066  jb          0000004B 
            }
            stopwatch.Stop();
00000068  mov         ecx,esi 
0000006a  call        7A23F2C0 
            Console.WriteLine("Single-line test --> Count: {0}, Time: {1}", count, stopwatch.ElapsedMilliseconds);
0000006f  mov         ecx,797C29B4h 
00000074  call        FFF91EA0 
00000079  mov         ecx,eax 
0000007b  mov         dword ptr [ecx+4],edi 
0000007e  mov         ebx,ecx 
00000080  mov         ecx,797BA240h 
00000085  call        FFF91EA0 
0000008a  mov         edi,eax 
0000008c  mov         ecx,esi 
0000008e  call        7A23ABE8 
00000093  push        edx 
00000094  push        eax 
00000095  push        0 
00000097  push        2710h 
0000009c  call        783247EC 
000000a1  mov         dword ptr [edi+4],eax 
000000a4  mov         dword ptr [edi+8],edx 
000000a7  mov         esi,edi 
000000a9  call        793C6F40 
000000ae  push        ebx 
000000af  push        esi 
000000b0  mov         ecx,eax 
000000b2  mov         edx,dword ptr ds:[03392034h] 
000000b8  mov         eax,dword ptr [ecx] 
000000ba  mov         eax,dword ptr [eax+3Ch] 
000000bd  call        dword ptr [eax+1Ch] 
000000c0  pop         ebx 
        }
000000c1  pop         esi 
000000c2  pop         edi 
000000c3  pop         ebp 
000000c4  ret 

And the "fast" version:

            Stopwatch stopwatch = new Stopwatch();
00000000  push        ebp 
00000001  mov         ebp,esp 
00000003  push        edi 
00000004  push        esi 
00000005  push        ebx 
00000006  mov         ecx,7A5A2C68h 
0000000b  call        FFE11F70 
00000010  mov         esi,eax 
00000012  mov         ecx,esi 
00000014  call        7A1068BC 
            stopwatch.Start();
00000019  cmp         byte ptr [esi+14h],0 
0000001d  jne         0000002E 
0000001f  call        7A12B3E4 
00000024  mov         dword ptr [esi+0Ch],eax 
00000027  mov         dword ptr [esi+10h],edx 
0000002a  mov         byte ptr [esi+14h],1 
            int count = 0;
0000002e  xor         edi,edi 
            for (uint i = 0; i < 1000000000; ++i) {
00000030  xor         edx,edx 
                count += i % 16 == 0 ? 1 : 0;
00000032  mov         eax,edx 
00000034  and         eax,0Fh 
00000037  test        eax,eax 
00000039  je          0000003F 
0000003b  xor         eax,eax 
0000003d  jmp         00000044 
0000003f  mov         eax,1 
00000044  add         edi,eax 
            for (uint i = 0; i < 1000000000; ++i) {
00000046  inc         edx 
00000047  cmp         edx,3B9ACA00h 
0000004d  jb          00000032 
            }
            stopwatch.Stop();
0000004f  mov         ecx,esi 
00000051  call        7A13F390 
            Console.WriteLine("Single-line test --> Count: {0}, Time: {1}", count, stopwatch.ElapsedMilliseconds);
00000056  mov         ecx,797C29B4h 
0000005b  call        FFE11F70 
00000060  mov         ecx,eax 
00000062  mov         dword ptr [ecx+4],edi 
00000065  mov         ebx,ecx 
00000067  mov         ecx,797BA240h 
0000006c  call        FFE11F70 
00000071  mov         edi,eax 
00000073  mov         ecx,esi 
00000075  call        7A13ACB8 
0000007a  push        edx 
0000007b  push        eax 
0000007c  push        0 
0000007e  push        2710h 
00000083  call        782248BC 
00000088  mov         dword ptr [edi+4],eax 
0000008b  mov         dword ptr [edi+8],edx 
0000008e  mov         esi,edi 
00000090  call        792C7010 
00000095  push        ebx 
00000096  push        esi 
00000097  mov         ecx,eax 
00000099  mov         edx,dword ptr ds:[03562030h] 
0000009f  mov         eax,dword ptr [ecx] 
000000a1  mov         eax,dword ptr [eax+3Ch] 
000000a4  call        dword ptr [eax+1Ch] 
000000a7  pop         ebx 
        }
000000a8  pop         esi 
000000a9  pop         edi 
000000aa  pop         ebp 
000000ab  ret 

Just the loops, fast on the left, slow on the right:

00000030  xor         edx,edx                 00000049  xor         edx,edx 
00000032  mov         eax,edx                 0000004b  mov         eax,edx 
00000034  and         eax,0Fh                 0000004d  and         eax,0Fh 
00000037  test        eax,eax                 00000050  test        eax,eax 
00000039  je          0000003F                00000052  je          00000058 
0000003b  xor         eax,eax                 00000054  xor         eax,eax 
0000003d  jmp         00000044                00000056  jmp         0000005D 
0000003f  mov         eax,1                   00000058  mov         eax,1 
00000044  add         edi,eax                 0000005d  add         edi,eax 
00000046  inc         edx                     0000005f  inc         edx 
00000047  cmp         edx,3B9ACA00h           00000060  cmp         edx,3B9ACA00h 
0000004d  jb          00000032                00000066  jb          0000004B 

The instructions are identical (being relative jumps, the machine code is identical even though the disassembly shows different addresses), but the alignment is different. There are three jumps. the je loading a constant 1 is aligned in the slow version and not in the fast version, but it hardly matters, since that jump is only taken 1/16 of the time. The other two jumps ( jmp after loading a constant zero, and jb repeating the entire loop) are taken millions more times, and are aligned in the "fast" version.

I think this is the smoking gun.

Cast performance from size_t to double

21 votes

TL;DR: Why is multiplying/casting data in size_t slow and why does this vary per platform?

I'm having some performance issues that I don't fully understand. The context is a camera frame grabber where a 128x128 uint16_t image is read and post-processed at a rate of several 100 Hz.

In the post-processing I generate a histogram frame->histo which is of uint32_t and has thismaxval = 2^16 elements, basically I tally all intensity values. Using this histogram I calculate the sum and squared sum:

double sum=0, sumsquared=0;
size_t thismaxval = 1 << 16;

for(size_t i = 0; i < thismaxval; i++) {
    sum += (double)i * frame->histo[i];
    sumsquared += (double)(i * i) * frame->histo[i];
}

Profiling the code with profile I got the following (samples, percentage, code):

 58228 32.1263 :  sum += (double)i * frame->histo[i];
116760 64.4204 :  sumsquared += (double)(i * i) * frame->histo[i];

or, the first line takes up 32% of CPU time, the second line 64%.

I did some benchmarking and it seems to be the datatype/casting that's problematic. When I change the code to

uint_fast64_t isum=0, isumsquared=0;

for(uint_fast32_t i = 0; i < thismaxval; i++) {
    isum += i * frame->histo[i];
    isumsquared += (i * i) * frame->histo[i];
}

it runs ~10x faster. However, this performance hit also varies per platform. On the workstation, a Core i7 CPU 950 @ 3.07GHz the code is 10x faster. On my Macbook8,1, which has a Intel Core i7 Sandy Bridge 2.7 GHz (2620M) the code is only 2x faster.

Now I am wondering:

  1. Why is the original code so slow and easily sped up?
  2. Why does this vary per platform so much?

Update:

I compiled the above code with

g++ -O3  -Wall cast_test.cc -o cast_test

Update2:

I ran the optimized codes through a profiler (Instruments on Mac, like Shark) and found two things:

1) The looping itself takes a considerable amount of time in some cases. thismaxval is of type size_t.

  1. for(size_t i = 0; i < thismaxval; i++) takes 17% of my total runtime
  2. for(uint_fast32_t i = 0; i < thismaxval; i++) takes 3.5%
  3. for(int i = 0; i < thismaxval; i++) does not show up in the profiler, I assume it's less than 0.1%

2) The datatypes and casting matter as follows:

  1. sumsquared += (double)(i * i) * histo[i]; 15% (with size_t i)
  2. sumsquared += (double)(i * i) * histo[i]; 36% (with uint_fast32_t i)
  3. isumsquared += (i * i) * histo[i]; 13% (with uint_fast32_t i, uint_fast64_t isumsquared)
  4. isumsquared += (i * i) * histo[i]; 11% (with int i, uint_fast64_t isumsquared)

Surprisingly, int is faster than uint_fast32_t?

Update4:

I ran some more tests with different datatypes and different compilers, on one machine. The results are as follows.

For testd 0 -- 2 the relevant code is

    for(loop_t i = 0; i < thismaxval; i++)
        sumsquared += (double)(i * i) * histo[i];

with sumsquared a double, and loop_t size_t, uint_fast32_t and int for tests 0, 1 and 2.

For tests 3--5 the code is

    for(loop_t i = 0; i < thismaxval; i++)
        isumsquared += (i * i) * histo[i];

with isumsquared of type uint_fast64_t and loop_t again size_t, uint_fast32_t and int for tests 3, 4 and 5.

The compilers I used are gcc 4.2.1, gcc 4.4.7, gcc 4.6.3 and gcc 4.7.0. The timings are in percentages of total cpu time of the code, so they show relative performance, not absolute (although the runtime was quite constant at 21s). The cpu time is for both two lines, because I'm not quite sure if the profiler correctly separated the two lines of code.

gcc:    4.2.1  4.4.7  4.6.3  4.7.0
----------------------------------
test 0: 21.85  25.15  22.05  21.85
test 1: 21.9   25.05  22     22
test 2: 26.35  25.1   21.95  19.2
test 3: 7.15   8.35   18.55  19.95
test 4: 11.1   8.45   7.35   7.1
test 5: 7.1    7.8    6.9    7.05

or:

casting performance

Based on this, it seems that casting is expensive, regardless of what integer type I use.

Also, it seems gcc 4.6 and 4.7 are not able to optimize loop 3 (size_t and uint_fast64_t) properly.

For your original questions:

  1. The code is slow because it involves the conversion from integer to float data types. That's why it's easily sped up when you use also an integer datatype for the sum-variables because it doesn't require a float-conversion anymore.
  2. The difference is the result of several factors. For example it depends on how efficient a platform is able to perform an int->float conversion. Furthermore this conversion could also mess up processor-internal optimizations in the program flow and prediction engine, caches, ... and also the internal parallelizing-features of the processors can have a huge influence in such calculations.

For the additional questions:

  • "Surprisingly int is faster than uint_fast32_t"? What's the sizeof(size_t) and sizeof(int) on your platform? One guess I can make is, that both are probably 64bit and therefore a cast to 32bit not only can give you calculation errors but also includes a different-size-casting penalty.

In general try to avoid visible and hidden casts as good as possible if these aren't really necessary. For example try to find out what real datatype is hidden behind "size_t" on your environment (gcc) and use that one for the loop-variable. In your example the square of uint's cannot be a float datatype so it makes no sense to use double here. Stick to integer types to achieve maximum performance.

What is the Most Efficient way to compare large List of integers to smaller List of integers?

21 votes

At the moment I have a list of 1million integers, and I check each integer against a blacklist of 2000 integers. This is taking about 2 minutes.

for(int i = 0; i< MillionIntegerList.Length ; i++)
{
    for(int blacklisted = 0; blacklisted < TwoThousandIntegerList.Length ; blacklisted++)
        if(i==blacklisted)
            i = 0; //Zero is a sentinel value 
}

This makes 2,000,000,000 iterations(loops) altogether. Is there a better way Im not seeing? thanks

Three options now - the first two are more general, in that they don't rely on MillionIntegerList being sorted (which wasn't originally specified). The third is preferable in the case where the large list is already sorted.

Option 1

Yes, there's definitely a better way of doing it, using LINQ:

var common = MillionIntegerList.Intersect(TwoThousandIntegerList).ToList();

That will internally use a HashSet<int> built via the TwoThousandIntegerList, then look up each element of MillionIntegerList within it - which will be much more efficient than going through the whole of TwoThousandIntegerList each time.

If you only want the non-blacklisted ones, you need:

var valid = MillionIntegerList.Except(TwoThousandIntegerList).ToList();

Note that if you only need to iterate over the results once, you should remove the ToList call - I've included it to materialize the results so they can be examined multiple times cheaply. If you're just iterating, the return value of Intersect or Except will just stream the results, making it much cheaper in terms of memory usage.

Option 2

If you don't want to rely on the implementation details of LINQ to Objects but you still want a hash-based approach:

var hashSet = new HashSet<int>(TwoThousandIntegerList);
hashSet.IntersectWith(MillionIntegerList);
// Now use hashSet

Option 3

The approach of using the fact that the large list is sorted would definitely be useful.

Assuming you don't mind sorting the blacklisted list first as well, you could write a streaming (and general purpose) implementation like this (untested):

// Note: to use this, you'd need to make sure that *both* sequences are sorted.
// You could either sort TwoThousandIntegerList in place, or use LINQ's OrderBy
// method.

public IEnumerable<T> SortedIntersect<T>(this IEnumerable<T> first,
    IEnumerable<T> second) where T : IComparable<T>
{
    using (var firstIterator = first.GetEnumerator())
    {
        if (!firstIterator.MoveNext())
        {
            yield break;
        }

        using (var secondIterator = second.GetEnumerator())
        {
            if (!secondIterator.MoveNext())
            {
                yield break;
            }
            T firstValue = firstIterator.Current;
            T secondValue = secondIterator.Current;

            while (true)
            {
                int comparison = firstValue.CompareTo(secondValue);
                if (comparison == 0) // firstValue == secondValue
                {
                    yield return firstValue;
                }
                else if (comparison < 0) // firstValue < secondValue
                {
                    if (!firstIterator.MoveNext())
                    {
                        yield break;
                    }
                    firstValue = firstIterator.Current;
                }
                else // firstValue > secondValue
                {
                    if (!secondIterator.MoveNext())
                    {
                        yield break;
                    }
                    secondValue = secondIterator.Current;
                }  
            }                
        }
    }
}

(You could take an IComparer<T> if you wanted instead of relying on T being comparable.)

Performance test tool for iPhone application

9 votes

I am looking for free tool which can help conduction performance test of iPhone application. Is any one familiar with such tool to monitor the iPhone app performance?

Instruments (which is included in Xcode IDE) already provides a good suite of profiling and performance tools:

  • allocations of objects, and mark heap capability to see how your heap grows
  • time profiler that samples all the processes and helps you finding bottlenecks
  • system trace to see how processes inside the applications are scheduled and executed
  • energy diagnostics which helps profiling energy consumption of various devices used by your application

I don't know which kind of performance metrics you need to profile but as usually the good old sentence

95% of the CPU time is spent in 5% of the code

is true even for iOS applications so by profiling with time profiler you will mostly solve any performance issue. Just make sure to study its characteristics a little bit (eg. reverse call stack / objc only etc)

Slow MySQL Remote Connection

8 votes

Currently our site is running on 1 server (Ubuntu 10.04 - Rackspace is our host), and in order to be able to handle traffic spikes, we are currently using the highest option that Rackspace offers (30 GB RAM and an 8-core CPU).

CPU is our bottleneck, so I would like to put MySQL on its own server. I have tried doing this, but unfortunately it's adding 9 seconds to the page load time. PHP / MySQL is connecting from 1 server to the other through Rackspace's ServiceNet (local ip address). Before using the ServiceNet ip address, the page load was ridiculously slow (over 40 seconds).

I have added "skip-name-resolve" to my.cnf, and this did not seem to improve performance at all.

I am just wondering what options I have to reduce the remote MySQL connection time. It seems like there has to be something I'm missing because an extra 9 seconds is way too much.

Cloned server running MySQL from the other server: http://173.45.255.52/index.php?action=browse_members

Live server running MySQL locally: http://bros4bros.com/index.php?action=browse_members


UPDATE:

I did a simple query, and my page load time was very fast (270 ms), so I did a much larger query (SELECT * FROM cities WHERE 1... this table has almost 3,000,000 records), and the response time is still not much different from the live site: http://173.45.255.52/simple_query.php

I'm kind of stumped. How can it be adding 9 seconds to the site load time just by running MySQL remotely?

A waterfall shows it's just waiting for 9 seconds.

waiting

The live site's waterfall is below:

live waterfall


UPDATE 2:

Using mysql_pconnect instead of mysql_connect makes no difference. There is still an extra 8 - 9 seconds of waiting. Pinging one server to the other is less than 1 ms. I'm tearing my hair out now.

Temporarily disabling the firewall has no effect.

Using "mysqli_connect" instead of "mysql_connect" successfully connects but queries return no results... SCRATCHES HEAD


UPDATE 3:

Successfully updated the code to use "mysqli_connect" instead of "mysql_connect", and there was no performance benefit. I had to change my "mysql_query" statements to "mysqli_query", etc.


UPDATE 4:

Yeah, that sounds like it has to be the problem (in response to Jens' comment below). We're displaying 60 members on the page, so if we're doing two small queries per member displayed, that's 120 queries. I guess they're just adding up. Time to hit the code to find a more elegant solution to minimize queries... Interestingly enough I destroyed the second server, cloned the first server again, and now I'm only seeing an extra 2.5 seconds. If I develop a more elegant solution to minimize queries, I've got to be able to get that to negligible.

Reduce the number of MySQL queries!

I am in the process of converting a lot of small queries into fewer larger queries, and it is working. The site is now very fast when running MySQL from the other server.

Is this a correct database design?

6 votes

I'm working with the new version of an application. In the current version the database structure is changed, they say "to improve performance".

The old version of the DB had a general structure like this:

TABLE ENTITY
(
    ENTITY_ID,
    STANDARD_PROPERTY_1,
    STANDARD_PROPERTY_2,
    STANDARD_PROPERTY_3,
    ...
)

TABLE ENTITY_PROPERTIES
(
    ENTITY_ID,
    PROPERTY_KEY,
    PROPERTY_VALUE
)

so we had a main table with fields for the basic properties and a table to manage custom properties added by user.

The new version of the DB insted has a structure like this:

TABLE ENTITY
(
    ENTITY_ID,
    STANDARD_PROPERTY_1,
    STANDARD_PROPERTY_2,
    STANDARD_PROPERTY_3,
    ...
)

TABLE ENTITY_PROPERTIES_n
(
    ENTITY_ID_n,
    CUSTOM_PROPERTY_1,
    CUSTOM_PROPERTY_2,
    CUSTOM_PROPERTY_3,
    ...
)

So, now when the user add a custom property, a new column is added to the current ENTITY_PROPERTY table until the max number of columns (managed by application) is reached, then a new table is created.

So, my question is: Is this a correct way to design a DB structure? Is this the only way to "increase performances"? The old structure required a number of join or sub-select, but this structute don't seems to me very smart (or even correct)...

I have seen this done before on the assumed (often unproven) "expense" of joining - it is basically turning a row-heavy data table into a column-heavy table. They ran into their own limitation, as you imply, by creating new tables when they run out of columns.

I completely disagree with it.

Personally, I would stick with the old structure and re-evaluate the performance issues. That isn't to say the old way is the correct way, it is just marginally better than the "improvement" in my opinion, and removes the need to do large scale re-engineering of database tables and DAL code.

These tables strike me as largely static... caching would be an even better performance improvement without mutilating the database and one I would look at doing first. Do the "expensive" fetch once and stick it in memory somewhere, then forget about your troubles (note, I am making light of the need to manage the Cache, but static data is one of the easiest to manage).

Or, wait for the day you run into the maximum number of tables per database :-)

Others have suggested completely different stores. This is a perfectly viable possibility and if I didn't have an existing database structure I would be considering it too. That said, I see no reason why this structure can't fit into an RDBMS. I have seen it done on almost all large scale apps I have worked on. Interestingly enough, they all went down a similar route and all were mostly "successful" implementations.

How to improve performance of single-page application?

5 votes

Introduction
I have a (mostly) single-page application built with BackboneJS and a Rails backend.

Because most of the interaction happens on one page of the webapp, when the user first visits the page I basically have to pull a ton of information out of the database in one large deeply joined query.

This is causing me some rather extreme load times on this one page.

load times

NewRelic appears to be telling me that most of my problems are because of 457 individual fast method calls.

fast methid calls

Now I've done all the eager loading I can do (I checked with the Bullet gem) and I still have a problem.

These method calls are most likely ocurring in my Rabl serializer which I use to serialize a bunch of JSON to embed into the page for initializing Backbone. You don't need to understand all this but suffice to say it could add up to 457 method calls.

object @search
attributes :id, :name, :subscription_limit

# NOTE: Include a list of the members of this search.
child :searchers => :searchers do
  attributes :id, :name, :gravatar_icon
end

# Each search has many concepts (there could be over 100 of them).
child :concepts do |search|
  attributes :id, :title, :search_id, :created_at

  # The person who suggested each concept.
  child :suggester => :suggester do
    attributes :id, :name, :gravatar_icon
  end

  # Each concept has many suggestions (approx. 4 each).
  node :suggestions do |concept|
    # Here I'm scoping suggestions to only ones which meet certain conditions.
    partial "suggestions/show", object: concept.active_suggestions
  end

  # Add a boolean flag to tell if the concept is a favourite or not.
  node :favourite_id do |concept|
    # Another method call which occurs for each concept.
    concept.favourite_id_for(current_user)
  end
end

# Each search has subscriptions to certain services (approx. 4). 
child :service_subscriptions do
  # This contains a few attributes and 2 fairly innocuous method calls.
  extends "service_subscriptions/show"
end

So it seems that I need to do something about this but I'm not sure what approach to take. Here is a list of potential ideas I have:

Performance Improvement Ideas

Dumb-Down the Interface
Maybe I can come up with ways to present information to the user which don't require the actual data to be present. I don't see why I should absolutely need to do this though, other single-page apps such as Trello have incredibly complicated interfaces.

Concept Pagination
If I paginate concepts it will reduct the amount of data being extracted from the database each time. Would product an inferior user interface though.

Caching
At the moment, refreshing the page just extracts the entire search out of the DB again. Perhaps I can cache parts of the app to reduce on DB hits. This seems messy though because not much of the data I'm dealing with is static.

Multiple Requests
It is technically bad to serve the page without embedding the JSON into the page but perhaps the user will feel like things are happening faster if I load the page unpopulated and then fetch the data.

Indexes
I should make sure that I have indexes on all my foreign keys. I should also try to think about places where it would help to have indexes (such as favourites?) and add them.

Move Method Calls into DB
Perhaps I can cache some of the results of the iteration I do in my view layer into the DB and just pull them out instead of computing them. Or I could sync things on write rather than on read.

Question
Does anyone have any suggestions as to what I should be spending my time on?

This is a hard question to answer without being able to see the actual user interface, but I would focus on loading exactly only as much data as is required to display the initial interface. For example, if the user has to drill down to see some of the data you're presenting, then you can load that data on demand, rather than loading it as part of the initial payload. You mention that a search can have as many as 100 "concepts," maybe you don't need to fetch all of those concepts initially?

Bottom line, it doesn't sound like your issue is really on the client side -- it sounds like your server-side code is slowing things down, so I'd explore what you can do to fetch less data, or to defer the complex queries until they are definitely required.

Is it faster to query a List<T> or database?

5 votes

I have recently had several situations where I need different data from the same table. One example is where I would loop through each "delivery driver" and generate a printable PDF file for each customer they are to deliver to.

In this situation, I pulled all customers and stored them into

List<Customer> AllCustomersList = customers.GetAllCustomers();

As I looped through the delivery drivers, I'd do something like this:

List<Customer> DeliveryCustomers = AllCustomersList.Where(a => a.DeliveryDriverID == DriverID);

My question: Is the way I'm doing it by querying the List object faster than querying the database each time for customer records associated with the delivery driver?

There isn't an accurate number for amount of rows that if you pass it you should query the DB instead in in-memory List<T>

But the rule of thumb is, DB are designed to work with large amount of data and they have optimization "mechanisms" while in in-memory there aren't such things.

So you will need to benchmark it to see if the round-trip to DB is worth it for that amount of rows for each time it's important to you

"We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil"

How to merge >1000 xml files into one in Java

4 votes

I am trying to merge many xml files into one. I have successfully done that in DOM, but this solution is limited to a few files. When I run it on multiple files >1000 I am getting a java.lang.OutOfMemoryError.

What I want to achieve is where i have the following files

file 1:

<root>
....
</root>

file 2:

<root>
......
</root>

file n:

<root>
....
</root>

resulting in: output:

<rootSet>
<root>
....
</root>
<root>
....
</root>
<root>
....
</root>
</rootSet>

This is my current implementation:

    DocumentBuilderFactory docFactory = DocumentBuilderFactory.newInstance();
    DocumentBuilder docBuilder = docFactory.newDocumentBuilder();
    Document doc = docBuilder.newDocument();
    Element rootSetElement = doc.createElement("rootSet");
    Node rootSetNode = doc.appendChild(rootSetElement);
    Element creationElement = doc.createElement("creationDate");
    rootSetNode.appendChild(creationElement);
    creationElement.setTextContent(dateString); 
    File dir = new File("/tmp/rootFiles");
    String[] files = dir.list();
    if (files == null) {
        System.out.println("No roots to merge!");
    } else {
        Document rootDocument;
            for (int i=0; i<files.length; i++) {
                       File filename = new File(dir+"/"+files[i]);        
               rootDocument = docBuilder.parse(filename);
               Node tempDoc = doc.importNode((Node) Document.getElementsByTagName("root").item(0), true);
               rootSetNode.appendChild(tempDoc);
        }
    }   

I have experimented a lot with xslt, sax, but I seem to keep missing something. Any help would be highly appreciated

You might also consider using StAX. Here's code that would do what you want:

import java.io.File;
import java.io.FileWriter;
import java.io.Writer;

import javax.xml.stream.XMLEventFactory;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.events.XMLEvent;
import javax.xml.transform.stream.StreamSource;

public class XMLConcat {
    public static void main(String[] args) throws Throwable {
        File dir = new File("/tmp/rootFiles");
        File[] rootFiles = dir.listFiles();

        Writer outputWriter = new FileWriter("/tmp/mergedFile.xml");
        XMLOutputFactory xmlOutFactory = XMLOutputFactory.newFactory();
        XMLEventWriter xmlEventWriter = xmlOutFactory.createXMLEventWriter(outputWriter);
        XMLEventFactory xmlEventFactory = XMLEventFactory.newFactory();

        xmlEventWriter.add(xmlEventFactory.createStartDocument());
        xmlEventWriter.add(xmlEventFactory.createStartElement("", null, "rootSet"));

        XMLInputFactory xmlInFactory = XMLInputFactory.newFactory();
        for (File rootFile : rootFiles) {
            XMLEventReader xmlEventReader = xmlInFactory.createXMLEventReader(new StreamSource(rootFile));
            XMLEvent event = xmlEventReader.nextEvent();
            // Skip ahead in the input to the opening document element
            while (event.getEventType() != XMLEvent.START_ELEMENT) {
                event = xmlEventReader.nextEvent();
            }

            do {
                xmlEventWriter.add(event);
                event = xmlEventReader.nextEvent();
            } while (event.getEventType() != XMLEvent.END_DOCUMENT);
            xmlEventReader.close();
        }

        xmlEventWriter.add(xmlEventFactory.createEndElement("", null, "rootSet"));
        xmlEventWriter.add(xmlEventFactory.createEndDocument());

        xmlEventWriter.close();
        outputWriter.close();
    }
}

One minor caveat is that this API seems to mess with empty tags, changing <foo/> into <foo></foo>.