## C code loop performance

I have a multiply-add kernel inside my application and I want to increase its performance.

I use an Intel Core i7-960 (3.2 GHz clock) and have already manually implemented the kernel using SSE intrinsics as follows:

``````
for(int i=0; i<iterations; i+=4) {
    y1 = _mm_set_ss(output[i]);
    y2 = _mm_set_ss(output[i+1]);
    y3 = _mm_set_ss(output[i+2]);
    y4 = _mm_set_ss(output[i+3]);

    for(k=0; k<ksize; k++){
        for(l=0; l<ksize; l++){
            w  = _mm_set_ss(weight[i+k+l]);

            x1 = _mm_set_ss(input[i+k+l]);
            …
            x4 = _mm_set_ss(input[i+k+l+3]);
        }
    }
    _mm_store_ss(&output[i],y1);
    _mm_store_ss(&output[i+1],y2);
    _mm_store_ss(&output[i+2],y3);
    _mm_store_ss(&output[i+3],y4);
}
``````

I know I can use packed FP vectors to increase the performance, and I already did so successfully, but I want to know why the scalar code isn't able to meet the processor's peak performance.

The performance of this kernel on my machine is ~1.6 FP operations per cycle, while the maximum would be 2 FP operations per cycle (since FP add + FP mul can be executed in parallel).
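In numbers, using the clock speed and the add/mul issue limit quoted above, the gap looks like this (a quick Python check, nothing measured here):

```python
# Peak vs. measured scalar throughput for the figures given above.
clock_ghz = 3.2              # Core i7-960
peak_ops_per_cycle = 2.0     # 1 FP add + 1 FP mul issued in parallel
measured_ops_per_cycle = 1.6

peak_gflops = clock_ghz * peak_ops_per_cycle          # 6.4 scalar GFLOP/s
measured_gflops = clock_ghz * measured_ops_per_cycle  # ~5.12 GFLOP/s
efficiency = measured_ops_per_cycle / peak_ops_per_cycle
print(efficiency)  # 0.8
```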

If I'm reading the generated assembly correctly, the ideal schedule would look as follows: the `mov` instruction takes 3 cycles, the switch latency from the load domain to the FP domain for dependent instructions takes 2 cycles, the FP multiply takes 4 cycles, and the FP add takes 3 cycles. (Note that the dependence from the multiply to the add doesn't incur any switch latency, because the operations belong to the same domain.)

According to the measured performance (~80% of the maximum theoretical performance) there is an overhead of ~3 instructions per 8 cycles.

I am trying to either:

• get rid of this overhead, or
• explain where it comes from

Of course there is the problem of cache misses and data misalignment, which can increase the latency of the move instructions, but are there any other factors that could play a role here, like register read stalls?

I hope my problem is clear, thanks in advance for your responses!

Update: The assembly of the inner-loop looks as follows:

``````
...
Block 21:
    movssl  (%rsi,%rdi,4), %xmm4
    movssl  (%rcx,%rdi,4), %xmm0
    movssl  0x4(%rcx,%rdi,4), %xmm1
    movssl  0x8(%rcx,%rdi,4), %xmm2
    movssl  0xc(%rcx,%rdi,4), %xmm3
    inc %rdi
    mulss %xmm4, %xmm0
    cmp $0x32, %rdi
    mulss %xmm4, %xmm1
    mulss %xmm4, %xmm2
    mulss %xmm3, %xmm4
    jl 0x401b52 <Block 21>
...
``````

I noticed in the comments that:

• The loop takes 5 cycles to execute.
• It's "supposed" to take 4 cycles (since there are 4 adds and 4 multiplies).

However, your assembly shows 5 SSE `movssl` instructions. According to Agner Fog's tables, all floating-point SSE move instructions have a reciprocal throughput of at least 1 cycle on Nehalem — that is, the processor can issue at most one of them per cycle.

Since you have 5 of them, you can't do better than 5 cycles/iteration.
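That load bound lines up exactly with the measured number. A back-of-the-envelope check in Python (the 4 `addss` instructions are assumed to sit in the part of the loop elided from the listing above):

```python
# Load-bound throughput estimate for the inner loop shown above.
loads_per_iter = 5    # 5 movssl instructions per iteration
load_ports = 1        # Nehalem sustains at most one load per cycle
fp_ops_per_iter = 8   # 4 multiplies plus the 4 adds (assumed elided)

min_cycles = loads_per_iter / load_ports
print(fp_ops_per_iter / min_cycles)  # 1.6 -- exactly the measured FP ops/cycle
```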

So in order to get to peak performance, you need to reduce the number of loads that you have. How you can do that I can't see immediately in this particular case, but it might be possible.

One common approach is tiling, where you add nesting levels to improve locality. Although it's mostly used to improve cache access, it can also be applied at the register level to reduce the number of loads/stores that are needed.

Ultimately, your goal is to reduce the number of loads to less than the number of adds/muls. So this might be the way to go.
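To make the payoff of register blocking concrete, here is a Python sketch for a hypothetical 1-D multiply-add kernel; it only counts array reads, nothing is timed:

```python
def naive_loads(n_out, ksize):
    # one weight load + one input load per (output, tap) pair
    return n_out * ksize * 2

def blocked_loads(n_out, ksize, block=4):
    # one weight load is shared by all `block` outputs of a tile,
    # as in the kernel above; each output still needs its own input load
    return (n_out // block) * ksize * (1 + block)

n_out, ksize = 1024, 9   # made-up sizes for illustration
print(naive_loads(n_out, ksize), blocked_loads(n_out, ksize))
# blocking by 4 cuts the loads from 2 per multiply-add to 5 per 4 of them
```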

## Fast calculation of min, max, and average of incoming numbers

Program is receiving approximately 50,000 numbers every second.

At ANY given moment, I need to calculate the minimum, maximum, and average of the values that arrived in the last second (relative to that moment).

Is there a way to do this without using array or list (buffer) to store arriving numbers and to calculate results?

If I need to use buffer, what would be the efficient way to achieve this?

(Note that numbers from buffer also must be efficiently removed from time to time)

Here is an algorithm that works reasonably efficiently in certain cases:

1. As events come in, buffer them completely, and calculate a running `sum`, `count`, `min`, `max` (trivial).

2. When a request for `average`, `min`, or `max` is made, loop through from the back of the buffer and start removing values older than one second. Subtract from `sum` and `count` as you go.

• If the removed values are all above `min`, you can keep your `min`; if they are all below `max`, you can keep your `max`. In this scenario, you have `average`, `min`, and `max` updated efficiently.

• If a removed value equals `min` or `max`, you'll need to loop through the rest of the array and recalculate it.

3. Also do step two once a second or so, so that the buffer doesn't get too full. This could be performed on every buffer insert as well, or wherever it makes sense.

The best structure for this kind of work is a circular buffer, to avoid memory allocations and the GC getting in the way. It should be large enough to cover the worst-case message volume per second.
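As an alternative to rescanning when the oldest value was the min or max, a pair of monotonic deques keeps both extremes current in amortized O(1) per event. A Python sketch of the idea (class and field names are mine, not from the answer above):

```python
from collections import deque

class WindowStats:
    """Sliding one-second window of (timestamp, value) pairs.

    min/max come from monotonic deques (amortized O(1) per event);
    the average comes from a running sum and count.
    """
    def __init__(self, window=1.0):
        self.window = window
        self.buf = deque()    # (t, x) in arrival order
        self.mins = deque()   # values kept non-decreasing
        self.maxs = deque()   # values kept non-increasing
        self.total = 0.0

    def add(self, t, x):
        self.buf.append((t, x))
        self.total += x
        while self.mins and self.mins[-1][1] > x:
            self.mins.pop()   # x is a better minimum candidate
        self.mins.append((t, x))
        while self.maxs and self.maxs[-1][1] < x:
            self.maxs.pop()   # x is a better maximum candidate
        self.maxs.append((t, x))
        self._evict(t)

    def _evict(self, now):
        cutoff = now - self.window
        while self.buf and self.buf[0][0] <= cutoff:
            _, x = self.buf.popleft()
            self.total -= x
        while self.mins and self.mins[0][0] <= cutoff:
            self.mins.popleft()
        while self.maxs and self.maxs[0][0] <= cutoff:
            self.maxs.popleft()

    def stats(self, now):
        """Return (min, max, average) for the last second, or None."""
        self._evict(now)
        if not self.buf:
            return None
        return (self.mins[0][1], self.maxs[0][1], self.total / len(self.buf))
```

Each value is pushed and popped at most once from each deque, so the cost per event stays constant even at 50,000 events per second.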

@ja72 provided a great idea to save on finding the min and max values if they are invalidated:

Instead of keeping the min/max values `x_min` and `x_max`, keep the indices `i_min` and `i_max` of where they are located in the `x[i]` array. Then finding them is often trivial, but when the value dropping out of the window holds the min or max, the entire list needs to be scanned to establish the new limits.

Sam Holder had another good idea in the comments: keep a parallel array that is always sorted. This lets you lop numbers off the top or bottom to find new minimums and maximums more easily. However, insert speed is compromised a little (the array needs to remain in order).
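Sam Holder's idea can be sketched with Python's `bisect` module (a hypothetical wrapper; it assumes the caller tells it which value just aged out of the window):

```python
import bisect

class SortedWindow:
    """Sorted parallel structure: inserts pay O(n) shifting cost,
    but min and max are always sitting at the two ends."""
    def __init__(self):
        self.sorted_vals = []

    def insert(self, x):
        bisect.insort(self.sorted_vals, x)  # keeps the list ordered

    def remove(self, x):
        # called when x ages out of the one-second window
        i = bisect.bisect_left(self.sorted_vals, x)
        del self.sorted_vals[i]

    def minimum(self):
        return self.sorted_vals[0]

    def maximum(self):
        return self.sorted_vals[-1]
```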

Ultimately, the right choice will depend on the usage characteristics of the program: how often will values be read versus how often they are inserted?

## Why is creating a class in Python so much slower than instantiating a class?

I found that creation of a class is way slower than instantiation of a class.

``````
>>> from timeit import Timer as T
>>> def calc(n):
...     return T("class Haha(object): pass").timeit(n)

<<after several of these 'calc' calls, at least one of them with a big number, e.g. 100000>>

>>> calc(9000)
15.947055101394653
>>> calc(9000)
17.39099097251892
>>> calc(9000)
18.824054956436157
>>> calc(9000)
20.33335590362549
``````

Yeah, creating 9000 classes took 16 seconds, and it becomes even slower in subsequent calls.

And this:

``````
>>> T("type('Haha', b, d)", "b = (object, ); d = {}").timeit(9000)
``````

gives similar results.

But instantiation doesn't suffer:

``````
>>> T("Haha()", "class Haha(object): pass").timeit(5000000)
0.8786070346832275
``````

5,000,000 instances in less than a second.

What makes class creation this expensive?

And why does creation become progressively slower?

EDIT:

How to reproduce:

Start a fresh Python process. The initial several calc(10000) calls take about 0.5 seconds on my machine. Then try a bigger value: calc(100000) can't finish in even 10 seconds. Interrupt it, and a subsequent calc(10000) takes 15 seconds.

EDIT:

If you call gc.collect() after 'calc' becomes slow, you get the 'normal' speed back at the beginning, but the timing will increase again in subsequent calls:

``````
>>> from a import calc
>>> calc(10000)
0.4673938751220703
>>> calc(10000)
0.4300072193145752
>>> calc(10000)
0.4270968437194824
>>> calc(10000)
0.42754602432250977
>>> calc(10000)
0.4344758987426758
>>> calc(100000)
^CTraceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "a.py", line 3, in calc
    return T("class Haha(object): pass").timeit(n)
  File "/usr/lib/python2.7/timeit.py", line 194, in timeit
    timing = self.inner(it, self.timer)
  File "<timeit-src>", line 6, in inner
KeyboardInterrupt
>>> import gc
>>> gc.collect()
234204
>>> calc(10000)
0.4237039089202881
>>> calc(10000)
1.5998330116271973
>>> calc(10000)
4.136359930038452
>>> calc(10000)
6.625348806381226
``````

This might give you the intuition:

``````
>>> class Haha(object): pass
...
>>> sys.getsizeof(Haha)
904
>>> sys.getsizeof(Haha())
64
``````

A class object is a much more complex and expensive structure than an instance of that class.
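The gc.collect() observations above hint that the cyclic garbage collector is involved: class objects sit in reference cycles, so only the collector reclaims them, and its passes slow down as surviving objects pile up. A sketch to test that hypothesis; if the collector is the culprit, the timings should stay roughly flat while it is disabled:

```python
import gc
from timeit import Timer

def calc(n):
    # same micro-benchmark as in the question
    return Timer("class Haha(object): pass").timeit(n)

gc.disable()
try:
    # repeated runs; expected to stay roughly constant run-to-run
    times = [calc(10000) for _ in range(4)]
finally:
    gc.enable()

print(times)
```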

## Can someone explain the performance behavior of the following memory-allocating C program?

On my machine Time A and Time B swap depending on whether `A` is defined or not (which changes the order in which the two `calloc`s are called).

I initially attributed this to the paging system. Weirdly, when `mmap` is used instead of `calloc`, the situation is even more bizarre: both loops take the same amount of time, as expected. As can be seen with `strace`, the `calloc`s ultimately result in two `mmap`s, so there is no return-already-allocated-memory magic going on.

I'm running Debian testing on an Intel i7.

``````
#include <stdlib.h>
#include <stdio.h>
#include <sys/mman.h>

#include <time.h>

#define SIZE 500002816

#ifndef USE_MMAP
#define ALLOC calloc
#else
#define ALLOC(a, b) (mmap(NULL, a * b, PROT_READ | PROT_WRITE,  \
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0))
#endif

int main() {
    clock_t start, finish;
#ifdef A
    int *arr1 = ALLOC(sizeof(int), SIZE);
    int *arr2 = ALLOC(sizeof(int), SIZE);
#else
    int *arr2 = ALLOC(sizeof(int), SIZE);
    int *arr1 = ALLOC(sizeof(int), SIZE);
#endif
    int i;

    start = clock();
    {
        for (i = 0; i < SIZE; i++)
            arr1[i] = (i + 13) * 5;
    }
    finish = clock();

    printf("Time A: %.2f\n", ((double)(finish - start))/CLOCKS_PER_SEC);

    start = clock();
    {
        for (i = 0; i < SIZE; i++)
            arr2[i] = (i + 13) * 5;
    }
    finish = clock();

    printf("Time B: %.2f\n", ((double)(finish - start))/CLOCKS_PER_SEC);

    return 0;
}
``````

The output I get:

``````
~/directory $ cc -Wall -O3 bench-loop.c -o bench-loop
~/directory $ ./bench-loop
Time A: 0.94
Time B: 0.34
~/directory $ cc -DA -Wall -O3 bench-loop.c -o bench-loop
~/directory $ ./bench-loop
Time A: 0.34
Time B: 0.90
~/directory $ cc -DUSE_MMAP -DA -Wall -O3 bench-loop.c -o bench-loop
~/directory $ ./bench-loop
Time A: 0.89
Time B: 0.90
~/directory $ cc -DUSE_MMAP -Wall -O3 bench-loop.c -o bench-loop
~/directory $ ./bench-loop
Time A: 0.91
Time B: 0.92
``````

The first time `calloc` is called, it explicitly zeroes out the memory. The next time it is called, it assumes that the memory returned from `mmap` is already zeroed.

Details

Here are some of the things that I checked to come to this conclusion, which you could try yourself if you wanted:

1. Insert a `calloc` call before your first `ALLOC` call. You will see that after this, Time A and Time B are the same.

2. Use the `clock()` function to check how long each of the `ALLOC` calls take. In the case where they are both using `calloc` you will see that the first call takes much longer than the second one.

3. Use `time` to measure the execution time of the `calloc` version and the `USE_MMAP` version. When I did this I saw that the execution time for `USE_MMAP` was consistently slightly less.

4. I ran with `strace -tt -T` which shows both the time of when the system call was made and how long it took. Here is part of the output:

Strace output:

``````
21:29:06.127536 mmap(NULL, 2000015360, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fff806fd000 <0.000014>
21:29:07.778442 mmap(NULL, 2000015360, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fff093a0000 <0.000021>
21:29:07.778563 times({tms_utime=63, tms_stime=102, tms_cutime=0, tms_cstime=0}) = 4324241005 <0.000011>
``````

You can see that the first `mmap` call took `0.000014` seconds, but that about `1.5` seconds elapsed before the next system call. Then the second `mmap` call took `0.000021` seconds, and was followed by the `times` call about a hundred microseconds later.

I also stepped through part of the application execution with `gdb` and saw that the first call to `calloc` resulted in numerous calls to `memset` while the second call to `calloc` did not make any calls to `memset`. You can see the source code for `calloc` here (look for `__libc_calloc`) if you are interested. As for why `calloc` is doing the `memset` on the first call but not subsequent ones I don't know. But I feel fairly confident that this explains the behavior you have asked about.

As for why the array that was zeroed by `memset` shows better performance, my guess is that the values were loaded into the TLB rather than the cache, since it is a very large array. Regardless, the specific reason for the performance difference you asked about is that the two `calloc` calls behave differently when they are executed.
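The first-touch cost is easy to observe directly. A Python sketch using an anonymous `mmap` (the size is arbitrary; the first pass has to fault every page in, the second finds them already resident):

```python
import mmap
import time

SIZE = 64 * 1024 * 1024          # 64 MiB, an arbitrary size
PAGE = mmap.PAGESIZE

buf = mmap.mmap(-1, SIZE)        # anonymous mapping, like MAP_ANONYMOUS

def touch_every_page(m):
    start = time.perf_counter()
    for off in range(0, SIZE, PAGE):
        m[off] = 1               # one write per page
    return time.perf_counter() - start

first = touch_every_page(buf)    # includes the page-fault/zeroing cost
second = touch_every_page(buf)   # pages already mapped in
print(first, second)             # on most systems the first pass is slower
buf.close()
```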

## Mysql help needed to optimize group by sub query

Hi, I seem to be a little stuck. It's quite a straightforward query.

If I run the queries separately it is not that slow, but when I combine them it's very slow.

I'm not sure how to optimise it. Any help would be much appreciated. I'm basically only wanting to show multiple refunds, i.e. where `faultid` exists more than once.

``````
SELECT r.*
FROM faultrefunds_v2 r
WHERE r.id IN
(
    SELECT r1.id
    FROM faultrefunds_v2 r1
    GROUP BY faultid
    HAVING count(r1.faultid) > 1
);
``````

The results from EXPLAIN have been attached as an image.

I guess this qualifies as a rewrite rather than an optimisation, but this is what I would try instead, anyway:

``````
SELECT r.*
FROM faultrefunds_v2 r
WHERE EXISTS (
    SELECT *
    FROM faultrefunds_v2 r1
    WHERE r1.faultid = r.faultid
      AND r1.id <> r.id
);
``````
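One quick way to sanity-check that the `EXISTS` form returns exactly the rows whose `faultid` occurs more than once is to run it against made-up data, here using in-memory SQLite as a stand-in for MySQL:

```python
import sqlite3

# Hypothetical sample data: faultids 100 and 300 have multiple refunds.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE faultrefunds_v2 (id INTEGER PRIMARY KEY, faultid INTEGER)")
con.executemany("INSERT INTO faultrefunds_v2 (id, faultid) VALUES (?, ?)",
                [(1, 100), (2, 100), (3, 200), (4, 300), (5, 300), (6, 300)])

rows = con.execute("""
    SELECT r.*
    FROM faultrefunds_v2 r
    WHERE EXISTS (
        SELECT 1 FROM faultrefunds_v2 r1
        WHERE r1.faultid = r.faultid AND r1.id <> r.id
    )
    ORDER BY r.id
""").fetchall()
print(rows)  # [(1, 100), (2, 100), (4, 300), (5, 300), (6, 300)]
```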

## PhoneGap 1.4 wrapping Sencha Touch 2.X - What about performance?

I'm building a multiplatform tablet app wrapping it with Phonegap 1.4 using just its webview, then I work my magic with the Sencha Touch 2 framework. By multiplatform I mean iOS 5.X+ and Android 3.0+ (for now).

This app is working great so far; all its features work on both systems. But on the Android tablet (Samsung Galaxy Tab) it's really slow. What's happening? Can I do something about it, or is it just Android's limit?

Thanks

*EDIT* (I'm trying to make this post somewhat useful to the community)

Sencha Touch, like many other JavaScript frameworks, is not the best example of performance, due to JavaScript itself.

Then why use Sencha Touch?

• In my case: multiplatform (iOS, Android, Windows Phone, BlackBerry, Windows, Mac OS X, Linux, sharing 80-90% of the code)

## Mitigating performance issues due to lack of visual pre-processing in Android systems

1. Heavy CSS3 visual processing:

• Avoid transformations and animations
2. Good MVC practices:

• Don't use more views than you are actually showing
• Pre-render / pre-fetch data when possible, to avoid rendering and data processing simultaneously

Same here. I've tested many of my Sencha Touch 2 applications on the Samsung Galaxy Tab and the performance is really terrible. There's a fact (which may be part of the actual reason) that iOS does a lot of pre-processing and calculation before rendering, to make things seem smoother to the user, while Android tends to render and process simultaneously, on the go.

In general, it can be said that for any cross-platform mobile app built on JavaScript, like Sencha Touch, iOS performance is significantly better than Android's. However, the Sencha Touch dev team is trying their best to improve this; hopefully it will get better in the next releases. You can see this article about iOS & Android device performance comparison:

http://www.sencha.com/blog/sencha-touch-2-developer-preview/

PS: While it's largely down to the OS's limits, you can also optimize your app to make it perform better on Android devices. In my experience, the best practices are:

• Do NOT use CSS3 too much.
• Keep your DOM as minimal as possible.

Hope it helps.

## Why does jQuery not provide a .firstChild method?

I have seen plenty of discussion regarding the fastest way to select a first child element using jQuery. As can be expected, the native DOM firstChild property is much faster than using a jQuery selector or combination of selectors -- see http://jsperf.com/jquery-first-child-selection-performance/6. This is not usually a problem -- either it's being used in a place where performance is not a big deal, or it's easy enough to just access the DOM element and use its .firstChild property. However, there are a couple problems with this:

• firstChild could return a text or comment node, rather than an element, as a jQuery selector would return
• If I need to select the first child of multiple elements, I must either use a slow selector or go through a lot of extra work: iterating over DOM elements, adding them to a collection, then putting them back into a jQuery object.

It seems to me that the cost of adding a firstChild method to the core jQuery library would be far smaller than the benefits. I took my own shot at creating such a method for my own use:

``````
$.fn.firstChild = function() {
    var ret = [];

    this.each(function() {
        var el = this.firstChild;

        // the DOM firstChild property could return a text node or comment instead of an element
        while (el && el.nodeType != 1)
            el = el.nextSibling;

        if (el) ret.push(el);
    });

    // maintain jQuery chaining and end() functionality
    return this.pushStack(ret);
};
``````

In the tests I created at http://jsperf.com/jquery-multiple-first-child-selection, this function performs more than five times faster than any other option. The tests are based on the ones mentioned above, but select the first children of multiple elements rather than a single element.

Is there something I am missing? A technique that I should be using? Or is this an issue that one should never worry about? Is there a reason not to include a function like this in jQuery?

"Why does jQuery not provide a .firstChild method?"

Feature creep, most likely.

It can be accomplished with other methods as you stated, and if performance is a concern, you can extend jQuery for that specific need as you've done.

You can improve performance in your code a little more...

``````
$.fn.firstChild = function () {
    var ret = [];
    // use a for loop
    for (var i = 0, len = this.length; i < len; i++) {
        var this_el = this[i],
            el = this_el.firstElementChild; // try firstElementChild first
        if (!el) {
            el = this_el.firstChild;
            while (el && el.nodeType != 1)
                el = el.nextSibling;
        }
        if (el) ret.push(el);
    }
    // maintain jQuery chaining and end() functionality
    return this.pushStack(ret);
};
``````

## Execution Time Slower with each Iteration of the same SPROC

Running the same Stored Procedure from a C# .NET application over a network gets progressively slower with each subsequent execution: it appears to take twice as long as the previous execution (up to a max value; read on). The execution time becomes progressively slower until 1 of 2 scenarios happens, at which point the first execution of the SPROC is "fast" again.

1. If an `SqlConnection` is opened and remains open during all testing, the SPROC gets progressively slower until any other SPROC or query is run.
2. If an `SqlConnection` is opened and closed around each execution, the SPROC gets progressively slower until at least 8 minutes have passed.

This only happens with a few Stored Procedures. One is a simple `SELECT` query with 2 `JOIN`s (SPROC 1); another is a massive 1600+ line SPROC (SPROC 2).

The execution times appear to never go beyond exactly 60 seconds for SPROC 1 and 67 seconds for SPROC 2. SPROC 1 takes less than a second to execute initially, and SPROC 2 takes 7 seconds initially.

This also only happens if the SPROC is run using the same `SqlConnection` in the application. As soon as 2 separate `SqlConnection` objects are used, they behave the same as stated above, but are independent. Running the SPROC multiple times on `SqlConnection1` gets progressively slower, but the first time the same SPROC is run on `SqlConnection2`, it's "fast". It will then also get slower when run multiple times on `SqlConnection2`.

This does not happen if the application is run on the same computer with SQL Server 2008 R2 installed (running Windows Server 2008). The execution time is always consistent.

Running the Stored Procedure from within Management Studio also does not get slower with each execution; it is always consistent.

Clearing the execution plan cache (in SQL Server) has no effect on the observed behavior.

It has taken quite a few days to narrow down this issue originally observed in a much larger application, in order to create a test app to easily test/reproduce it.

From what I've read here, there is a timeout of between 4 and 8 minutes (after `SqlConnection.Close()` is called in code) at which point it closes the database connection to the data source. This appears to be in line with the scenario 2 I mentioned above.

This leads me to believe it is related to the `SqlConnection` used (and the underlying database connection to the data source) since connection pooling is enabled in my case, but why am I observing this behavior, and how do I fix it?

We are using the .Net 2.0 Framework, if that makes any difference.

There are many fine details listed above, so please let me know if I need to clarify anything.

The only Stack Overflow question with any similarities is this one, but it was unrelated to my issue.

Edit: The following code is executed in my WinForms test app on startup:

``````
SqlConnectionStringBuilder connectionStringBuilder = new SqlConnectionStringBuilder();

connectionStringBuilder.DataSource = m_DataSource;
connectionStringBuilder.InitialCatalog = m_InitialCatalog;
connectionStringBuilder.UserID = m_UserID;
connectionStringBuilder.IntegratedSecurity = false;
connectionString = connectionStringBuilder.ConnectionString;

m_DatabaseConnection = new SqlConnection(connectionString);
``````

I then have 2 buttons; one of which calls SPROC 1 mentioned above, and the other calls a different SPROC which does not have the same slowdown issue. The following code is executed on either button click (only difference being the SPROC name):

``````
m_DatabaseConnection.Open();
m_DatabaseCommand = new SqlCommand("GetCompanies", m_DatabaseConnection);
m_DatabaseCommand.CommandType = CommandType.StoredProcedure;
m_DatabaseCommand.CommandTimeout = 0;

DataSet databaseDataSet = new DataSet();
// execute the SPROC and pull its results into the DataSet
using (SqlDataAdapter adapter = new SqlDataAdapter(m_DatabaseCommand))
{
    adapter.Fill(databaseDataSet);
}
m_DatabaseConnection.Close();
``````

Here are my ideas to debug this problem:

• Try calling SqlConnection.ClearAllPools() after disposing the connection. If this fixes the problem, the problem is tied to a particular connection for sure.
• Next, enclose the SPROC in an explicit transaction.
• Next, call SqlConnection.ClearAllPools() before invoking the SPROC.
• How much data does the SPROC return?
• Please post the C# code you are using to open the connection and execute the SPROC.
• Create a standalone console app that reproduces the behavior you are seeing. This will (likely) prove that something in your app is the problem, because the console app will run just fine.