Best performance questions in August 2010

Optimize me! (C, performance) -- followup to bit-twiddling question

25 votes

Thanks to some very helpful stackOverflow users at Bit twiddling: which bit is set?, I have constructed my function (posted at the end of the question).

Any suggestions -- even small suggestions -- would be appreciated. Hopefully it will make my code better, but at the least it should teach me something. :)

Overview

This function will be called at least 1013 times, and possibly as often as 1015. That is, this code will run for months in all likelihood, so any performance tips would be helpful.

This function accounts for 72-77% of the program's time, based on profiling and about a dozen runs in different configurations (optimizing certain parameters not relevant here).

At the moment the function runs in an average of 50 clocks. I'm not sure how much this can be improved, but I'd be thrilled to see it run in 30.

Key Observation

If at some point in the calculation you can tell that the value that will be returned will be small (exact value negotiable -- say, below a million) you can abort early. I'm only interested in large values.

This is how I hope to save the most time, rather than by further micro-optimizations (though these are of course welcome as well!).

Performance Information

  • smallprimes is a bit array (64 bits); on average about 8 bits will be set, but it could be as few as 0 or as many as 12.
  • q will usually be nonzero. (Notice that the function exits early if q and smallprimes are zero.)
  • r and s will often be 0. If q is zero, r and s will be too; if r is zero, s will be too.
  • As the comment at the end says, nu is usually 1 by the end, so I have an efficient special case for it.
  • The calculations below the special case may appear to risk overflow, but through appropriate modeling I have proved that, for my input, this will not occur -- so don't worry about that case.
  • Functions not defined here (ugcd, minuu, star, etc.) have already been optimized; none take long to run. pr is a small array (all in L1). Also, all functions called here are pure functions.
  • But if you really care... ugcd is the gcd, minuu is the minimum, vals is the number of trailing binary 0s, __builtin_ffs is the location of the leftmost binary 1, star is (n-1) >> vals(n-1), pr is an array of the primes from 2 to 313.
  • The calculations are currently being done on a Phenom II 920 x4, though optimizations for i7 or Woodcrest are still of interest (if I get compute time on other nodes).
  • I would be happy to answer any questions you have about the function or its constituents.

What it actually does

Added in response to a request. You don't need to read this part.

The input is an odd number n with 1 < n < 4282250400097. The other inputs provide the factorization of the number in this particular sense:

smallprimes&1 is set if the number is divisible by 3, smallprimes&2 is set if the number is divisible by 5, smallprimes&4 is set if the number is divisible by 7, smallprimes&8 is set if the number is divisible by 11, etc. up to the most significant bit which represents 313. A number divisible by the square of a prime is not represented differently from a number divisible by just that number. (In fact, multiples of squares can be discarded; in the preprocessing stage in another function multiples of squares of primes <= lim have smallprimes and q set to 0 so they will be dropped, where the optimal value of lim is determined by experimentation.)

q, r, and s represent larger factors of the number. Any remaining factor (which may be greater than the square root of the number, or if s is nonzero may even be less) can be found by dividing factors out from n.

Once all the factors are recovered in this way, the number of bases, 1 <= b < n, to which n is a strong pseudoprime are counted using a mathematical formula best explained by the code.

Improvements so far

  • Pushed the early exit test up. This clearly saves work so I made the change.
  • The appropriate functions are already inline, so __attribute__ ((inline)) does nothing. Oddly, marking the main function bases and some of the helpers with __attribute ((hot)) hurt performance by almost 2% and I can't figure out why (but it's reproducible with over 20 tests). So I didn't make that change. Likewise, __attribute__ ((const)), at best, did not help. I was more than slightly surprised by this.

Code

ulong bases(ulong smallprimes, ulong n, ulong q, ulong r, ulong s)
{
    if (!smallprimes & !q)
        return 0;

    ulong f = __builtin_popcountll(smallprimes) + (q > 1) + (r > 1) + (s > 1);
    ulong nu = 0xFFFF;  // "Infinity" for the purpose of minimum
    ulong nn = star(n);
    ulong prod = 1;

    while (smallprimes) {
        ulong bit = smallprimes & (-smallprimes);
        ulong p = pr[__builtin_ffsll(bit)];
        nu = minuu(nu, vals(p - 1));
        prod *= ugcd(nn, star(p));
        n /= p;
        while (n % p == 0)
            n /= p;
        smallprimes ^= bit;
    }
    if (q) {
        nu = minuu(nu, vals(q - 1));
        prod *= ugcd(nn, star(q));
        n /= q;
        while (n % q == 0)
            n /= q;
    } else {
        goto BASES_END;
    }
    if (r) {
        nu = minuu(nu, vals(r - 1));
        prod *= ugcd(nn, star(r));
        n /= r;
        while (n % r == 0)
            n /= r;
    } else {
        goto BASES_END;
    }
    if (s) {
        nu = minuu(nu, vals(s - 1));
        prod *= ugcd(nn, star(s));
        n /= s;
        while (n % s == 0)
            n /= s;
    }

    BASES_END:
    if (n > 1) {
        nu = minuu(nu, vals(n - 1));
        prod *= ugcd(nn, star(n));
        f++;
    }

    // This happens ~88% of the time in my tests, so special-case it.
    if (nu == 1)
        return prod << 1;

    ulong tmp = f * nu;
    long fac = 1 << tmp;
    fac = (fac - 1) / ((1 << f) - 1) + 1;
    return fac * prod;
}

You seem to be wasting much time doing divisions by the factors. It is much faster to replace a division with a multiplication by the reciprocal of divisor (division: ~15-80(!) cycles, depending on the divisor, multiplication: ~4 cycles), IF of course you can precompute the reciprocals.

While this seems unlikely to be possible with q, r, s - due to the range of those vars, it is very easy to do with p, which always comes from the small, static pr[] array. Precompute the reciprocals of those primes and store them in another array. Then, instead of dividing by p, multiply by the reciprocal taken from the second array. (Or make a single array of structs.)

Now, obtaining exact division result by this method requires some trickery to compensate for rounding errors. You will find the gory details of this technique in this document, on page 138.

EDIT:

After consulting Hacker's Delight (an excellent book, BTW) on the subject, it seems that you can make it even faster by exploiting the fact that all divisions in your code are exact (i.e. remainder is zero).

It seems that for every divisor d which is odd and base B = 2word_size, there exists a unique multiplicative inverse d⃰ which satisfies the conditions: d⃰ < B and d·d⃰ ≡ 1 (mod B). For every x which is an exact multiple of d, this implies x/d ≡ x·d⃰ (mod B). Which means you can simply replace a division with a multiplication, no added corrections, checks, rounding problems, whatever. (The proofs of these theorems can be found in the book.) Note that this multiplicative inverse need not be equal to the reciprocal as defined by the previous method!

How to check whether a given x is an exact multiple of d - i.e. x mod d = 0 ? Easy! x mod d = 0 iff x·d⃰ mod B ≤ ⌊(B-1)/d⌋. Note that this upper limit can be precomputed.

So, in code:

unsigned x, d;
unsigned inv_d = mulinv(d);          //precompute this!
unsigned limit = (unsigned)-1 / d;   //precompute this!

unsigned q = x*inv_d;
if(q <= limit)
{
   //x % d == 0
   //q == x/d
} else {
   //x % d != 0
   //q is garbage
}

Assuming the pr[] array becomes an array of struct prime:

struct prime {
   ulong p;
   ulong inv_p;  //equal to mulinv(p)
   ulong limit;  //equal to (ulong)-1 / p
}

the while(smallprimes) loop in your code becomes:

while (smallprimes) {
    ulong bit = smallprimes & (-smallprimes);
    int bit_ix = __builtin_ffsll(bit);
    ulong p = pr[bit_ix].p;
    ulong inv_p = pr[bit_ix].inv_p;
    ulong limit = pr[bit_ix].limit;
    nu = minuu(nu, vals(p - 1));
    prod *= ugcd(nn, star(p));
    n *= inv_p;
    for(;;) {
        ulong q = n * inv_p;
        if (q > limit)
            break;
        n = q;
    }
    smallprimes ^= bit;
}

And for the mulinv() function:

ulong mulinv(ulong d) //d needs to be odd
{
   ulong x = d;
   for(;;)
   {
      ulong tmp = d * x;
      if(tmp == 1)
         return x;
      x *= 2 - tmp;
   }
}

Note you can replace ulong with any other unsigned type - just use the same type consistently.

The proofs, whys and hows are all available in the book. A heartily recommended read :-).

Java compile speed vs Scala compile speed

22 votes

I've been programming in Scala for a while and I like it but one thing I'm annoyed by is the time it takes to compile programs. It's seems like a small thing but with Java I could make small changes to my program, click the run button in netbeans, and BOOM, it's running, and over time compiling in scala seems to consume a lot of time. I hear that with many large projects a scripting language becomes very important because of the time compiling takes, a need that I didn't see arising when I was using Java.

But I'm coming from Java which as I understand it, is faster than any other compiled language, and is fast because of the reasons I switched to Scala(It's a very simple language).

So I wanted to ask, can I make Scala compile faster and will scalac ever be as fast as javac.

The Scala compiler is more sophisticated than Java's, providing type inference, implicit conversion, and a much more powerful type system. These features don't come for free, so I wouldn't expect scalac to ever be as fast as javac. This reflects a trade-off between the programmer doing the work and the compiler doing the work.

That said, compile times have already improved noticeably going from Scala 2.7 to Scala 2.8, and I expect the improvements to continue now that the dust has settled on 2.8. This page documents some of the ongoing efforts and ideas to improve the performance of the Scala compiler.

Is there a more efficient way to get the length of a 32bit integer in bytes?

14 votes

I'd like a shortcut for the following little function, where performance is very important (the function is called more than 10.000.000 times):

inline int len(uint32 val)
{
    if(val <= 0x000000ff) return 1;
    if(val <= 0x0000ffff) return 2;
    if(val <= 0x00ffffff) return 3;
    return 4;
} 

Does anyone have any idea... a cool bitoperation trick? Thanks for your help in advance!

How about this one?

inline int len(uint32 val)
{
    return 4
        - ((val & 0xff000000) == 0)
        - ((val & 0xffff0000) == 0)
        - ((val & 0xffffff00) == 0)
    ;
}

Removing the inline keyword, g++ -O2 compiles this to the following branchless code:

movl    8(%ebp), %edx
movl    %edx, %eax
andl    $-16777216, %eax
cmpl    $1, %eax
sbbl    %eax, %eax
addl    $4, %eax
xorl    %ecx, %ecx
testl   $-65536, %edx
sete    %cl
subl    %ecx, %eax
andl    $-256, %edx
sete    %dl
movzbl  %dl, %edx
subl    %edx, %eax

If you don't mind machine-specific solutions, you can use the bsr instruction which searches for the first 1 bit. Then you simply divide by 8 to convert bits to bytes and add 1 to shift the range 0..3 to 1..4:

int len(uint32 val)
{
    asm("mov 8(%ebp), %eax");
    asm("or  $255, %eax");
    asm("bsr %eax, %eax");
    asm("shr $3, %eax");
    asm("inc %eax");
    asm("mov %eax, 8(%ebp)");
    return val;
}

Note that I am not an inline assembly god, so maybe there's a better to solution to access val instead of addressing the stack explicitly. But you should get the basic idea.

The GNU compiler also has an interesting built-in function called __builtin_clz:

inline int len(uint32 val)
{
    return ((__builtin_clz(val | 255) ^ 31) >> 3) + 1;
}

This looks much better than the inline assembly version to me :)

Performance differences between Python and C

11 votes

Working on different projects I have the choice of selecting different programming languages, as long as the task is done.

I was wondering what the real difference is, in terms of performance, between writing a program in Python, versus doing it in C.

The tasks to be done are pretty varied, e.g. sorting textfiles, disk access, network access, textfile parsing.

Is there really a noticeable difference between sorting a textfile using the same algorithm in C versus Python, for example?

And in your experience, given the power of current CPU's (i7), is it really a noticeable difference (Consider that its a program that doesnt bring the system to its knees).

Thanks! :)

Use python until you have a performance problem. If you ever have one figure out what the problem is (often it isn't what you would have guessed up front). Then solve that specific performance problem which will likely be an algorithm or data structure change. In the rare case that your problem really needs C then you can write just that portion in C and use it from your python code.

10 votes

This is a duplicate from UI.StackExchange.com:
http://ui.stackexchange.com/questions/1004/mixing-percent-and-fixed-css

Should you ever apply percentage and fixed CSS together? Will it cause problems, and if so what kinds?

  • Does mixing degrade browser render performance?
  • Will mixing give you weird results on initial load with progressive rendering browsers?

Below is just a dumbed-down example of mixed usage, it could be any mixture. I am not looking for validation of the example. I have heard you should never do what I have in the example below, so I am trying to find out if using CSS in this manner is an issue.

Example mix usage:

<style>
.container
{
    width:300px;
}
.cell
{
    width:25%;
}
</style>

<table class="container">
     <tr>
        <td class="cell"><td>
        <td class="cell"><td>
        <td class="cell"><td>
        <td class="cell"><td>
     </tr>
</table>

+1 Good question. You may want to have a look at this article: "Fixed-width, liquid, and elastic layout" It goes over fixed width layout (em) and elastic layouts (%), and if you click to go to the next page it looks at 'Elastic-liquid hybrid' - where width: is set one way, with max-width: set the other. I know the article linked to above isn't exactly what you asked, but it's an example of mixed use within a single CSS style.


Edit: After some further reading I did find a quite a few contradictory opinions on the subject. I found several articles that held the idea that "you just can’t mix up pixels and percentages". Though, for the most part, these sites were fairly dated. When I narrowed the search to only articles that have been put up within the past year, things changed a bit. There were still a few opinions against mixing, but they typically didn't explain why, and seemed to of the "I always heard it was a bad idea" variety. The majority of more recent information that I've found on the topic seems to indicate that mixing percentage with fixed widths is a perfectly acceptable practice, as long as it's done with an understanding of the results.

see:

Full Disclosure: I've been a mixer for many years, without really knowing whether my approach was 'correct.'

Detect how much stress the mysql database is currently under with PHP

8 votes

I am developing a large web app and want it to alter itself dependent on a factor that relates to the stress the database is currently under.

I am not sure what would me most accurate/effective/easiest. I am considering maybe the number of current connections or server response time or CPU useage?

What would be best suited and possible?

Thanks

Interesting question. What you REALLY want is a way for PHP to ask the mySQL server two questions:

server, are you using almost all your cpu capacity?

server, are you using almost all your disk IO capacity?

Based on the answers, I suppose you want to simplify the work your PHP web app does ... perhaps by eliminating some kind of search capability, or caching some data more aggressively.

If you have a shell to your (linux or bsd) mysql server, your two questions can be answered by eyeballing the output from these two commands.

   sar -u 1 10     # the %idle column tells you about unused cpu cycles
   sar -d 1 10     # the %util column tells you which disks are busy and how busy.

But, there's no sweet little query which fetches this data from mySQL to your app.

Edit: one possibility is to write a little PERL hack or other simple program that runs on your server, connects to the local data base, and once every so often (once a minute, maybe) determines %idle and %util, and updates a little one-row table in your data base. You could, without too much trouble, also add stuff like how full your disks are, to this table, if you care. Then your PHP app can query this table. This is an ideal use of the MEMORY access method. At any rate, keep it simple: you don't want your monitoring to weigh down your server.

A second-best trick, that you CAN do from your client.

Issue the command SHOW PROCESSLIST FULL, count the number of rows (mySQL processes) for which the Command is "Query", and if you have a lot of them consider it to be a high workload.

You might also add up the Time values for the processes which have status Query, and use a high value of that time as a threshold.

EDIT: if you're running on a mySQL 5 server, and your server account has access to the mySql-furnished information_schema, you can use a query directly to get the process data I mentioned:

SELECT (COUNT(*)-1) P.QUERYCOUNT, SUM(P.TIME) QUERYTIME
FROM information_schema.PROCESSLIST P
WHERE P.COMMAND = 'Query'

COUNT(*) - 1: because the above query itself counts as a query.

You will need to fiddle with the threshold values to make this work right in production.

It's a good idea to have your PHP web app shed load when the data base server can't keep up. Still, a better idea is to identify your long-running queries and optimize them.

How to check which stored procedure is taking maximum time in sql server

6 votes

I want to know what are various methods by which I can monitor which of my procedure, query is taking more time on various components(CPU cycle, scan time etc.) than already set threshold value.

I want it to be logged as well. Whenever uses my site and calling some procedure. I want to make a log of all procedures crossing my threshold.

Is it possible to do it with sql queries or procedures. Do we have some procedures for this. Any sql tools or any external tool, can be paid(with trial) or free. I want to try them on my database.

You should be able to do this using Dynamic Management Views (DMVs) in particular you are probably going to be most interested in the exec_query_stats view which maintains execution statistics on all queries (CPU time, Physical / Logical reads etc...) grouped by execution plan.

Also see this excellent article which includes a sample query for viewing plan statistics, and goes into a lot more detail on the subject:

Finally, if you want to trace / record excessively long running queries, then you might want to consider leaving an SQL server profiler trace running at all times, with a filter on execution time set to some high figure (e.g. > 1000 ms). You can either use the SQL server profiler windows application, or you can create the trace using T-SQL have have it log to a table in the database instead:

This has the benefit of telling you exactly what query took exactly how long, when and what the parameters to that query were (holy SQL Batman!)

The performance implications of running this trace on loaded databases is in fact very small - I know of surprisingly critial applications which have these traces running as a matter of routine in order to be able to quickly diagnose performance issues (and it does help a lot). The key is in choosing a "large" execution time which is large enough to not swamp the log, yet small enough to pick up enough long running queries to be useful.

Another trick that has been used in the past when having performance issues was to leave an unfiltered SQL server trace running for a short period of time (1 min or so) on a loaded SQL server (it really does have surprisingly little effect, you just get swamped with logs)

I also heartily recommend the Microsoft SQL Server internals books on this subject - it is very technical, however its brilliant because it covers not only these sorts of diagnosis tools, but also what they actually mean

is it faster to check a Date for Null or a bit for = 1/0

6 votes

I'm just wondering what is it faster in sql

can have a column of Date and to check it for null or to have a Date and a bit and to check the bit for 1/0

is the bit going to be faster ?

In order to check that a column IS NULL SQL Server would actually just check a bit anyway. There is a NULL BITMAP stored for each row indicating whether each column contains a NULL or not.

Loadtesting ajax applications based on jQuery

5 votes

Does anyone have any experience in load testing ajax applications? specifically running jQuery as their javascript library?

Loadrunner and Neoload is the two of the load testing applications that works on the gui layer. But none of whom supports the jquery library.

As for the other load testing tools, like jmeter, grinder and other http layer tools. they just don't cut it as you need to maintain application logic in your tests.

Selenium however uses real browsers which combined with selenium grid could be an option (also look www.browsermob.com), but then again installing a testing environment locally requires a lot of hardware.

Last time I looked, both LoadRunner and NeoLoad work at the HTTP layer, not the GUI layer. Just because a tool works at the HTTP layer does not mean that you need to replicate application logic in the tests. Smarter tools, like Load Tester (from webperformance.com), can configure many testcases automagically with little user intervention. We use it frequently with AJAX sites, though I've not looked specifically at JQuery use cases.