Best performance questions in September 2011

Does the order of LINQ functions matter?

50 votes

Basically, as the question states... does the order of LINQ functions matter in terms of performance? Obviously the results would have to be identical still...

Example:

myCollection.OrderBy(item => item.CreatedDate).Where(item => item.Code > 3);
myCollection.Where(item => item.Code > 3).OrderBy(item => item.CreatedDate);

Both return me the same results, but are in a different LINQ order. I realize that reordering some items will result in different results, and I'm not concerned about those. What my main concern is in knowing if, in getting the same results, ordering can impact performance. And, not just on the 2 LINQ calls I made (OrderBy, Where), but on any LINQ calls.

It will depend on the LINQ provider in use. For LINQ to Objects, that could certainly make a huge difference. Assume we've actually got:

var query = myCollection.OrderBy(item => item.CreatedDate)
                        .Where(item => item.Code > 3);

var result = query.Last();

That requires the whole collection to be sorted and then filtered. If we had a million items, only one of which had a code greater than 3, we'd be wasting a lot of time ordering results which would be thrown away.

Compare that with the reversed operation, filtering first:

var query = myCollection.Where(item => item.Code > 3)
                        .OrderBy(item => item.CreatedDate);

var result = query.Last();

This time we're only ordering the filtered results, which in the sample case of "just a single item matching the filter" will be a lot more efficient - both in time and space.

It also could make a difference in whether the query executes correctly or not. Consider:

var query = myCollection.Where(item => item.Code != 0)
                        .OrderBy(item => 10 / item.Code);

var result = query.Last();

That's fine - we know we'll never be dividing by 0. But if we perform the ordering before the filtering, the query will throw an exception.

33 votes

I've recently seen two really nice and educating languages talks:

This first one by Herb Sutter, presents all the nice and cool features of C++0x, why C++'s future seems brighter than ever, and how M$ is said to be a good guy in this game. The talk revolves around efficiency and how minimizing heap activity very often improves performance.

This other one, by Andrei Alexandrescu, motivates a transition from C/C++ to his new game-changer D. Most of D's stuff seems really well motivated and designed. One thing, however, surprised me, namely that D pushes for garbage collection and that all classes are created solely by reference. Even more confusing, the book The D Programming Language Ref Manual specifically in the section about Resource Management states the following, quote:

Garbage collection eliminates the tedious, error prone memory allocation tracking code necessary in C and C++. This not only means much faster development time and lower maintenance costs, but the resulting program frequently runs faster!

This conflicts with Sutter's constant talk about minimizing heap activity. I strongly respect both Sutter's and Alexandrescou's insights, so I feel a bit confused about these two key questions

  1. Doesn't creating class instances solely by reference result in a lot of unnecesseary heap activity?

  2. In which cases can we use Garbage Collection without sacrificing run-time performance?

To directly answer your two questions:

  1. Yes, creating class instances by reference does result in a lot of heap activity, but:

    a. In D, you have struct as well as class. A struct has value semantics and can do everything a class can, except polymorphism.

    b. Polymorphism and value semantics have never worked well together due to the slicing problem.

    c. In D, if you really need to allocate a class instance on the stack in some performance-critical code and don't care about the loss of safety, you can do so without unreasonable hassle via the scoped function.

  2. GC can be comparable to or faster than manual memory management if:

    a. You still allocate on the stack where possible (as you typically do in D) instead of relying on the heap for everything (as you often do in other GC'd languages).

    b. You have a top-of-the-line garbage collector (D's current GC implementation is admittedly somewhat naive, though it has seen some major optimizations in the past few releases, so it's not as bad as it was).

    c. You're allocating mostly small objects. If you allocate mostly large arrays and performance ends up being a problem, you may want to switch a few of these to the C heap (you have access to C's malloc and free in D) or, if it has a scoped lifetime, some other allocator like RegionAllocator. (RegionAllocator is currently being discussed and refined for eventual inclusion in D's standard library).

    d. You don't care that much about space efficiency. If you make the GC run too frequently to keep the memory footprint ultra-low, performance will suffer.

Preserve code readability while optimising

15 votes

I am writing a scientific program in Python and C with some complex physical simulation algorithms. After implementing algorithm, I found that there are a lot of possible optimizations to improve performance. Common ones are precalculating values, getting calculations out of cycle, replacing simple matrix algorithms with more complex and other. But there arises a problem. Unoptimized algorithm is much slower, but its logic and connection with theory look much clearer and readable. Also, it's harder to extend and modify optimized algorithm.

So, the question is - what techniques should I use to keep readability while improving performance? Now I am trying to keep both fast and clear branches and develop them in parallel, but maybe there are better methods?

Just as a general remark (I'm not too familiar with Python): I would suggest you make sure that you can easily exchange the slow parts of the 'reference implementation' with the 'optimized' parts (e.g., use something like the Strategy pattern).

This will allow you to cross-validate the results of the more sophisticated algorithms (to ensure you did not mess up the results), and will keep the overall structure of the simulation algorithm clear (separation of concerns). You can place the optimized algorithms into separate source files / folders / packages and document them separately, in as much detail as necessary.

Apart from this, try to avoid the usual traps: don't do premature optimization (check if it is actually worth it, e.g. with a profiler), and don't re-invent the wheel (look for available libraries).

Speed of comparing to null vs undefined in JavaScript

14 votes

I have just run a very simple JavaScript performance test (don't ask why). The test declares a variable, but doesn't assign anything to it:

var x;

It then compares the speed of comparing the value variable to null, and to undefined, in other words:

var y = (x == null); and var y = (x == undefined);.

I was expecting the comparison with undefined to be the fasted. In fact it was nowhere near. The comparison with null was far and away the fastest, around 80% faster.

The results I've described above come from running the tests in Chrome (version 13). Running them in Firefox produces results far closer to what I would have expected (the comparison with undefined is faster than with null, albeit very marginally).

So, my question is what could the cause of this be? Why does Chrome seem to favour the comparison with null so greatly?

For quick reference, here's a screenshot of the results:

enter image description here

null is a reserved keyword which cannot be overriden, so when you are doing a comparison against null, all you have to do is a single comparison.

However, when you are checking against undefined, the engine must do a type lookup and than a comparison, meaning that it is actually slightly more demanding.


If you need to actually check to see if something is undefined, you should use

if(typeof notSet == "undefined"){ }

Proof

Try it... and set something to null in your JavaScript console.

null = "will error";
// Errors with --> ReferenceError: invalid assignment left-hand side

However, if you try and do it with undefined, it won't error. That is not to say that you can override undefined, because you can't, but that undefined is it's own primitive type.

The only real similarity between null and undefined, is that they can both be coerced into a boolean false.

Optimal performing query for latest record for each N

9 votes

Here is the scenario I find myself in.

I have a reasonably big table that I need to query the latest records from. Here is the create for the essential columns for the query:

CREATE TABLE [dbo].[ChannelValue](
   [ID] [bigint] IDENTITY(1,1) NOT NULL,
   [UpdateRecord] [bit] NOT NULL,
   [VehicleID] [int] NOT NULL,
   [UnitID] [int] NOT NULL,
   [RecordInsert] [datetime] NOT NULL,
   [TimeStamp] [datetime] NOT NULL
   ) ON [PRIMARY]
GO

The ID column is a Primary Key and there is a non-Clustered index on VehicleID and TimeStamp

CREATE NONCLUSTERED INDEX [IX_ChannelValue_TimeStamp_VehicleID] ON [dbo].[ChannelValue] 
(
    [TimeStamp] ASC,
    [VehicleID] ASC
)ON [PRIMARY]
GO

The table I'm working on to optimise my query is a little over 23 million rows and is only a 10th of the sizes the query needs to operate against.

I need to return the latest row for each VehicleID.

I've been looking through the responses to this question here on StackOverflow and I've done a fair bit of Googling and there seem to be 3 or 4 common ways of doing this on SQL Server 2005 and upwards.

So far the fastest method I've found is the following query:

SELECT cv.*
FROM ChannelValue cv
WHERE cv.TimeStamp = (
SELECT
    MAX(TimeStamp)
FROM ChannelValue
WHERE ChannelValue.VehicleID = cv.VehicleID
)

With the current amount of data in the table it takes about 6s to execute which is within reasonable limits but with the amount of data the table will contain in the live environment the query begins to perform too slow.

Looking at the execution plan my concern is around what SQL Server is doing to return the rows.

I cannot post the execution plan image because my Reputation isn't high enough but the index scan is parsing every single row within the table which is slowing the query down so much.

Execution Plan

I've tried rewriting the query with several different methods including using the SQL 2005 Partition method like this:

WITH cte
AS (
    SELECT *,
    ROW_NUMBER() OVER(PARTITION BY VehicleID ORDER BY TimeStamp DESC) AS seq
     FROM ChannelValue
)

SELECT
   VehicleID,
   TimeStamp,
   Col1
FROM cte
WHERE seq = 1

But the performance of that query is even worse by quite a large magnitude.

I've tried re-structuring the query like this but the result speed and query execution plan is nearly identical:

SELECT cv.*
FROM (
   SELECT VehicleID
    ,MAX(TimeStamp) AS [TimeStamp]
   FROM ChannelValue
   GROUP BY VehicleID
) AS [q]
INNER JOIN ChannelValue cv
   ON cv.VehicleID = q.VehicleID
   AND cv.TimeStamp = q.TimeStamp

I have some flexibility available to me around the table structure (although to a limited degree) so I can add indexes, indexed views and so forth or even additional tables to the database.

I would greatly appreciate any help at all here.

Edit Added the link to the execution plan image.

Depends on your data (how many rows are there per group?) and your indexes.

See Optimizing TOP N Per Group Queries for some performance comparisons of 3 approaches.

In your case with millions of rows for only a small number of Vehicles I would add an index on VehicleID, Timestamp and do

SELECT CA.*
FROM   Vehicles V
       CROSS APPLY (SELECT TOP 1 *
                    FROM   ChannelValue CV
                    WHERE  CV.VehicleID = V.VehicleID
                    ORDER  BY TimeStamp DESC) CA  

Should i resize the bitmap before adding to a ImageView or let the ImageView resize the Bitmap?

7 votes

i have a simple question: Should i resize a bigger bitmap before adding to a ImageView or let the ImageView resize the Bitmap? What's the right way, regarding performance?

Thanks

Consider using scale for ImageView, and don't bother about resizing. You can scale an image like this, for example:

image.setAdjustViewBounds(true);
image.setMaxHeight(50);
image.setMaxWidth(50);
image.setScaleType(ImageView.ScaleType.CENTER_INSIDE);

What runs faster in Ruby: defining the alias method or using alias_method?

7 votes

What is faster on later invocation:

def first_method?() second_method?() end

or

alias_method :first method, :second_method

and if possible why?

(NOTE: I don't ask what is nicer / better etc. -> only raw speed and why it is faster is interesting here)

At least in Ruby 1.8.6, aliasing seems to be faster:

#!/usr/local/bin/ruby

require 'benchmark'

$global_bool = true

class Object 
  def first_method?
    $global_bool
  end

  def second_method?
    first_method?
  end 

  alias_method :third_method?, :first_method?
end

Benchmark.bm(7) do |x|
  x.report("first:")  { 1000000.times { first_method?  }}
  x.report("second:") { 1000000.times { second_method? }}
  x.report("third:")  { 1000000.times { third_method?  }}
end

results in :

$ ./test.rb
             user     system      total        real
first:   0.281000   0.000000   0.281000 (  0.282000)
second:  0.469000   0.000000   0.469000 (  0.468000)
third:   0.281000   0.000000   0.281000 (  0.282000)

Obviously, you have one method call less (look-up receiver ...). So it seems natural for it to be faster.

SQL why is SELECT COUNT(*) , MIN(col), MAX(col) faster then SELECT MIN(col), MAX(col)

7 votes

We're seeing a huge difference between these queries.

The slow query

SELECT MIN(col) AS Firstdate, MAX(col) AS Lastdate 
FROM table WHERE status = 'OK' AND fk = 4193

Table 'table'. Scan count 2, logical reads 2458969, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times: CPU time = 1966 ms, elapsed time = 1955 ms.

The fast query

SELECT count(*), MIN(col) AS Firstdate, MAX(col) AS Lastdate 
FROM table WHERE status = 'OK' AND fk = 4193

Table 'table'. Scan count 1, logical reads 5803, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times: CPU time = 0 ms, elapsed time = 9 ms.

Question

What is the reason between the huge performance difference between the queries?

Update A little update based on questions given as comments:

The order of execution or repeated execution changes nothing performance wise. There are no extra parameters used and the (test)database is not doing anything else during execution.

Slow query

|--Nested Loops(Inner Join)
 |--Stream Aggregate(DEFINE:([Expr1003]=MIN([DBTest].[dbo].[table].[startdate])))
   |    |--Top(TOP EXPRESSION:((1)))
   |         |--Nested Loops(Inner Join, OUTER REFERENCES:([DBTest].[dbo].[table].[id], [Expr1008]) WITH ORDERED PREFETCH)
   |              |--Index Scan(OBJECT:([DBTest].[dbo].[table].[startdate]), ORDERED FORWARD)
   |              |--Clustered Index Seek(OBJECT:([DBTest].[dbo].[table].[PK_table]), SEEK:([DBTest].[dbo].[table].[id]=[DBTest].[dbo].[table].[id]),  WHERE:([DBTest].[dbo].[table].[FK]=(5806) AND [DBTest].[dbo].[table].[status]<>'A') LOOKUP ORDERED FORWARD)
   |--Stream Aggregate(DEFINE:([Expr1004]=MAX([DBTest].[dbo].[table].[startdate])))
        |--Top(TOP EXPRESSION:((1)))
             |--Nested Loops(Inner Join, OUTER REFERENCES:([DBTest].[dbo].[table].[id], [Expr1009]) WITH ORDERED PREFETCH)
                  |--Index Scan(OBJECT:([DBTest].[dbo].[table].[startdate]), ORDERED BACKWARD)
                  |--Clustered Index Seek(OBJECT:([DBTest].[dbo].[table].[PK_table]), SEEK:([DBTest].[dbo].[table].[id]=[DBTest].[dbo].[table].[id]),  WHERE:([DBTest].[dbo].[table].[FK]=(5806) AND [DBTest].[dbo].[table].[status]<>'A') LOOKUP ORDERED FORWARD)

Fast query

 |--Compute Scalar(DEFINE:([Expr1003]=CONVERT_IMPLICIT(int,[Expr1012],0)))
   |--Stream Aggregate(DEFINE:([Expr1012]=Count(*), [Expr1004]=MIN([DBTest].[dbo].[table].[startdate]), [Expr1005]=MAX([DBTest].[dbo].[table].[startdate])))
        |--Nested Loops(Inner Join, OUTER REFERENCES:([DBTest].[dbo].[table].[id], [Expr1011]) WITH UNORDERED PREFETCH)
             |--Index Seek(OBJECT:([DBTest].[dbo].[table].[FK]), SEEK:([DBTest].[dbo].[table].[FK]=(5806)) ORDERED FORWARD)
             |--Clustered Index Seek(OBJECT:([DBTest].[dbo].[table].[PK_table]), SEEK:([DBTest].[dbo].[table].[id]=[DBTest].[dbo].[table].[id]),  WHERE:([DBTest].[dbo].[table].[status]<'A' OR [DBTest].[dbo].[table].[status]>'A') LOOKUP ORDERED FORWARD)

The execution plan from SSMS

Answer

The answer given below by Martin Smith seems to explain the problem. The super short version is that the MS-SQL query-analyser wrongly uses a query plan in the slow query which causes a complete table scan.

Adding a Count(*), the query hint with(FORCESCAN) or a combined index on the startdate,FK and status columns fixes the performance issue.

There are 810,064 rows in the table.

You have the query

SELECT COUNT(*),
       MIN(startdate) AS Firstdate,
       MAX(startdate) AS Lastdate
FROM   table
WHERE  status <> 'A'
       AND fk = 4193 

1,893 (0.23%) rows meet the fk = 4193 predicate, and of those two fail the status <> 'A' part so overall 1,891 match and need to be aggregated.

You also have two indexes neither of which cover the whole query.

For your fast query it uses an index on fk to directly find rows where fk = 4193 then needs to do 1,893 key lookups to find each row in the clustered index to check the status predicate and retrieve the startdate for aggregation.

When you remove the COUNT(*) from the SELECT list SQL Server no longer has to process every qualifying row. As a result it considers another option.

You have an index on startdate so it could start scanning that from the beginning, doing key lookups back to the base table and as soon as it finds the first matching row stop as it has found the MIN(startdate), Similarly the MAX can be found with another scan starting the other end of the index and working backwards.

SQL Server estimates that each of these scans will end up processing 590 rows before they hit upon one that matches the predicate. Giving 1,180 total lookups vs 1,893 so it chooses this plan.

I'm not sure exactly what formula it uses to arrive at the 590 figure but from earlier numbers it can be seen that a row plucked at random has a 99.77% chance of not matching the where condition, so if you pick 590 random rows the chances of all of them not matching is approx 0.9977 ^ 590 or only around 25%

Unfortunately the 1,891 rows that meet the predicate are not randomly distributed with respect to startdate. In fact they are all condensed into a single 8,205 row segment towards the end of the index meaning that the scan to get to the MIN(startdate) ends up doing 801,859 key lookups before it can stop.

This can be reproduced below.

CREATE TABLE T
(
id int identity(1,1) primary key,
startdate datetime,
fk int,
[status] char(1),
Filler char(2000)
)

CREATE NONCLUSTERED INDEX ix ON T(startdate)

INSERT INTO T
SELECT TOP 810064 Getdate() - 1,
                  4192,
                  'B',
                  ''
FROM   sys.all_columns c1,
       sys.all_columns c2  


UPDATE T 
SET fk = 4193, startdate = GETDATE()
WHERE id BETWEEN 801859 and 803748 or id = 810064

UPDATE T 
SET  startdate = GETDATE() + 1
WHERE id > 810064


/*Both queries give the same plan. 
UPDATE STATISTICS T WITH FULLSCAN
makes no difference*/

SELECT MIN(startdate) AS Firstdate, 
       MAX(startdate) AS Lastdate 
FROM T
WHERE status <> 'A' AND fk = 4192


SELECT MIN(startdate) AS Firstdate, 
       MAX(startdate) AS Lastdate 
FROM T
WHERE status <> 'A' AND fk = 4193

You could consider using query hints to force the plan to use the index on fk rather than startdate or add the suggested missing index highlighted in the execution plan on (fk,status) INCLUDE (startdate) to avoid this issue.

Addition

Regarding the estimated rows and the 590 figure I've been doing some experiments where SQL Server has perfect statistics and it seems the formula it uses is actually very straight forward and is simply table_size / estimated_number_of_rows_that_match.

My repro above has estimated 428.379 rows and this is simply 810064 / 1891.

The one exception to the above seems to be that it will never come up with an estimated rows figure greater than 90% of the table even if the statistics indicate that no rows will match at all.

Your plan's exact figure was 589.643 which indicates then that it only expected 1,374 rows overall to match so again there might be a slight correlation where rows with fk = 4193 are somewhat more likely to also meet the status <> 'A' criteria than the average.

Why are annotations under Android such a performance problem?

6 votes

I'm the lead author of ORMLite which uses Java annotations on classes to build database schemas. A big startup performance problem for our package turns out to be the calling of annotation methods under Android 1.6. I see the same behavior if not worse up through 3.0.

We are seeing that the following simple annotation code is incredibly GC intensive and a real performance problem. 1000 calls to an annotation method takes almost a second on a fast box. The same code under Java can do 28 million (sic) calls in the same time. We have an annotation that has 25 methods in it and we'd like to do more than 50 of these a second.

Does anyone know why this is happening and if there is any work around? There are certainly things that ORMLite can do in terms of caching this information but is there anything that we can do to "fix" annotations under Android? Thanks.

public void testAndroidAnnotations() throws Exception {
    Field field = Foo.class.getDeclaredField("field");
    MyAnnotation myAnnotation = field.getAnnotation(MyAnnotation.class);
    long before = System.currentTimeMillis();
    for (int i = 0; i < 1000; i++)
        myAnnotation.foo();
    Log.i("test", "in " + (System.currentTimeMillis() - before) + "ms");
}
@Target(FIELD) @Retention(RUNTIME)
private static @interface MyAnnotation {
    String foo();
}
private static class Foo {
    @MyAnnotation(foo = "bar")
    String field;
}

This results in the following log output:

I/TestRunner(  895): started: testAndroidAnnotations
D/dalvikvm(  895): GC freed 6567 objects / 476320 bytes in 85ms
D/dalvikvm(  895): GC freed 8951 objects / 599944 bytes in 71ms
D/dalvikvm(  895): GC freed 7721 objects / 524576 bytes in 68ms
D/dalvikvm(  895): GC freed 7709 objects / 523448 bytes in 73ms
I/test    (  895): in 854ms

EDIT:

After @candrews pointed me in the right direction, I did some poking around the code. The performance problem looks to be caused by some terrible, gross code in Method.equals(). It is calling the toString() of both methods and then comparing them. Each toString() use StringBuilder with a bunch of append methods without a good initializing size. Actually doing the .equals by comparing fields would be 100+ times faster.

Google has acknowledged the issue and fixed it "post-Honeycomb"

https://code.google.com/p/android/issues/detail?id=7811

So at least they know about it and have supposedly fixed it for some future version.

Is divide slower than Multiply?

5 votes

Ok, this might sound like a strange question but it is an interesting one. I am coding for iOS and have been told that it is always best to multiply rather than divide values as it is faster.

I know that processors these days probably make this a non issue but my curiosity has gotten the better of me and I am wondering if anyone might be able to shed some light on this for me.

SO..... My question is this -
is:

player.position = ccp(player.contentSize.width / 2, winSize.height / 2);

slower than:

player.position = ccp(player.contentSize.width * 0.5, winSize.height * 0.5);

On most processors division is slower than multiplication for the same data types. In your example your multiplication is a floating point operation, if width and height are integer types, the result may be very different and may depend on both your processor and your compiler.

However most compilers (certainly GCC) will translate a division by a constant power-of-two as in your example, to a right-shift where that would be more efficient. That would generally be faster than either a multiply or divide.