Best performance questions in July 2012

Writing a binary file in C++ very fast

81 votes

I'm trying to write huge amounts of data onto my SSD(solid state drive). And by huge amounts I mean 80GB.

I browsed the web for solutions, but the best I came up with was this:

#include <fstream>
using namespace std;
const unsigned long long size = 64ULL*1024ULL*1024ULL;
unsigned long long a[size];
int main()
{
    fstream myfile;
    myfile = fstream("file.binary", ios::out | ios::binary);
    //Here would be some error handling
    for(int i = 0; i < 32; ++i){
        //Some calculations to fill a[]
        myfile.write((char*)&a,size*sizeof(unsigned long long));
    }
    myfile.close();
}

Compiled with Visual Studio 2010 and full optimizations and run under Windows7 this program maxes out around 20MB/s. What really bothers me is that Windows can copy files from an other SSD to this SSD at somewhere between 150MB/s and 200MB/s. So at least 7 times faster. That's why I think I should be able to go faster.

Any ideas how I can speed up my writing?

This did the job:

#include <stdio.h>
const unsigned long long size = 8ULL*1024ULL*1024ULL;
unsigned long long a[size];

int main()
{
    FILE* pFile;
    pFile = fopen("file.binary", "wb");
    for (unsigned long long j = 0; j < 1024; ++j){
        //Some calculations to fill a[]
        fwrite(a, 1, size*sizeof(unsigned long long), pFile);
    }
    fclose(pFile);
    return 0;
}

I just timed 8GB in 36sec, which is about 220MB/s and I think that maxes out my SSD. Also worth to note, the code at the beginning of my post used one core 100%, whereas this code only uses 2-5%.

Thanks a lot to everyone.

Remove first line from a file

22 votes

Possible Duplicate:
Removing the first line of a text file in C#

What would be the fastest and smartest way to remove the first line from a huge (think 2-3 GB) file?

  • I think, that you probably can't avoid rewriting the whole file chunk-by-chunk, but I might be wrong.

  • Could using memory-mapped files somehow help to solve this issue?

  • Is it possible to achieve this behavior by operating directly on the file system (NTFS, for example) - say, update the corresponding inode data and change the file starting sector, so that the first line is ignored? If yes, would this approach be really fragile or there are many other applications, except the OS itself that do something similiar?

NTFS by default on most volumes (but importantly not all!) stores data in 4096 byte chunks. These are referenced by the $MFT record, which you cannot edit directly because it's disallowed by the Operating System (for reasons of sanity). As a result, there is no trick available to operate on the filesystem to do something approaching what you want (in other words, you cannot directly reverse truncate a file on NTFS, even in filesystem chunk sized amounts.)

Because of the way files are stored in a filesystem, the only answer is that you must rewrite the entire file directly. Or figure out a different way to store your data. a 2-3GB file is massive and crazy, especially considering you referred to lines meaning that this data is at least in part text information.

You should look into putting this data into a database perhaps? Or organizing it a bit more efficiently at the very least.

List indexing efficiency (python 2 vs python 3)

20 votes

In answering another question, I suggested to use timeit to test the difference between indexing a list with positive integers vs. negative integers. Here's the code:

import timeit
t=timeit.timeit('mylist[99]',setup='mylist=list(range(100))',number=10000000)
print (t)
t=timeit.timeit('mylist[-1]',setup='mylist=list(range(100))',number=10000000)
print (t)

I ran this code with python 2.6:

$ python2.6 test.py
0.587687015533
0.586369991302

Then I ran it with python 3.2:

$ python3.2 test.py
0.9212150573730469
1.0225799083709717

Then I scratched my head, did a little google searching and decided to post these observations here.

Operating system: OS-X (10.5.8) -- Intel Core2Duo

That seems like a pretty significant difference to me (a factor of over 1.5 difference). Does anyone have an idea why python3 is so much slower -- especially for such a common operation?

EDIT

I've run the same code on my Ubuntu Linux desktop (Intel i7) and achieved comparable results with python2.6 and python 3.2. It seems that this is an issue which is operating system (or processor) dependent (Other users are seeing the same behavior on Linux machines -- See comments).

EDIT 2

The startup banner was requested in one of the answers, so here goes:

Python 2.6.4 (r264:75821M, Oct 27 2009, 19:48:32) 
[GCC 4.0.1 (Apple Inc. build 5493)] on darwin

and:

Python 3.2 (r32:88452, Feb 20 2011, 10:19:59) 
[GCC 4.0.1 (Apple Inc. build 5493)] on darwin

UPDATE

I've just installed fresh versions of python2.7.3 and python3.2.3 from http://www.python.org/download/

In both cases, I took the

"Python x.x.3 Mac OS X 32-bit i386/PPC Installer (for Mac OS X 10.3 through 10.6 [2])"

since I am on OS X 10.5. Here are the new timings (which are reasonably consistent through multiple trials):

python 2.7

$python2.7 test.py
0.577006101608
0.590042829514

python 3.2.3

$python3.2 test.py
0.8882801532745361
1.034242868423462

This appears to be an artifact of some builds of Python 3.2. The best hypothesis at this point is that all 32-bit Intel builds have the slowdown, but no 64-bit ones do. Read on for further details.

You didn't run nearly enough tests to determine anything. Repeating your test a bunch of times, I got values ranging from 0.31 to 0.54 for the same test, which is a huge variation.

So, I ran your test with 10x the number, and repeat=10, using a bunch of different Python2 and Python3 installs. Throwing away the top and bottom results, averaging the other 8, and dividing by 10 (to get a number equivalent to your tests), here's what I saw:

 1. 0.52/0.53 Lion 2.6
 2. 0.49/0.50 Lion 2.7
 3. 0.48/0.48 MacPorts 2.7
 4. 0.39/0.49 MacPorts 3.2
 5. 0.39/0.48 HomeBrew 3.2

So, it looks like 3.2 is actually slightly faster with [99], and about the same speed with [-1].

However, on a 10.5 machine, I got these results:

 1. 0.98/1.02 MacPorts 2.6
 2. 1.47/1.59 MacPorts 3.2

Back on the original (Lion) machine, I ran in 32-bit mode, and got this:

 1. 0.50/0.48 Homebrew 2.7
 2. 0.75/0.82 Homebrew 3.2

So, it seems like 32-bitness is what matters, and not Leopard vs. Lion, gcc 4.0 vs. gcc 4.2 or clang, hardware differences, etc. It would help to test 64-bit builds under Leopard, with different compilers, etc., but unfortunately my Leopard box is a first-gen Intel Mini (with a 32-bit Core Solo CPU), so I can't do that test.

As further circumstantial evidence, I ran a whole slew of other quick tests on the Lion box, and it looks like 32-bit 3.2 is ~50% slower than 2.x, while 64-bit 3.2 is maybe a little faster than 2.x. But if we really want to back that up, someone needs to pick and run a real benchmark suite.

Anyway, my best guess at this point is that when optimizing the 3.x branch, nobody put much effort into 32-bit i386 Mac builds. Which is actually a reasonable choice for them to have made.

Or, alternatively, they didn't even put much effort into 32-bit i386 period. That possibility might explain why the OP saw 2.x and 3.2 giving similar results on a linux box, while Otto Allmendinger saw 3.2 being similarly slower to 2.6 on a linux box. But since neither of them mentioned whether they were running 32-bit or 64-bit linux, it's hard to know whether that's relevant.

There are still lots of other different possibilities that we haven't ruled out, but this seems like the best one.

Why are standard string functions faster than my custom string functions?

20 votes

I decided to find the speeds of 2 functions :

  • strcmp - The standard comparison function defined in string.h
  • xstrcmp- A function that has same parameters and does the same, just that I created it.

Here is my xstrcmp function :

int xstrlen(char *str)
{
    int i;
    for(i=0;;i++)
    {
        if(str[i]=='\0')
            break;
    }
    return i;
}

int xstrcmp(char *str1, char *str2)
{
    int i, k;
    if(xstrlen(str1)!=xstrlen(str2))
        return -1;
    k=xstrlen(str1)-1;
    for(i=0;i<=k;i++)
    {
        if(str1[i]!=str2[i])
            return -1;
    }
    return 0;
}

I didn't want to depend on strlen, since I want everything user-defined.

So, I found the results. strcmp did 364 comparisons per millisecond and my xstrcmp did just 20 comparisons per millisecond (atleast on my computer!)

Can anyone tell why this is so ? What does the xstrcmp function do to make itself so fast ?

if(xstrlen(str1)!=xstrlen(str2))    //computing length of str1
    return -1;                      
k=xstrlen(str1)-1;                  //computing length of str1 AGAIN!

You're computing the length of str1 TWICE. That is one reason why your function loses the game.

Also, your implemetation of xstrcmp is very naive compared to the ones defined in (most) Standard libraries. For example, your xstrcmp compares one byte at a time, when in fact it could compare multiple bytes in one go, taking advantage of proper alignment as well, or can do little preprocessing so as to align memory blocks, before actual comparison.

Fast merge of sorted subsets of 4K floating-point numbers in L1/L2

14 votes

What is a fast way to merge sorted subsets of an array of up to 4096 32-bit floating point numbers on a modern (SSE2+) x86 processor?

Please assume the following:

  • The size of the entire set is at maximum 4096 items
  • The size of the subsets is open to discussion, but let us assume between 16-256 initially
  • All data used through the merge should preferably fit into L1
  • The L1 data cache size is 32K. 16K has already been used for the data itself, so you have 16K to play with
  • All data is already in L1 (with as high degree of confidence as possible) - it has just been operated on by a sort
  • All data is 16-byte aligned
  • We want to try to minimize branching (for obvious reasons)

Main criteria of feasibility: faster than an in-L1 LSD radix sort.

I'd be very interested to see if someone knows of a reasonable way to do this given the above parameters! :)

Here's a very naive way to do it. (Please excuse any 4am delirium-induced pseudo-code bugs ;)

//4x sorted subsets
data[4][4] = {
  {3, 4, 5, INF},
  {2, 7, 8, INF},
  {1, 4, 4, INF},
  {5, 8, 9, INF}
}

data_offset[4] = {0, 0, 0, 0}

n = 4*3

for(i=0, i<n, i++):
  sub = 0
  sub = 1 * (data[sub][data_offset[sub]] > data[1][data_offset[1]])
  sub = 2 * (data[sub][data_offset[sub]] > data[2][data_offset[2]])
  sub = 3 * (data[sub][data_offset[sub]] > data[3][data_offset[3]])

  out[i] = data[sub][data_offset[sub]]
  data_offset[sub]++


Edit:
With AVX2 and its gather support, we could compare up to 8 subsets at once.


Edit 2:
Depending on type casting, it might be possible to shave off 3 extra clock cycles per iteration on a Nehalem (mul: 5, shift+sub: 4)

//Assuming 'sub' is uint32_t
sub = ... << ((data[sub][data_offset[sub]] > data[...][data_offset[...]]) - 1)


Edit 3:
It may be possible to exploit out-of-order execution to some degree, especially as K gets larger, by using two or more max values:

max1 = 0
max2 = 1
max1 = 2 * (data[max1][data_offset[max1]] > data[2][data_offset[2]])
max2 = 3 * (data[max2][data_offset[max2]] > data[3][data_offset[3]])
...
max1 = 6 * (data[max1][data_offset[max1]] > data[6][data_offset[6]])
max2 = 7 * (data[max2][data_offset[max2]] > data[7][data_offset[7]])

q = data[max1][data_offset[max1]] < data[max2][data_offset[max2]]

sub = max1*q + ((~max2)&1)*q

Does order of where clauses matter in SQL

9 votes

Let's say I have a table called PEOPLE having 3 columns ID, LastName, FirstName, none of these columns are indexed.
LastName is more unique, and FirstName is less unique.

If I do 2 searches:

select * from PEOPLE where FirstName="F" and LastName="L" 
select * from PEOPLE where LastName="L" and FirstName="F"

My belief is the second one is faster because the more unique criterion (LastName) comes first in the where clause, and records will get eliminated more efficiently. I don't think the optimizer is smart enough to optimize the first sql.

Is my understanding correct?

No, that order doesn't matter (or at least: shouldn't matter).

Any decent query optimizer will look at all the parts of the WHERE clause and figure out the most efficient way to satisfy that query.

I know the SQL Server query optimizer will pick a suitable index - no matter which order you have your two conditions in. I assume other RDBMS will have similar strategies.

What does matter is whether or not you have a suitable index for this!

In the case of SQL Server, it will likely use an index if you have:

  • an index on (LastName, FirstName)
  • an index on (FirstName, LastName)
  • an index on just (LastName), or just (FirstName) (or both)

On the other hand - again for SQL Server - if you use SELECT * to grab all columns from a table, and the table is rather small, then there's a good chance the query optimizer will just do a table (or clustered index) scan instead of using an index (because the lookup into the full data page to get all other columns just gets too expensive very quickly).

Can Indices actually decrease SELECT performance?

6 votes

Possible Duplicate:
Degraded performance of a query after adding Index

after reading some stuff about indices on SQL Server and their performance advantages for selects and disadvantages for updates / inserts, i was wondering if badly used indices could actually also hurt performance for selects. What conditions would have to be fulfilled to have an index decrease performance of a pure select query? Do such situations exist?

Thanks!

(although I always try to include code examples, i can't think of anything that would support this question...)

Yes, albeit very slightly - so slightly that it would be justified to also answer "No".

If you have an index which might be considered for a query, but is not useable, the optimizer will waste a short time pondering whether and how to use it (in rare cases with REALLY complicated indexes and views, and more frequently when index performance hints are wrong, you might end up choosing a suboptimal query plan).

Some cases would be:

  • a table without indexes
  • a table with a badly chosen index, which get discarded
  • a table where TWO indexes exist, and for some reason (e.g. obsolete statistics), the existence of the second index makes the optimizer choose it, while it would have been more convenient to use the first.

In the first two cases the query time is the same (and entails a full scan), but in the second, you also have to analyze and discard the index.

Where an index hurts you - where ALL indexes hurt you - is in inserts, deletes and updates. Then, any index not used by the update query, yet affected by same, will require a write to the index itself.

So you will want to have indexes, but as few as you can without sacrificing SELECT performances. Actually, you might decide against indexing for a rarely used SELECT query in order to avoid having the needed index constantly updated by all other UPDATE queries.

Edit: after reading Heinzi's answer, I'd also like to add that most DB servers have maintenance tools which analyze the tables and indexes (and sometimes query performance counters too), and properly update the hints of which Heinzi spoke. So it's also important to periodically "maintain" the database to keep the optimizer supplied with up-to-date information on which indexes to choose from.

SharedResourceDictionary: edit resource in Blend

5 votes

To optimize my application, I create a SharedResourceDictionary. With this, I save about 250 mb of memory at run-time, so it works well.

My problem is in design-time, in Blend 4, I have no more access to my resource. To modify a template (witch is in my resource file), I usually right click on my control and I choose Edit Template --> Edit Current, like in this image:

enter image description here

I need to modify my template that way, it's 100x faster then to go in my resource file and find the good one... I have A LOT of resources...

My SharedResourceDictionary is implemented like this (found on the web):

public class SharedResourceDictionary : ResourceDictionary
{
    /// <summary>
    /// Internal cache of loaded dictionaries 
    /// </summary>
    public static Dictionary<Uri, ResourceDictionary> _sharedDictionaries =
        new Dictionary<Uri, ResourceDictionary>();

    /// <summary>
    /// Local member of the source uri
    /// </summary>
    private Uri _sourceUri;

    /// <summary>
    /// Gets or sets the uniform resource identifier (URI) to load resources from.
    /// </summary>
    public new Uri Source
    {
        get
        {
            if (IsInDesignMode)
                return base.Source;

            return _sourceUri;
        }
        set
        {
            _sourceUri = value;

            if (!_sharedDictionaries.ContainsKey(value))
            {
                // If the dictionary is not yet loaded, load it by setting
                // the source of the base class
                base.Source = value;

                // add it to the cache
                _sharedDictionaries.Add(value, this);
            }
            else
            {
                // If the dictionary is already loaded, get it from the cache
                MergedDictionaries.Add(_sharedDictionaries[value]);
            }
        }
    }

    /// <summary>
    /// Check if we are in Blend to prevent error
    /// </summary>
    public bool IsInDesignMode
    {
        get
        {
            return
            (bool)
            DependencyPropertyDescriptor.FromProperty(DesignerProperties.IsInDesignModeProperty,
            typeof(DependencyObject)).Metadata.DefaultValue;
        }
    }
}

Someone have an idea if there is a solution for this? I really want to keep this shared dictionary due to the memory improvement, but I also want to modify my resource easily...

Any clue will be appreciated. Thank you.

EDIT:

Doing this (SharedResourceDictionary), I also have an issue with static resource:

Provide value on 'System.Windows.Markup.StaticResourceHolder' threw an exception.

And when I change them to dynamic resource it works, or if I change the SharedResourceDictionary by a simple ResourceDictionary it also works. The project can compile but I cannot edit the resource without a modification as I said.

Are you compiling for .NET 4?

You might be hitting the bug where it fails to scan your ResourceDictionaries properly.

There's talk that you have to add your ResourceDictionaries at the the App.xaml level (in Application.Resources) level, and add a dummy default style.

This might not work for you as you seem to be adding your resources in your control.

http://blogs.windowsclient.net/rob_relyea/archive/2010/04/26/my-staticresource-reference-worked-differently-with-wpf-4-rc-than-it-does-with-wpf-4-rtm.aspx

Adding a Merged Dictionary to a Merged Dictionary

Trouble referencing a Resource Dictionary that contains a Merged Dictionary

http://connect.microsoft.com/VisualStudio/feedback/details/553528/defaultstylekey-style-not-found-in-inner-mergeddictionaries

Connecting to the Database Perl

5 votes

I am connecting to a mysql database using the following code:

my $dbh = DBI->connect("DBI:mysql:test:localhost", $user, $pass)
    or die $DBI::errstr;
my $sqlQuery  = $dbh->prepare($query) 
    or die "Can't prepare $query: $dbh->errstr\n"; 
my $rv = $sqlQuery->execute 
    or die "can't execute the query: $sqlQuery->errstr";

while (my @row= $sqlQuery->fetchrow_array()) {
    # do something;
}

My doubt is: It is fine till the time my application is interacting with small DBs. But when I move this application to a live environment where the DB size may be in 100s of GBs, what performance issues can this code cause. Effectively what I am asking is, at the line -

@row= $sqlQuery->fetchrow_array();

Will Perl copy the entire table contents and dump it into the variable. If yes, won't it cause significant performance issues for my application as well as the database server?

In the line:

@row= $sqlQuery->fetchrow_array();

One row ata time will be returned by the database to perl, if interacting with a massive database you will not dump the entire result set of the query to a variable.

$arrRef = $sqlQuery->fetchall_arrayref();

On the other hand..