Best performance questions in September 2010

std::vector is so much slower than plain arrays?

41 votes

I've always thought it's the general wisdom that std::vector is "implemented as an array," blah blah blah. Today I went down and tested it, seems to be not so:

Here's some test results:

UseArray completed in 2.619 seconds
UseVector completed in 9.284 seconds
UseVectorPushBack completed in 14.669 seconds
The whole thing completed in 26.591 seconds

That's about 3 - 4 times slower! Doesn't really justify for the "vector may be slower for a few nanosecs" comments.

And the code I used:

#include <cstdlib>
#include <vector>

#include <iostream>
#include <string>

#include <boost/date_time/posix_time/ptime.hpp>
#include <boost/date_time/microsec_time_clock.hpp>

class TestTimer
{
public:
    TestTimer(const std::string & name) : name(name),
        start(boost::date_time::microsec_clock<boost::posix_time::ptime>::local_time())
    {
    }

    ~TestTimer()
    {
        using namespace std;
        using namespace boost;

        posix_time::ptime now(date_time::microsec_clock<posix_time::ptime>::local_time());
        posix_time::time_duration d = now - start;

        cout << name << " completed in " << d.total_milliseconds() / 1000.0 <<
            " seconds" << endl;
    }

private:
    std::string name;
    boost::posix_time::ptime start;
};

struct Pixel
{
    Pixel()
    {
    }

    Pixel(unsigned char r, unsigned char g, unsigned char b) : r(r), g(g), b(b)
    {
    }

    unsigned char r, g, b;
};

void UseVector()
{
    TestTimer t("UseVector");

    for(int i = 0; i < 1000; ++i)
    {
        int dimension = 999;

        std::vector<Pixel> pixels;
        pixels.resize(dimension * dimension);

        for(int i = 0; i < dimension * dimension; ++i)
        {
            pixels[i].r = 255;
            pixels[i].g = 0;
            pixels[i].b = 0;
        }
    }
}

void UseVectorPushBack()
{
    TestTimer t("UseVectorPushBack");

    for(int i = 0; i < 1000; ++i)
    {
        int dimension = 999;

        std::vector<Pixel> pixels;
            pixels.reserve(dimension * dimension);

        for(int i = 0; i < dimension * dimension; ++i)
            pixels.push_back(Pixel(255, 0, 0));
    }
}

void UseArray()
{
    TestTimer t("UseArray");

    for(int i = 0; i < 1000; ++i)
    {
        int dimension = 999;

        Pixel * pixels = (Pixel *)malloc(sizeof(Pixel) * dimension * dimension);

        for(int i = 0 ; i < dimension * dimension; ++i)
        {
            pixels[i].r = 255;
            pixels[i].g = 0;
            pixels[i].b = 0;
        }

        free(pixels);
    }
}

int main()
{
    TestTimer t1("The whole thing");

    UseArray();
    UseVector();
    UseVectorPushBack();

    return 0;
}

Am I doing it wrong or something? Or have I just busted this performance myth?

Edit: I'm using Release mode in MSVS2005.


In MSVC, #define _SECURE_SCL 0 reduces UseVector by half (bringing it down to 4 seconds). This is really huge IMO. Someone please upvote me a couple more times so this gets known by more folks :p

Using the following:

g++ -O3 Time.cpp -I <MyBoost>
./a.out
UseArray completed in 2.196 seconds
UseVector completed in 4.412 seconds
UseVectorPushBack completed in 8.017 seconds
The whole thing completed in 14.626 seconds

So array is twice as quick as vector.

But after looking at the code in more detail this is expected; as you run across the vector twice and the array only once. Note: when you resize() the vector you are not only allocating the memory but also running through the vector and calling the constructor on each member.

Re-Arranging the code slightly so that the vector only initializes each object once:

 std::vector<Pixel>  pixels(dimensions * dimensions, Pixel(255,0,0));

Now doing the same timing again:

g++ -O3 Time.cpp -I <MyBoost>
./a.out
UseVector completed in 2.216 seconds

The vector now performance only slightly worse than the array. IMO this difference is insignificant and could be caused by a whole bunch of things not associated with the test.

I would also take into account that you are not correctly initializing/Destroying the Pixel object in the UseArrray() method as neither constructor/destructor is not called (this may not be an issue for this simple class but anything slightly more complex (ie with pointers or members with pointers) will cause problems.

ASP.NET MVC 3 Razor performance

33 votes

Important Update: See update 5 at the bottom there is no performance issue in asp.net mvc 3, this is a benchmark issue

I've made a simple hello world project in asp.net mvc2,3 aspx and 3 razor and benchmarked them. What I see is:

asp.net mvc 2 aspx : 4200 request per second

asp.net mvc 3 aspx beta 1: 3200 request per second

asp.net mvc 3 razor beta 1: 1700 request per second

What's wrong with razor, it's so slow?

Update: I've redone the test. All 4 test virtual directories uses the same .net 4 integrated mode app pool. All projects are done with add new empty x project and adding 1 page with 1 line text and with no code. all sites are compiled in release mode. My system is windows 7, 4 gb i7 4 cores. I've run the test 2 times to warm up iis and these are second run results. apache bench parameters : ab -n100000 -c1000 results:

asp.net 4 - rps: 4780 cpu: %43

mcv 2 - rps: 4322 cpu: %58

mvc 3 beta 1 aspx - rps: 2324 cpu: %54

mvc 3 beta 1 razor - rps: 1615 cpu: %54

Update 2 Scott Guthrie answered in his blog:

We haven't fully optimized MVC3 yet (there is usually a lot of cache tuning we do). We expect razor to be the same performance as the .aspx view engine before it is finally released.

Update 3 Release Candidate 1 benchmark:

mvc 3 rc1 razor - rps: 1960 cpu: %54

Update 4 Release Candidate 2 benchmark:

mvc 3 rc2 razor - rps: 2187 cpu: %54

mvc 3 rc2 aspx - rps: 4014 cpu: %58

Update 5 all tests done in release mode but, the problem was debug="true" in my web.config file (that also effects release builds), after change it to false, issue fixed. And it's interesting how it's effecting only razor templating at this scale. This should be in our mind on deployments.

mvc 3 rc2 razor - rps: 3940 cpu: %58

mvc 3 rc2 aspx - rps: 4100 cpu: %58

Thanks to asp.net mvc team, excellent job!

(new answer to respond to your RC2 numbers)

Thanks for the updated numbers. A few points:

  1. Your Aspx numbers look good, in the sense that we expect MVC3 Aspx to be on par with MVC2 Aspx (a bit slower in such a Hello World example is expected)
  2. Your Razor numbers look suspect. We know Razor is a bit slower than the equivalent Aspx, however, the difference should be no larger than 5%-7%. Your numbers indicate 50% slower, which simply does not match up with our results. Check if the project compiles in Release and you have debug="false" set in web.config.
  3. Your CPU utilization is a bit suspect. With 1000 concurrent requests the CPU should be utlized 100%. (Even only 8 concurrent requests should be enough since you have 8 virtual cores)
  4. Your tests are running for about 20-25 seconds. That's a bit on the low side because a short (1-2 seconds) burst of activity elsewhere in the system could throw off your results quite significantly.
  5. Related to point 4, did you run each scenario once or a few times? Are you seeing much variance in the results? Since your OS is doing other things in the background it's typical to see different results between runs.

JavaScript variables declare outside or inside loop?

17 votes

In AS3 I believe you should initialise all variables outside loops for increased performance. Is this the case with JavaScript as well? Which is better / faster / best-practice?

var value = 0;

for (var i; i < 100; i++)
{
    value = somearray[i];
}

or

for (var i; i < 100; i++)
{
    var value = somearray[i];
}

There is absolutely no difference in meaning or performance, in JavaScript or ActionScript.

var is a directive for the parser, and not a command executed at run-time. If a particular identifier has been declared var once or more anywhere in a function body(*), then all use of that identifier in the block will be referring to the local variable. It makes no difference whether value is declared to be var inside the loop, outside the loop, or both.

Consequently you should write whichever you find most readable. I disagree with Crockford that putting all the vars at the top of a function is always the best thing. For the case where a variable is used temporarily in a section of code, it's better to declare var in that section, so the section stands alone and can be copy-pasted. Otherwise, copy-paste a few lines of code to a new function during refactoring, without separately picking out and moving the associated var, and you've got yourself an accidental global.

In particular:

for (var i; i<100; i++)
    do something;

for (var i; i<100; i++)
    do something else;

Crockford will recommend you remove the second var (or remove both vars and do var i; above), and jslint will whinge at you for this. But IMO it's more maintainable to keep both vars, keeping all the related code together, instead of having an extra, easily-forgotten bit of code at the top of the function.

Personally I tend to declare as var the first assignment of a variable in an independent section of code, whether or not there's another separate usage of the same variable name in some other part of the same function. For me, having to declare var at all is an undesirable JS wart (it would have been better to have variables default to local); I don't see it as my duty to duplicate the limitations of [an old revision of] ANSI C in JavaScript as well.

(*: other than in nested function bodies)

Fastest Way to Serve a File Using PHP

16 votes

I'm trying to put together a function that receives a file path, identifies what it is, sets the appropriate headers, and serves it just like Apache would.

The reason I am doing this is because I need to use PHP to process some information about the request before serving the file.

Speed is critical

virtual() isn't an option

Must work in a shared hosting environment where the user has no control of the web server (Apache/nginx, etc)

Here's what I've got so far:

File::output($path);

<?php
class File {
static function output($path) {
    // Check if the file exists
    if(!File::exists($path)) {
        header('HTTP/1.0 404 Not Found');
        exit();
    }

    // Set the content-type header
    header('Content-Type: '.File::mimeType($path));

    // Handle caching
    $fileModificationTime = gmdate('D, d M Y H:i:s', File::modificationTime($path)).' GMT';
    $headers = getallheaders();
    if(isset($headers['If-Modified-Since']) && $headers['If-Modified-Since'] == $fileModificationTime) {
        header('HTTP/1.1 304 Not Modified');
        exit();
    }
    header('Last-Modified: '.$fileModificationTime);

    // Read the file
    readfile($path);

    exit();
}

static function mimeType($path) {
    preg_match("|\.([a-z0-9]{2,4})$|i", $path, $fileSuffix);

    switch(strtolower($fileSuffix[1])) {
        case 'js' :
            return 'application/x-javascript';
        case 'json' :
            return 'application/json';
        case 'jpg' :
        case 'jpeg' :
        case 'jpe' :
            return 'image/jpg';
        case 'png' :
        case 'gif' :
        case 'bmp' :
        case 'tiff' :
            return 'image/'.strtolower($fileSuffix[1]);
        case 'css' :
            return 'text/css';
        case 'xml' :
            return 'application/xml';
        case 'doc' :
        case 'docx' :
            return 'application/msword';
        case 'xls' :
        case 'xlt' :
        case 'xlm' :
        case 'xld' :
        case 'xla' :
        case 'xlc' :
        case 'xlw' :
        case 'xll' :
            return 'application/vnd.ms-excel';
        case 'ppt' :
        case 'pps' :
            return 'application/vnd.ms-powerpoint';
        case 'rtf' :
            return 'application/rtf';
        case 'pdf' :
            return 'application/pdf';
        case 'html' :
        case 'htm' :
        case 'php' :
            return 'text/html';
        case 'txt' :
            return 'text/plain';
        case 'mpeg' :
        case 'mpg' :
        case 'mpe' :
            return 'video/mpeg';
        case 'mp3' :
            return 'audio/mpeg3';
        case 'wav' :
            return 'audio/wav';
        case 'aiff' :
        case 'aif' :
            return 'audio/aiff';
        case 'avi' :
            return 'video/msvideo';
        case 'wmv' :
            return 'video/x-ms-wmv';
        case 'mov' :
            return 'video/quicktime';
        case 'zip' :
            return 'application/zip';
        case 'tar' :
            return 'application/x-tar';
        case 'swf' :
            return 'application/x-shockwave-flash';
        default :
            if(function_exists('mime_content_type')) {
                $fileSuffix = mime_content_type($path);
            }
            return 'unknown/' . trim($fileSuffix[0], '.');
    }
}
}
?>

My previous answer was partial and not well documented, here is an update with a summary of the solutions from it and from others in the discussion.

The solutions are ordered from best solution to worst but also from the solution needing the most control over the web server to the one needing the less. There don't seem to be an easy way to have one solution that is both fast and work everywhere.


Using the X-SendFile header

As documented by others it's actually the best way. The basis is that you do your access control in php and then instead of sending the file yourself you tell the web server to do it.

The basic php code is :

header("X-Sendfile: $file_name");
header("Content-type: application/octet-stream");
header('Content-Disposition: attachment; filename="' . basename($file_name) . '"');

Where $file_name is the full path on the file system.

The main problem with this solution is that it need to be allowed by the web server and either isn't installed by default (apache), isn't active by default (lighttpd) or need a specific configuration (nginx).

Apache

Under apache if you use mod_php you need to install a module called mod_xsendfile then configure it (either in apache config or .htaccess if you allow it)

XSendFile on
XSendFilePath /home/www/example.com/htdocs/files/

With this module the file path could either be absolute or relative to the specified XSendFilePath.

Lighttpd

The mod_fastcgi support this when configured with

"allow-x-send-file" => "enable" 

The documentation for the feature is on the lighttpd wiki they document the X-LIGHTTPD-send-file header but the X-Sendfile name also work

Nginx

On Nginx you can't use the X-Sendfile header you must use their own header that is named X-Accel-Redirect. It is enabled by default and the only real difference is that it's argument should be an URI not a file system. The consequence is that you must define a location marked as internal in your configuration to avoid clients finding the real file url and going directly to it, their wiki contains a good explanation of this.

Symlinks and Location header

You could use symlinks and redirect to them, just create symlinks to your file with random names when an user is authorized to access a file and redirect the user to it using:

header("Location: " . $url_of_symlink);

Obviously you'll need a way to prune them either when the script to create them is called or via cron (on the machine if you have access or via some webcron service otherwise)

Under apache you need to be able to enable FollowSymLinks in a .htaccess or in the apache config.

Access control by IP and Location header

Another hack is to generate apache access files from php allowing the explicit user IP. Under apache it mean using mod_authz_host (mod_access) Allow from commands.

The problem is that locking access to the file (as multiple users may want to do this at the same time) is non trivial and could lead to some users waiting a long time. And you still need to prune the file anyway.

Obviously another problem would be that multiple people behind the same IP could potentially access the file.

When everything else fail

If you really don't have any way to get your web server to help you, the only solution remaining is readfile it's available in all php versions currently in use and work pretty well (but isn't really efficient).


Combining solutions

In fine, the best way to send a file really fast if you want your php code to be usable everywhere is to have a configurable option somewhere, with instructions on how to activate it depending on the web server and maybe an auto detection in your install script.

It is pretty similar to what is done in a lot of software for

  • Clean urls (mod_rewrite on apache)
  • Crypto functions (mcrypt php module)
  • Multibyte string support (mbstring php module)

Why does this simple mysql insert query take occasionally so long?

15 votes

Ok, I've got a real head scratcher... I'm going bald!

This is a pretty simple problem. Inserting data into the table normally works fine, except for a few times, the insert query takes a few seconds. This isn't very good, so I setup a simulation of the insert process. I am NOT trying to bulk insert data. I am trying to find out why the insert query occasionally takes more than 2 seconds to run. Joshua suggested that the index file may be being adjusted; I have removed the id (primary key field), but the delay still happens.

I have a MyISAM table: daniel_test_insert (this table starts COMPLETELY empty):

create table if not exists daniel_test_insert ( 
    id int unsigned auto_increment not null, 
    value_str varchar(255) not null default '', 
    value_int int unsigned default 0 not null, 
    primary key (id) 
)

I insert data into it, and sometimes, a insert query takes > 2 seconds to run. THERE ARE NO READS on this table. All writes, in serial, by a single threaded program.

This same row; 100,000 times. I run the exact same query 100,000 times, because once in a while the query takes a long time, and I'm trying to find out why. It appears to be a random occurrence so far though.

This query for example took 4.194 seconds (a very long time for an insert)

Query: INSERT INTO daniel_test_insert SET value_int=12345, value_str='afjdaldjsf aljsdfl ajsdfljadfjalsdj fajd as f' - ran for 4.194 seconds
status               | duration | cpu_user  | cpu_system | context_voluntary | context_involuntary | page_faults_minor
starting             | 0.000042 | 0.000000  | 0.000000   | 0                 | 0                   | 0                
checking permissions | 0.000024 | 0.000000  | 0.000000   | 0                 | 0                   | 0                
Opening tables       | 0.000024 | 0.001000  | 0.000000   | 0                 | 0                   | 0                
System lock          | 0.000022 | 0.000000  | 0.000000   | 0                 | 0                   | 0                
Table lock           | 0.000020 | 0.000000  | 0.000000   | 0                 | 0                   | 0                
init                 | 0.000029 | 0.000000  | 0.000000   | 1                 | 0                   | 0                
update               | 4.067331 | 12.151152 | 5.298194   | 204894            | 18806               | 477995           
end                  | 0.000094 | 0.000000  | 0.000000   | 8                 | 0                   | 0                
query end            | 0.000033 | 0.000000  | 0.000000   | 1                 | 0                   | 0                
freeing items        | 0.000030 | 0.000000  | 0.000000   | 1                 | 0                   | 0                
closing tables       | 0.125736 | 0.278958  | 0.072989   | 4294              | 604                 | 2301             
logging slow query   | 0.000099 | 0.000000  | 0.000000   | 1                 | 0                   | 0                
logging slow query   | 0.000102 | 0.000000  | 0.000000   | 7                 | 0                   | 0                
cleaning up          | 0.000035 | 0.000000  | 0.000000   | 7                 | 0                   | 0

This is an abbreviated version of the SHOW PROFILE command, I threw out the columns that were all zero.

Now the update has an incredible number of context switches and minor page faults.

Opened_Tables increases about 1 per 10 seconds on this database (not running out of table_cache space)

Stats:

MySQL 5.0.89

Hardware: 32 Gigs of ram / 8 cores @ 2.66GHz; raid 10 SCSI harddisks (SCSI II???) I have had the harddrives and raid controller queried: no errors are being reported. CPU's are about 50% idle.

iostat -x 5 (reports less than 10% utilization for harddisks) top report load average about 10 for 1 minute (normal for our db machine)

Swap space has 156k used (32 gigs of ram :)

I'm at a loss to find out what is causing this performance lag! Does anyone have any suggestions?

This does NOT happen on our low-load slaves, only on our high load master. This also happens with memory and innodb tables.

Warning: This is a production system, so nothing exotic!

-daniel (I'm going to have use my dogs hair for a tuopee!!!)

Updated: Sept 20th, 2010: I'm going bald!

I have noticed the same phenomenon on my systems. Queries which normally take a millisecond will suddenly take 1-2 seconds. All of my cases are simple, single table INSERT/UPDATE/REPLACE statements --- not on any SELECTs. No load, locking, or thread build up is evident.

I had suspected that it's due to clearing out dirty pages, flushing changes to disk, or some hidden mutex, but I have yet to narrow it down.

Also Ruled Out

  • Server load -- no correlation with high load
  • Engine -- happens with InnoDB/MyISAM/Memory
  • MySQL Query Cache -- happens whether it's on or off
  • Log rotations -- no correlation in events

The only other observation I have at this point is derived from the fact I'm running the same db on multiple machines. I have a heavy read application so I'm using an environment with replication -- most of the load is on the slaves. I've noticed that even though there is minimal load on the master, the phenomenon occurs more there. Even though I see no locking issues, maybe it's Innodb/Mysql having trouble with (thread) concurrency? Recall that the updates on the slave will be single threaded.

MySQL Verion 5.1.48

Update

I think I have a lead for the problem on my case. On some of my servers, I noticed this phenomenon on more than the others. Seeing what was different between the different servers, and tweaking things around, I was lead to the MySQL innodb system variable innodb_flush_log_at_trx_commit.

I found the doc a bit awkward to read, but innodb_flush_log_at_trx_commit can take the values of 1,2,0:

  • For 1, the log buffer is flushed to the log file for every commit, and the log file is flushed to disk for every commit.
  • For 2, the log buffer is flushed to the log file for every commit, and the log file is flushed to disk approximately every 1-2 seconds.
  • For 0, the log buffer is flushed to the log file every second, and the log file is flushed to disk every second.

Effectively, in the order (1,2,0), as reported and documented, you're supposed to get with increasing performance in trade for increased risk.

Having said that, I found that the servers with innodb_flush_log_at_trx_commit=0 were performing worse (i.e. having 10-100 times more "long updates") than the servers with innodb_flush_log_at_trx_commit=2. Moreover, things immediately improved on the bad instances when I switched it to 2 (note you can change it on the fly).

So, my question is, what is yours set to? Note that I'm not blaming this parameter, but rather highlighting that it's context is related to this issue.

How to improve the performance of this Haskell program?

13 votes

I'm working through the problems in Project Euler as a way of learning Haskell, and I find that my programs are a lot slower than a comparable C version, even when compiled. What can I do to speed up my Haskell programs?

For example, my brute-force solution to Problem 14 is:

import Data.Int
import Data.Ord
import Data.List

searchTo = 1000000

nextNumber :: Int64 -> Int64
nextNumber n
    | even n    = n `div` 2
    | otherwise = 3 * n + 1

sequenceLength :: Int64 -> Int
sequenceLength 1 = 1
sequenceLength n = 1 + (sequenceLength next)
    where next = nextNumber n

longestSequence = maximumBy (comparing sequenceLength) [1..searchTo]

main = putStrLn $ show $ longestSequence

Which takes around 220 seconds, while an "equivalent" brute-force C version only takes 1.2 seconds.

#include <stdio.h>

int main(int argc, char **argv)
{
    int longest = 0;
    int terms = 0;
    int i;
    unsigned long j;

    for (i = 1; i <= 1000000; i++)
    {
        j = i;
        int this_terms = 1;

        while (j != 1)
        {
            this_terms++;

            if (this_terms > terms)
            {
                terms = this_terms;
                longest = i;
            }

            if (j % 2 == 0)
                j = j / 2;
            else
                j = 3 * j + 1;
        }
    }

    printf("%d\n", longest);
    return 0;
}

What am I doing wrong? Or am I naive to think that Haskell could even approach C's speed?

(I'm compiling the C version with gcc -O2, and the Haskell version with ghc --make -O).

For testing purpose I have just set searchTo = 100000. The time taken is 7.34s. A few modification leads to some big improvement:

  1. Use an Integer instead of Int64. This improves the time to 1.75s.

  2. Use an accumulator (you don't need sequenceLength to be lazy right?) 1.54s.

    seqLen2 :: Int -> Integer -> Int
    seqLen2 a 1 = a
    seqLen2 a n = seqLen2 (a+1) (nextNumber n)
    
    
    sequenceLength :: Integer -> Int
    sequenceLength = seqLen2 1
    
  3. Rewrite the nextNumber using quotRem, thus avoiding computing the division twice (once in even and once in div). 1.27s.

    nextNumber :: Integer -> Integer
    nextNumber n 
        | r == 0    = q
        | otherwise = 6*q + 4
        where (q,r) = quotRem n 2 
    
  4. Use Schwartzian transform instead of maximumBy. The problem of maximumBy . comparing is that the sequenceLength function is called more than once for each value. 0.32s.

    longestSequence = snd $ maximum [(sequenceLength a, a) | a <- [1..searchTo]]
    

Note:

  • I check the time by compiling with ghc -O and run with +RTS -s)
  • My machine is running on Mac OS X 10.6. The GHC version is 6.12.2. The compiled file is in i386 architecture.)
  • The C problem runs at 0.078s with the corresponding parameter. It is compiled with gcc -O3 -m32.