Best performance questions in April 2011

In .NET, using "foreach" to iterate an instance of IEnumerable<ValueType> will create a copy? So should I prefer to use "for" instead of "foreach"?

18 votes

In .NET, using "foreach" to iterate an instance of IEnumerable will create a copy? So should I prefer to use "for" instead of "foreach"?

I wrote some code to testify this:

struct ValueTypeWithOneField
{
    private Int64 field1;
}

struct ValueTypeWithFiveField
{
    private Int64 field1;
    private Int64 field2;
    private Int64 field3;
    private Int64 field4;
    private Int64 field5;
}

public class Program
{
    static void Main(string[] args)
    {
        Console.WriteLine("one field");
        Test<ValueTypeWithOneField>();

        Console.WriteLine("-----------");

        Console.WriteLine("Five field");
        Test<ValueTypeWithFiveField>();

        Console.ReadLine();
    }

    static void Test<T>()
    {
        var test = new List<T>();
        for (int i = 0; i < 5000000; i++)
        {
            test.Add(default(T));
        }

        Stopwatch sw = new Stopwatch();

        for (int i = 0; i < 5; i++)
        {
            sw.Start();

            foreach (var item in test)
            {

            }

            sw.Stop();
            Console.WriteLine("foreach " + sw.ElapsedMilliseconds);
            sw.Restart();

            for (int j = 0; j < test.Count; j++)
            {
                T temp = test[j];
            }

            sw.Stop();
            Console.WriteLine("for " + sw.ElapsedMilliseconds);
            sw.Reset();
        }
    }}

And this is the result that I got after I ran the code:

    one field
    foreach 68
    for 72
    foreach 68
    for 72
    foreach 67
    for 72
    foreach 64
    for 73
    foreach 68
    for 72
    -----------
    Five field
    foreach 272
    for 193
    foreach 273
    for 191
    foreach 272
    for 190
    foreach 271
    for 190
    foreach 275
    for 188

As we can see in the result, "foreach" always takes more time than "for".

So should I prefer to use "for" instead of "foreach" when iterating through a generic collection of value type?

Note: thanks for the reminder, I edited the code and result. but still, foreach is running slower than for.

Your test is not accurate; in the foreach version, you're actually spinning up the enumerator and retrieving each value from the list (even though you aren't using it). In the for version, you aren't doing anything with the list at all, other than looking at its Count property. You're essentially testing the performance of an enumerator traversing a collection compared to incrementing an integer variable an equivalent number of times.

To create parity, you'd need to declare a temporary variable and assign it in each iteration of the for loop.

That being said, the answer to your question is yes. A copy of the value will be created with every assignment or return statement.

Performance

This pseudocode breakdown should explain why foreach is somewhat slower than using for in this particular instance:

foreach:

try
{
    var en = test.GetEnumerator(); //creates a ListEnumerator
    T item;

    while(en.MoveNext()) // MoveNext increments the current index and returns
                         // true if the new index is valid, or false if it's
                         // beyond the end of the list. If it returns true,
                         // it retrieves the value at that index and holds it 
                         // in an instance variable
    {
        item = en.Current; // Current retrieves the value of the current instance
                           // variable
    }
}
finally { }

for:

int index = -1;
T item;

while(++index < test.Count)
{
    item = test[index];
}

As you can see, there's simply less code in the for implementation, and foreach has a layer of abstraction (the enumerator) on top of the for. I wrote the for using a while loop to show the two versions in a similar representation.

With all that said...

You're talking about a trivial difference in execution time. Use the loop that makes the code clearer and smaller, and in this circumstance that looks like foreach.

How can assigning a variable result in a serious performance drop while the execution order is (nearly) untouched?

15 votes

When playing around with multithreading, I could observe some unexpected but serious performance issues related to AtomicLong (and classes using it, such as java.util.Random), for which I currently have no explanation. However, I created a minimalistic example, which basically consists of two classes: a class "Container", which keeps a reference to a volatile variable, and a class "DemoThread", which operates on an instance of "Container" during thread execution. Note that the references to "Container" and the volatile long are private, and never shared between threads (I know that there's no need to use volatile here, it's just for demonstration purposes) - thus, multiple instances of "DemoThread" should run perfectly parallel on a multiprocessor machine, but for some reason, they do not (Complete example is at the bottom of this post).

private static class Container  {

    private volatile long value;

    public long getValue() {
        return value;
    }

    public final void set(long newValue) {
        value = newValue;
    }
}

private static class DemoThread extends Thread {

    private Container variable;

    public void prepare() {
        this.variable = new Container();
    }

    public void run() {
        for(int j = 0; j < 10000000; j++) {
            variable.set(variable.getValue() + System.nanoTime());
        }
    }
}

During my test, I repeatedly create 4 DemoThreads, which are then started and joined. The only difference in each loop is the time when "prepare()" gets called (which is obviously required for the thread to run, as it otherwise would result in a NullPointerException):

DemoThread[] threads = new DemoThread[numberOfThreads];
    for(int j = 0; j < 100; j++) {
        boolean prepareAfterConstructor = j % 2 == 0;
        for(int i = 0; i < threads.length; i++) {
            threads[i] = new DemoThread();
            if(prepareAfterConstructor) threads[i].prepare();
        }

        for(int i = 0; i < threads.length; i++) {
            if(!prepareAfterConstructor) threads[i].prepare();
            threads[i].start();
        }
        joinThreads(threads);
    }

For some reason, if prepare() is executed immediately before starting the thread, it will take twice as more time to finish, and even without the "volatile" keyword, the performance differences were significant, at least on two of the machines and OS'es I tested the code. Here's a short summary:


Mac OS Summary:

Java Version: 1.6.0_24
Java Class Version: 50.0
VM Vendor: Sun Microsystems Inc.
VM Version: 19.1-b02-334
VM Name: Java HotSpot(TM) 64-Bit Server VM
OS Name: Mac OS X
OS Arch: x86_64
OS Version: 10.6.5
Processors/Cores: 8

With volatile keyword:
Final results:
31979 ms. when prepare() was called after instantiation.
96482 ms. when prepare() was called before execution.

Without volatile keyword:
Final results:
26009 ms. when prepare() was called after instantiation.
35196 ms. when prepare() was called before execution.


Windows Summary:

Java Version: 1.6.0_24
Java Class Version: 50.0
VM Vendor: Sun Microsystems Inc.
VM Version: 19.1-b02
VM Name: Java HotSpot(TM) 64-Bit Server VM
OS Name: Windows 7
OS Arch: amd64
OS Version: 6.1
Processors/Cores: 4

With volatile keyword:
Final results:
18120 ms. when prepare() was called after instantiation.
36089 ms. when prepare() was called before execution.

Without volatile keyword:
Final results:
10115 ms. when prepare() was called after instantiation.
10039 ms. when prepare() was called before execution.


Linux Summary:

Java Version: 1.6.0_20
Java Class Version: 50.0
VM Vendor: Sun Microsystems Inc.
VM Version: 19.0-b09
VM Name: OpenJDK 64-Bit Server VM
OS Name: Linux
OS Arch: amd64
OS Version: 2.6.32-28-generic
Processors/Cores: 4

With volatile keyword:
Final results:
45848 ms. when prepare() was called after instantiation.
110754 ms. when prepare() was called before execution.

Without volatile keyword:
Final results:
37862 ms. when prepare() was called after instantiation.
39357 ms. when prepare() was called before execution.


Mac OS Details (volatile):

Test 1, 4 threads, setting variable in creation loop
Thread-2 completed after 653 ms.
Thread-3 completed after 653 ms.
Thread-4 completed after 653 ms.
Thread-5 completed after 653 ms.
Overall time: 654 ms.

Test 2, 4 threads, setting variable in start loop
Thread-7 completed after 1588 ms.
Thread-6 completed after 1589 ms.
Thread-8 completed after 1593 ms.
Thread-9 completed after 1593 ms.
Overall time: 1594 ms.

Test 3, 4 threads, setting variable in creation loop
Thread-10 completed after 648 ms.
Thread-12 completed after 648 ms.
Thread-13 completed after 648 ms.
Thread-11 completed after 648 ms.
Overall time: 648 ms.

Test 4, 4 threads, setting variable in start loop
Thread-17 completed after 1353 ms.
Thread-16 completed after 1957 ms.
Thread-14 completed after 2170 ms.
Thread-15 completed after 2169 ms.
Overall time: 2172 ms.

(and so on, sometimes one or two of the threads in the 'slow' loop finish as expected, but most times they don't).

The given example looks theoretically, as it is of no use, and 'volatile' is not needed here - however, if you'd use a 'java.util.Random'-Instance instead of the 'Container'-Class and call, for instance, nextInt() multiple times, the same effects will occur: The thread will be executed fast if you create the object in the Thread's constructor, but slow if you create it within the run()-method. I believe that the performance issues described in Java Random Slowdowns on Mac OS more than a year ago are related to this effect, but I have no idea why it is as it is - besides that I'm sure that it shouldn't be like that, as it would mean that it's always dangerous to create a new object within the run-method of a thread, unless you know that no volatile variables will get involved within the object graph. Profiling doesn't help, as the problem disappears in this case (same observation as in Java Random Slowdowns on Mac OS cont'd), and it also does not happen on a single-core-PC - so I'd guess that it's kind of a thread synchronization problem... however, the strange thing is that there's actually nothing to synchronize, as all variables are thread-local.

Really looking forward for any hints - and just in case you want to confirm or falsify the problem, see the test case below.

Thanks,

Stephan

public class UnexpectedPerformanceIssue {

private static class Container  {

    // Remove the volatile keyword, and the problem disappears (on windows)
    // or gets smaller (on mac os)
    private volatile long value;

    public long getValue() {
        return value;
    }

    public final void set(long newValue) {
        value = newValue;
    }
}

private static class DemoThread extends Thread {

    private Container variable;

    public void prepare() {
        this.variable = new Container();
    }

    @Override
    public void run() {
        long start = System.nanoTime();
        for(int j = 0; j < 10000000; j++) {
            variable.set(variable.getValue() + System.nanoTime());
        }
        long end = System.nanoTime();
        System.out.println(this.getName() + " completed after "
                +  ((end - start)/1000000) + " ms.");
    }
}

public static void main(String[] args) {
    System.out.println("Java Version: " + System.getProperty("java.version"));
    System.out.println("Java Class Version: " + System.getProperty("java.class.version"));

    System.out.println("VM Vendor: " + System.getProperty("java.vm.specification.vendor"));
    System.out.println("VM Version: " + System.getProperty("java.vm.version"));
    System.out.println("VM Name: " + System.getProperty("java.vm.name"));

    System.out.println("OS Name: " + System.getProperty("os.name"));
    System.out.println("OS Arch: " + System.getProperty("os.arch"));
    System.out.println("OS Version: " + System.getProperty("os.version"));
    System.out.println("Processors/Cores: " + Runtime.getRuntime().availableProcessors());

    System.out.println();
    int numberOfThreads = 4;

    System.out.println("\nReference Test (single thread):");
    DemoThread t = new DemoThread();
    t.prepare();
    t.run();

    DemoThread[] threads = new DemoThread[numberOfThreads];
    long createTime = 0, startTime = 0;
    for(int j = 0; j < 100; j++) {
        boolean prepareAfterConstructor = j % 2 == 0;
        long overallStart = System.nanoTime();
        if(prepareAfterConstructor) {
            System.out.println("\nTest " + (j+1) + ", " + numberOfThreads + " threads, setting variable in creation loop");             
        } else {
            System.out.println("\nTest " + (j+1) + ", " + numberOfThreads + " threads, setting variable in start loop");
        }

        for(int i = 0; i < threads.length; i++) {
            threads[i] = new DemoThread();
            // Either call DemoThread.prepare() here (in odd loops)...
            if(prepareAfterConstructor) threads[i].prepare();
        }

        for(int i = 0; i < threads.length; i++) {
            // or here (in even loops). Should make no difference, but does!
            if(!prepareAfterConstructor) threads[i].prepare();
            threads[i].start();
        }
        joinThreads(threads);
        long overallEnd = System.nanoTime();
        long overallTime = (overallEnd - overallStart);
        if(prepareAfterConstructor) {
            createTime += overallTime;
        } else {
            startTime += overallTime;
        }
        System.out.println("Overall time: " + (overallTime)/1000000 + " ms.");
    }
    System.out.println("Final results:");
    System.out.println(createTime/1000000 + " ms. when prepare() was called after instantiation.");
    System.out.println(startTime/1000000 + " ms. when prepare() was called before execution.");
}

private static void joinThreads(Thread[] threads) {
    for(int i = 0; i < threads.length; i++) {
        try {
            threads[i].join();
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
    }
}

}

It's likely that two volatile variables a and b are too close to each other, they fall in the same cache line; although CPU A only reads/writes variable a, and CPU B only reads/writes variable b, they are still coupled to each other through the same cache line.

In your example, we have two allocation schemes:

new Thread                               new Thread
new Container               vs           new Thread
new Thread                               ....
new Container                            new Container
....                                     new Container

In the first scheme, it's very unlikely that two volatile variables are close to each other. In the 2nd scheme, it's almost certainly the case.

CPU caches don't work with individual words; instead, they deal with cache lines. A cache line is a continuous chunk of memory, say 64 neighboring bytes. Usually this is nice - if a CPU accessed a cell, it's very likely that it will access the neighboring cells too. Except in your example, that assumption is not only invalid, but detrimental.

Suppose a and b fall in the same cache line L. When CPU A updates a, it notifies other CPUs that L is dirty. Since B caches L too, because it's working on b, B must drop its cached L. So next time B needs to read b, it must reload L, which is costly.

If B must access main memory to reload, that is extremely costly, it's usually 100X slower.

Fortunately, A and B can communicate directly about the new values without going through main memory. Nevertheless it takes extra time.

To verify this theory, you can stuff extra 128 bytes in Container, so that two volatile variable of two Container will not fall in the same cache line; then you should observe that the two schemes take about the same time to execute.

Lession learned: usually CPUs assume that adjecent variables are related. If we want independent variables, we better place them far away from each other.

PHP's count(), O(1) or O(n) for arrays?

15 votes

Hi,

do you know if count() in PHP really counts the all elements of a PHP-array, or if this value is cached somewhere and just needs to be retrieved?

The docs don't say much about this and various blog posts that measure the performance of count() don't talk about it either.

(Sorry for the title didn't know how to describe it more precisely.)

Well, we can look at the source:

/ext/standard/array.c

PHP_FUNCTION(count) calls php_count_recursive(), which in turn calls zend_hash_num_elements() for non-recursive array, which is implemented this way:

ZEND_API int zend_hash_num_elements(const HashTable *ht)
{
    IS_CONSISTENT(ht);

    return ht->nNumOfElements;
}

So you can see, it's O(1) for $mode = COUNT_NORMAL.

.NET: Accessing non-public members from a dynamic assembly

14 votes

I'm working on a library that allows users to input arbitrary expressions. My library then compiles those expressions as part of a larger expression into a delegate. Now, for still unknown reasons compiling the expression with Compile sometimes/often results in code that is far slower than it would be if it weren't a compiled expression. I asked a question about this before and one workaround was to not use Compile, but CompileToMethod and create a static method on a new type in a new dynamic assembly. That works and the code is fast.

But users can input arbitrary expressions and it turns out that if the user calls a non-public function or accesses a non-public field in the expression, it throws a System.MethodAccessException (in the case of a non-public method) when the delegate is invoked.

What I could probably do here is create a new ExpressionVisitor that checks if the expression accesses anything non-public and use the slower Compile in those cases, but I'd rather have that the dynamic assembly somehow gets the rights to access the non-public members. Or find out if there's anything I can do about Compile being slower (sometimes).

The full code to reproduce this problem:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Linq.Expressions;
using System.Reflection;
using System.Reflection.Emit;

namespace DynamicAssembly
{
  public class Program
  {
    private static int GetValue()
    {
      return 1;
    }

    public static int GetValuePublic()
    {
      return 1;
    }

    public static int Foo;

    static void Main(string[] args)
    {
      Expression<Func<int>> expression = () => 10 + GetValue();

      Foo = expression.Compile()();

      Console.WriteLine("This works, value: " + Foo);

      Expression<Func<int>> expressionPublic = () => 10 + GetValuePublic();

      var compiledDynamicAssemblyPublic = (Func<int>)CompileExpression(expressionPublic);

      Foo = compiledDynamicAssemblyPublic();

      Console.WriteLine("This works too, value: " + Foo);

      var compiledDynamicAssemblyNonPublic = (Func<int>)CompileExpression(expression);

      Console.WriteLine("This crashes");

      Foo = compiledDynamicAssemblyNonPublic();
    }

    static Delegate CompileExpression(LambdaExpression expression)
    {
      var assemblyBuilder = AppDomain.CurrentDomain.DefineDynamicAssembly(
        new AssemblyName("MyAssembly"+ Guid.NewGuid().ToString("N")), 
        AssemblyBuilderAccess.Run);

      var moduleBuilder = assemblyBuilder.DefineDynamicModule("Module");

      var typeBuilder = moduleBuilder.DefineType("MyType", TypeAttributes.Public);

      var methodBuilder = typeBuilder.DefineMethod("MyMethod", 
        MethodAttributes.Public | MethodAttributes.Static);

      expression.CompileToMethod(methodBuilder);

      var resultingType = typeBuilder.CreateType();

      var function = Delegate.CreateDelegate(expression.Type, 
        resultingType.GetMethod("MyMethod"));

      return function;
    }
  }
}

The problem is not permissions because there is no permission that can allow you to access a non-public field or member of another class without reflection. This is analogous to the situation where you compiled two non-dynamic assemblies and one assembly calls a public method in the second assembly. Then if you change the method to private without recompiling the first assembly, the first assemblies call will now fail at runtime. In other words the expression in your dynamic assembly is being compiled into an ordinary method call which it doesn't have permission to call anymore than you do from another class even in the same assembly.

Since no permission can solve your problem, you might be able to transform non-public field and method references into subexpressions that use reflection.

Here is an example taken from your test case. This fails:

Expression<Func<int>> expression = () => 10 + GetValue();

but this will succeed:

Expression<Func<int>> expression = () => 10 + (int)typeof(Program).GetMethod("GetValue", BindingFlags.Static | BindingFlags.NonPublic).Invoke(null, null);

Since this does not crash with an exception, you can see that your dynamic assembly does have reflection permission and it can access the private method, it just can't do it using an ordinary method call that CompileToMethod results in.

Discontinuous slice in python list

13 votes

Hi folks

I'm looking for an efficient way of achieving this, which I think is a slicing-like operation:

>>> mylist = range(100)
>>>magicslicer(mylist, 10, 20)
[0,1,2,3,4,5,6,7,8,9,30,31,32,33,34,35,36,37,38,39,60,61,62,63......,97,98,99]

the idea is: the slicing gets 10 elements, then skips 20 elements, then gets next 10, then skips next 20, and so on.

I think I should not use loops if possible, for the very reason to use slice is (I guess) to do the "extraction" efficiently in a single operation.

Thanks for reading.

itertools.compress (new in 2.7/3.1) nicely supports use cases like this one, especially when combined with itertools.cycle:

from itertools import cycle, compress
seq = range(100)
criteria = cycle([True]*10 + [False]*20) # Use whatever pattern you like
>>> list(compress(seq, criteria))
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99]

Python 2.7 timing (relative to Sven's explicit list comprehension):

$ ./python -m timeit -s "a = range(100)" "[x for start in range(0, len(a), 30) for x in a[start:start+10]]"
100000 loops, best of 3: 4.96 usec per loop

$ ./python -m timeit -s "from itertools import cycle, compress" -s "a = range(100)" -s "criteria = cycle([True]*10 + [False]*20)" "list(compress(a, criteria))"
100000 loops, best of 3: 4.76 usec per loop

Python 3.2 timing (also relative to Sven's explicit list comprehension):

$ ./python -m timeit -s "a = range(100)" "[x for start in range(0, len(a), 30) for x in a[start:start+10]]"
100000 loops, best of 3: 7.41 usec per loop

$ ./python -m timeit -s "from itertools import cycle, compress" -s "a = range(100)" -s "criteria = cycle([True]*10 + [False]*20)" "list(compress(a, criteria))"
100000 loops, best of 3: 4.78 usec per loop

As can be seen, it doesn't make a great deal of difference relative to the in-line list comprehension in 2.7, but helps significantly in 3.2 by avoiding the overhead of the implicit nested scope.

A similar difference can also be seen in 2.7 if the aim is to iterate over the resulting sequence rather than turn it into a fully realised list:

$ ./python -m timeit -s "a = range(100)" "for x in (x for start in range(0, len(a), 30) for x in a[start:start+10]): pass"
100000 loops, best of 3: 6.82 usec per loop
$ ./python -m timeit -s "from itertools import cycle, compress" -s "a = range(100)" -s "criteria = cycle([True]*10 + [False]*20)" "for x in compress(a, criteria): pass"
100000 loops, best of 3: 3.61 usec per loop

For especially long patterns, it is possible to replace the list in the pattern expression with an expression like chain(repeat(True, 10), repeat(False, 20)) so that it never has to be fully created in memory.

Performance problem with Euler problem and recursion on Int64 types

11 votes

I'm currently learning Haskell using the project Euler problems as my playground. I was astound by how slow my Haskell programs turned out to be compared to similar programs written in other languages. I'm wondering if I've forseen something, or if this is the kind of performance penalties one has to expect when using Haskell.

The following program in inspired by Problem 331, but I've changed it before posting so I don't spoil anything for other people. It computes the arc length of a discrete circle drawn on a 2^30 x 2^30 grid. It is a simple tail recursive implementation and I make sure that the updates of the accumulation variable keeping track of the arc length is strict. Yet it takes almost one and a half minute to complete (compiled with the -O flag with ghc).

import Data.Int

arcLength :: Int64->Int64
arcLength n = arcLength' 0 (n-1) 0 0 where
    arcLength' x y norm2 acc
        | x > y = acc
        | norm2 < 0 = arcLength' (x + 1) y (norm2 + 2*x +1) acc
        | norm2 > 2*(n-1) = arcLength' (x - 1) (y-1) (norm2 - 2*(x + y) + 2) acc
        | otherwise = arcLength' (x + 1) y (norm2 + 2*x + 1) $! (acc + 1)

main = print $ arcLength (2^30)

Here is a corresponding implementation in Java. It takes about 4.5 seconds to complete.

public class ArcLength {
public static void main(String args[]) {
    long n = 1 << 30;
    long x = 0;
    long y = n-1;
    long acc = 0;
    long norm2 = 0;
    long time = System.currentTimeMillis();

    while(x <= y) {
        if (norm2 < 0) {
            norm2 += 2*x + 1;
            x++;
        } else if (norm2 > 2*(n-1)) {
            norm2 += 2 - 2*(x+y);
            x--;
            y--;
        } else {
            norm2 += 2*x + 1;
            x++;
            acc++;
        }
    }

    time = System.currentTimeMillis() - time;
    System.err.println(acc);
    System.err.println(time);
}

}

EDIT: After the discussions in the comments I made som modifications in the Haskell code and did some performance tests. First I changed n to 2^29 to avoid overflows. Then I tried 6 different version: With Int64 or Int and with bangs before either norm2 or both and norm2 and acc in the declaration arcLength' x y !norm2 !acc. All are compiled with

ghc -O3 -prof -rtsopts -fforce-recomp -XBangPatterns arctest.hs

Here are the results:

(Int !norm2 !acc)
total time  =        3.00 secs   (150 ticks @ 20 ms)
total alloc =       2,892 bytes  (excludes profiling overheads)

(Int norm2 !acc)
total time  =        3.56 secs   (178 ticks @ 20 ms)
total alloc =       2,892 bytes  (excludes profiling overheads)

(Int norm2 acc)
total time  =        3.56 secs   (178 ticks @ 20 ms)
total alloc =       2,892 bytes  (excludes profiling overheads)

(Int64 norm2 acc)
arctest.exe: out of memory

(Int64 norm2 !acc)
total time  =       48.46 secs   (2423 ticks @ 20 ms)
total alloc = 26,246,173,228 bytes  (excludes profiling overheads)

(Int64 !norm2 !acc)
total time  =       31.46 secs   (1573 ticks @ 20 ms)
total alloc =       3,032 bytes  (excludes profiling overheads)

I'm using GHC 7.0.2 under a 64-bit Windows 7 (The Haskell platform binary distribution). According to the comments, the problem does not occur when compiling under other configurations. This makes me think that the Int64 type is broken in the Windows release.

Hm, I installed a fresh Haskell platform with 7.0.3, and get roughly the following core for your program (-ddump-simpl):

Main.$warcLength' =
  \ (ww_s1my :: GHC.Prim.Int64#) (ww1_s1mC :: GHC.Prim.Int64#)
    (ww2_s1mG :: GHC.Prim.Int64#) (ww3_s1mK :: GHC.Prim.Int64#) ->
    case {__pkg_ccall ghc-prim hs_gtInt64 [...]
           ww_s1my ww1_s1mC GHC.Prim.realWorld#
[...]

So GHC has realized that it can unpack your integers, which is good. But this hs_getInt64 call looks suspiciously like a C call. Looking at the assembler output (-ddump-asm), we see stuff like:

pushl %eax
movl 76(%esp),%eax
pushl %eax
call _hs_gtInt64
addl $16,%esp

So this looks very much like every operation on the Int64 get turned into a full-blown C call in the backend. Which is slow, obviously.

The source code of GHC.IntWord64 seems to verify that: In a 32-bit build (like the one currently shipped with the platform), you will have only emulation via the FFI interface.

How to find performance hot spots in .Net application?

6 votes

I have an existing ASP.NET 2.0 web service serves several WinForms clients. In our application, We belive we have performance problem in several levels.

  1. Sending toomuch data in syncrounous request
  2. Lazy loading, Too many round trip between web service and database
  3. POCO <-> Sql object mapping using untyped datasets and reflection[no caching]

This is an existing application with large code base, I would like to instrument this app to find out hotspots.

  1. How can I instrument remote apps like winforms client deployed in remote places?
  2. How can I instrument Web Service?
  3. Edit** Are there any profilers better than VS profiler?
  4. Can I trust the profilers to tell me the hot spots, so I don't pollute my code with instrumentation? Or Do I have to take the middle road, profiler + instrumentation?

This may help:
ASP.NET and WorkerThreads – Interesting performance tweaks
20 Tips to Improve ASP.net Application Performance
Use Custom Http Handlers To Improve Performance in ASP.NET
Best practices for ASP.Net applications
Scaling Strategies for ASP.NET Applications

Optimising a Haskell XML parser

5 votes

I'm experimenting with Haskell at the moment and am very much enjoying the experience, but am evaluating it for a real project with some fairly stringent performance requirements. The first pass of my task is to process a complete (no-history) dump of wikipedia (bzipped) - totalling about 6Gb compressed. In python a script to do a full extract of each raw page (about 10 million in total) takes about 30 minutes on my box (and for reference a scala implementation using the pull parser takes about 40 mins). I've been attempting to replicate this performance using Haskell and ghc and have been struggling to match this.

I've been using Codec.Compression.BZip for decompression and hexpat for parsing. I'm using lazy bytestrings as the input to hexpat and strict bytestrings for the element text type. And to extract the text for each page I'm building up a Dlist of pointers to text elements and then iterating over this to dump it out to stdout. The code I've just described has already been through a number of profiling/refactor iterations (I quickly moved from strings to bytestrings, then from string concatenation to lists of pointers to text - then to dlists of pointers to text). I think I've got about 2 orders of magnitude speedup from the original code, but it still takes over an hour and a half to parse (although it has a lovely small memory footprint). So I'm looking for a bit of inspiration from the community to get me the extra mile. The code is below (and I've broken it up into a number of subfunctions in order to get more detail from the profiler). Please excuse my Haskell - I've only been coding for a couple of days (having spent a week with Real World Haskell). And thanks in advance!

import System.Exit
import Data.Maybe
import Data.List
import Data.DList (DList)
import qualified Data.DList as DList

import Data.ByteString.Char8 (ByteString)
import qualified Data.ByteString.Char8 as BS
import qualified Data.ByteString.Lazy as LazyByteString
import qualified Codec.Compression.BZip as BZip

import Text.XML.Expat.Proc
import Text.XML.Expat.Tree
import Text.XML.Expat.Format

testFile = "../data/enwiki-latest-pages-articles.xml.bz2"

validPage pageData = case pageData of
    (Just _, Just _) -> True
    (_, _) -> False

scanChildren :: [UNode ByteString] -> DList ByteString
scanChildren c = case c of
    h:t -> DList.append (getContent h) (scanChildren t)
    []  -> DList.fromList []

getContent :: UNode ByteString -> DList ByteString
getContent treeElement =
    case treeElement of
        (Element name attributes children)  -> scanChildren children
        (Text text)                         -> DList.fromList [text]

rawData t = ((getContent.fromJust.fst) t, (getContent.fromJust.snd) t)

extractText page = do
    revision <- findChild (BS.pack "revision") page
    text <- findChild (BS.pack "text") revision
    return text

pageDetails tree =
    let pageNodes = filterChildren relevantChildren tree in
    let getPageData page = (findChild (BS.pack "title") page, extractText page) in
    map rawData $ filter validPage $ map getPageData pageNodes
    where
        relevantChildren node = case node of
            (Element name attributes children) -> name == (BS.pack "page")
            (Text _) -> False

outputPages pagesText = do
    let flattenedPages = map DList.toList pagesText
    mapM_ (mapM_ BS.putStr) flattenedPages

readCompressed fileName = fmap BZip.decompress (LazyByteString.readFile fileName)
parseXml byteStream = parse defaultParseOptions byteStream :: (UNode ByteString, Maybe XMLParseError)

main = do
    rawContent <- readCompressed testFile
    let (tree, mErr) = parseXml rawContent
    let pages = pageDetails tree
    let pagesText = map snd pages
    outputPages pagesText
    putStrLn "Complete!"
    exitWith ExitSuccess

After running your program I get somewhat weird results:

./wikiparse  +RTS -s -A5m -H5m | tail
./wikiparse +RTS -s -A5m -H5m
3,604,204,828,592 bytes allocated in the heap
  70,746,561,168 bytes copied during GC
      39,505,112 bytes maximum residency (37822 sample(s))
       2,564,716 bytes maximum slop
              83 MB total memory in use (0 MB lost due to fragmentation)

  Generation 0: 620343 collections,     0 parallel, 15.84s, 368.69s elapsed
  Generation 1: 37822 collections,     0 parallel,  1.08s, 33.08s elapsed

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time  243.85s  (4003.81s elapsed)
  GC    time   16.92s  (401.77s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time  260.77s  (4405.58s elapsed)

  %GC time       6.5%  (9.1% elapsed)

  Alloc rate    14,780,341,336 bytes per MUT second

  Productivity  93.5% of total user, 5.5% of total elapsed

Total time is more than OK I think: 260s is way faster than 30m for Python. I have no idea though why the overall time is so big here. I really don't think that reading 6Gb file would take more than an hour to complete.

I'm running your program again to check if the results are consistent.

If the result of those 4'20'' is right, then I believe something is wrong with the machine... or there is some other strange effect here.

The code was compiled on GHC 7.0.2.


Edit: I tried various versions of the program above. The most important optimization seems to be {-# INLINE #-} pragma and specialization of functions. Some have pretty generic types, which is known to be bad for performance. OTOH I believe inlining should be enough to trigger the specialization, so you should try to experiment further with this.

I didn't see any significant difference across the versions of GHC I tried (6.12 .. HEAD).

Haskell bindings to bzlib seems to have optimal performance. The following program, which is near-complete reimplementation of standard bzcat program, is as fast or even faster than the original.

module Main where

import qualified Data.ByteString.Lazy as BSL
import qualified Codec.Compression.BZip as BZip
import System.Environment (getArgs)

readCompressed fileName = fmap (BZip.decompress) (BSL.readFile fileName)

main :: IO ()
main = do
    files <- getArgs
    mapM_ (\f -> readCompressed f >>= BSL.putStr) files                 

On my machine it takes ~1100s to decompress the test file to /dev/null. The fastest version I was able to get was based on SAX style parser. I'm not sure though if the output matches that of the original. On small outputs the result is the same, and so is the performance. On the original file the SAX version is somewhat faster and completes in ~2400s. You can find it below.

{-# LANGUAGE OverloadedStrings #-}

import System.Exit
import Data.Maybe

import Data.ByteString.Char8 (ByteString)
import qualified Data.ByteString as BS
import qualified Data.ByteString.Lazy as BSL
import qualified Codec.Compression.BZip as BZip

import System.IO

import Text.XML.Expat.SAX as SAX

type ByteStringL = BSL.ByteString
type Token = ByteString
type TokenParser = [SAXEvent Token Token] -> [[Token]]

testFile = "/tmp/enwiki-latest-pages-articles.xml.bz2"


readCompressed :: FilePath -> IO ByteStringL
readCompressed fileName = fmap (BZip.decompress) (BSL.readFile fileName)

{-# INLINE pageStart #-}
pageStart :: TokenParser
pageStart ((StartElement "page" _):xs) = titleStart xs
pageStart (_:xs) = pageStart xs
pageStart [] = []

{-# INLINE titleStart #-}
titleStart :: TokenParser
titleStart ((StartElement "title" _):xs) = finish "title" revisionStart xs
titleStart ((EndElement "page"):xs) = pageStart xs
titleStart (_:xs) = titleStart xs
titleStart [] = error "could not find <title>"


{-# INLINE revisionStart #-}
revisionStart :: TokenParser
revisionStart ((StartElement "revision" _):xs) = textStart xs
revisionStart ((EndElement "page"):xs) = pageStart xs
revisionStart (_:xs) = revisionStart xs
revisionStart [] = error "could not find <revision>"

{-# INLINE textStart #-}
textStart :: TokenParser
textStart ((StartElement "text" _):xs) = textNode [] xs
textStart ((EndElement "page"):xs) = pageStart xs
textStart (_:xs) = textStart xs
textStart [] = error "could not find <text>"

{-# INLINE textNode #-}
textNode :: [Token] -> TokenParser
textNode acc ((CharacterData txt):xs) = textNode (txt:acc) xs
textNode acc xs = (reverse acc) : textEnd xs

{-# INLINE textEnd #-}
textEnd {- , revisionEnd, pageEnd -} :: TokenParser
textEnd = finish "text" . finish "revision" . finish "page" $ pageStart
--revisionEnd = finish "revision" pageEnd
--pageEnd = finish "page" pageStart

{-# INLINE finish #-}
finish :: Token -> TokenParser -> TokenParser
finish tag cont ((EndElement el):xs) | el == tag = cont xs
finish tag cont (_:xs) = finish tag cont xs
finish tag _ [] = error (show (tag,("finish []" :: String)))

main :: IO ()
main = do
  rawContent <- readCompressed testFile
  let parsed = (pageStart (SAX.parse defaultParseOptions rawContent))
  mapM_ (mapM_ BS.putStr) ({- take 5000 -} parsed) -- remove comment to finish early
  putStrLn "Complete!"

Generally I'm suspicious that Python's and Scala's versions are finishing early. I couldn't verify that claim though without the source code.

To sum up: inlining and specialization should give reasonable, about two-fold increase in performance.