Best questions in May 2011

Cycles in family tree software

452 votes

I am the developer of some family tree software (written in C++ and Qt). I had no problems until one of my customers mailed me a bug report. The problem is that he has two children with his own daughter, and he can't use my software because of errors.

Those errors are the result of my various assertions and invariants about the family graph being processed (for example, after walking a cycle, the program states that X can't be both father and grandfather of Y).

How can I resolve those errors without removing all data assertions?

I guess that your have some value that uniquly identifies a Person on which you base your checks.

Tricky one this. Assuming you want to keep the structure a tree.

I suggest this:

Assume this A has kids with his own daughter.

A adds himself to the programm as A and as B. Once in the role of Father once in the role of lets call it boyfriend.

Add a is_same_for_out() function which tells the output generating part of your programm that all links going to B internally should be going to A on presentation of data.

This will make some extra work for the user but would I guess be relatively easy to implement and maintain.

Building from that you could work on code syncing A and B to avoid inconsistencies.

This solution sure is not perfect but a first approach.

What does && mean in void *p = &&abc;

71 votes

I came across a piece of code void *p = &&abc;. What is the significance of && here? I know about rvalue references but I think && used in this context is different. What does && indicate in void *p = &&abc; ?

&& is gcc's extension to get the address of the label defined in the current function.

void *p = &&abc is illegal in standard C99 and C++.

This compiles with g++

Why in python 0 < 0 == 0 is False?

61 votes

Looking into Queue.py in python 2.6 I found this construct that I found a bit strange:

def full(self):
    """Return True if the queue is full, False otherwise
    (not reliable!)."""
    self.mutex.acquire()
    n = 0 < self.maxsize == self._qsize()
    self.mutex.release()
    return n

If maxsize is 0 the queue is never full.

My question is how does it work for this case? How 0 < 0 == 0 is considered False?

>>> 0 < 0 == 0
False
>>> (0) < (0 == 0)
True
>>> (0 < 0) == 0
True
>>> 0 < (0 == 0)
True

I believe Python has special case handling for sequences of relational operators to make range comparisons easy to express. It's much nicer to be able to say 0 < x <= 5 than to say (0 < x) and (x <= 5).

These are called chained comparisons. And that's a link to the documentation for them.

With the other cases you talk about, the parenthesis force one relational operator to be applied before the other, and so they are no longer chained comparisons. And since True and False have values as integers you get the answers you do out of the parenthesized versions.

.prop() vs .attr()

53 votes

Ok.

So jQuery 1.6 has the new function prop().

$(selector).click(function(){
    //instead of:
    this.getAttribute('style');
    //do i use:
    $(this).prop('style');
    //or:
    $(this).attr('style');
})

or in this case do they do the same thing?

And if I do have to switch to using prop(), all the old attr() calls will break if i switch to 1.6?

UPDATE

See this fiddle: http://jsfiddle.net/maniator/JpUF2/

The console logs the getAttribute as a string, and the attr as a string, but the prop as a CSSStyleDeclaration, Why? And how does that affect my coding in the future?

If you've only ever used jQuery and not the DOM directly, this could be a confusing change, although it is definitely an improvement conceptually. Not so good for the bazillions of sites using jQuery that will break as a result of this change though.

I'll summarize the main issues:

  • You usually want prop() rather than attr().
  • In the majority of cases, prop() does what attr() used to do. Replacing calls to attr() with prop() in your code will generally work.
  • Forget about attributes. You nearly always want a property, not an attribute. Where both a property and an attribute with the same name exists, the property always represents the current state while the attribute (in most browsers) represents the initial state. This affects properties such as the value property of <input> elements.
  • This change removes some of the layer of misguided magic jQuery stuck in front of attributes and properties, meaning jQuery developers will have to learn a bit about the difference between properties and attributes. This is a good thing.
  • Properties are generally simpler to deal with than attributes.

If you're a jQuery developer and are confused by this whole business about properties and attributes, you need to take a step back and learn a little about it, since jQuery is no longer trying to shield you from this stuff. For the authoritative but somewhat dry word on the subject, there's the specs: HTML DOM spec, DOM Level 2 spec, DOM Level 3 spec. Mozilla's DOM documentation is valid for most modern browsers and is easier to read than the specs, so you may find their DOM reference helpful. There's a section on element properties.

As an example of how properties are simpler to deal with than attributes, consider a checkbox that is initially checked. Here are two possible pieces of valid HTML to do this:

<input id="foo" type="checkbox" checked>
<input id="foo" type="checkbox" checked="checked">

So, how do you find out if the checkbox is checked with jQuery? Look on Stack Overflow and you'll commonly find the following suggestions:

  • if ( $("#cb").attr("checked") === true ) {...}
  • if ( $("#cb").attr("checked") == "checked" ) {...}
  • if ( $("#cb").is(":checked") ) {...}

This is actually the simplest thing in the world to do with the checked Boolean property, which has existed and worked flawlessly in every major scriptable browser since 1995:

if (document.getElementById("cb").checked) {...}

The property also makes checking or unchecking the checkbox trivial:

document.getElementById("cb").checked = false

In jQuery 1.6, this unambiguously becomes

$("#cb").prop("checked", false)

The idea of using the checked attribute for scripting a checkbox is unhelpful and unnecessary. The property is what you need.

  • It's not obvious what the correct way to check or uncheck the checkbox is using the checked attribute
  • The attribute value never changes to reflect the current state (except in some older versions of IE, thus making things still harder). Once the checkbox has been checked or unchecked (either by the user or by script), the attribute tells you nothing about the current state and is entirely useless. See http://jsfiddle.net/VktA6/.

C++: is return value a L-value?

48 votes

Consider this code:

struct foo
{
  int a;
};

foo q() { foo f; f.a =4; return f;}

int main()
{
  foo i;
  i.a = 5;
  q() = i;
}

No compiler complains about it, even Clang. Why q() = ... line is correct?

No, the return value of a function is an l-value if and only if it is a reference (C++03). (5.2.2 [expr.call] / 10)

If the type returned were a basic type then this would be a compile error. (5.17 [expr.ass] / 1)

The reason that this works is that you are allowed to call member functions (even non-const member functions) on r-values of class type and the assignment of foo is an implementation defined member function: foo& foo::operator=(const foo&). The restrictions for operators in clause 5 only apply to built-in operators, (5 [expr] / 3), if overload resolution selects an overloaded function call for an operator then the restrictions for that function call apply instead.

This is why it is sometimes recommended to return objects of class type as const objects (e.g. const foo q();), however this can have a negative impact in C++0x where it can inhibit move semantics from working as they should.

VB.NET vs C# integer division

43 votes

Anyone care to explain why these two pieces of code exhibit different results?

VB.NET v4.0

Dim p As Integer = 16
Dim i As Integer = 10
Dim y As Integer = p / i
//Result: 2

C# v4.0

int p = 16;
int i = 10;
int y = p / i;
//Result: 1

When you look at the IL-code that those two snippets produce, you'll realize that VB.NET first converts the integer values to doubles, applies the division and then rounds the result before it's converted back to int32 and stored in y.

C# does none of that.

VB.NET IL Code:

IL_0000:  ldc.i4.s    10 
IL_0002:  stloc.1     
IL_0003:  ldc.i4.s    0A 
IL_0005:  stloc.0     
IL_0006:  ldloc.1     
IL_0007:  conv.r8     
IL_0008:  ldloc.0     
IL_0009:  conv.r8     
IL_000A:  div         
IL_000B:  call        System.Math.Round
IL_0010:  conv.ovf.i4 
IL_0011:  stloc.2     
IL_0012:  ldloc.2     
IL_0013:  call        System.Console.WriteLine

C# IL Code:

IL_0000:  ldc.i4.s    10 
IL_0002:  stloc.0     
IL_0003:  ldc.i4.s    0A 
IL_0005:  stloc.1     
IL_0006:  ldloc.0     
IL_0007:  ldloc.1     
IL_0008:  div         
IL_0009:  stloc.2     
IL_000A:  ldloc.2     
IL_000B:  call        System.Console.WriteLine

The "proper" integer division needs a backwards slash: p \ i

Why does C# allow {} code blocks without a preceding statement?

43 votes

Why does C# allow code blocks without a preceding statement (e.g. if, else, for, while)?

void Main()
{
    {   // any sense in this?
        Console.Write("foo");
    }
}

In the context you give, there is no significance. Writing a constant string to the console is going to work the same way anywhere in program flow.

Instead, you typically use them to restrict the scope of some local variables. This is further elaborated here and here. Look at João Angelo’s answer and Chris Wallis’s answer for brief examples. I believe the same applies to some other languages with C-style syntax as well, not that they’d be relevant to this question though.

How does _gaq.push(['_trackPageLoadTime']) work?

43 votes

How does the Google Analytics Site Speed feature, _gaq.push(['_trackPageLoadTime']), work? Is there any documentation about how it works?

From the Google Analytics Help Center:

This report currently supports the following browsers: Chrome, Internet Explorer 9 and previous versions of Internet Explorer with the Google Toolbar installed. More specifically, the Site Speed reports require browsers that support the HTML5 NavigationTiming interface or have the Google Internet Explorer toolbar installed

So, it doesn't implement its own timer, like many prior homeback solutions had, to figure out how long it takes a page to load. Instead, it uses a new HTML5 feature, currently only supported in the above listed cases, called NavigationTiming.

(Important to note that it doesn't run on every load; instead, it currently samples around 2% of pageviews, though it is configured to try to track 10% page loads on 10% of visits; as more browsers support the NavigationTiming API, you can expect this number to begin to get closer to 10%.)

This interface is accessed under the DOM object window.performance (or, in earlier versions of Chrome, window.webkitPerformance), using the timing attribute (so, window.performance.timing). The object stores measured values of all of the key page load event times, and Google Analytics subtracts 2 of the more important outer values to judge page load speed.

For a load of Mashable.com without cache, here's an example of what it measures (in Chrome 11):

timing = {
connectEnd: 1306677079337,
connectStart: 1306677079337,
domComplete: 1306677083482,
domContentLoadedEventEnd: 1306677081765,
domContentLoadedEventStart: 1306677081576,
domInteractive: 1306677081576,
domLoading: 1306677079478,
domainLookupEnd: 1306677079337,
domainLookupStart: 1306677079337,
fetchStart: 1306677079337,
loadEventEnd: 1306677083483,
loadEventStart: 1306677083482,
navigationStart: 1306677079337,
redirectEnd: 0,
redirectStart: 0,
requestStart: 1306677079394,
responseEnd: 1306677079669,
responseStart: 1306677079476,
secureConnectionStart: 0,
unloadEventEnd: 0,
unloadEventStart: 0
}

Those numbers are epoch milliseconds, or milliseconds since January 1, 1970. I have not seen any documentation as to which values they subtract to generate their values, but from a cursory inspection of the ga.js, it looks like it is loadEventStart-fetchStart:

h&&h[c]!=k&&h.isValidLoadTime?b=h[c]:e&&e[a]&&(b=e[a].loadEventStart-e[a].fetchStart);

For the above sample, that means it would record 4.14 seconds in the _trackPageLoadTime call.

From the W3C Navigation Timing spec:

fetchStart attribute

If the new resource is to be fetched using HTTP GET or equivalent, fetchStart must return the time immediately before the user agent starts checking any relevant application caches. Otherwise, it must return the time when the user agent starts fetching the resource.

loadEventStart attribute

This attribute must return the time immediately before the load event of the the current document is fired. It must return zero when the load event is not fired yet.

For curious parties, the ordering appears to be as follows:

connectStart, connectEnd, domainLookupStart, domainLookupEnd, fetchStart, navigationStart, requestStart, responseStart, domLoading, responseEnd, domContentLoadedEventStart, domInteractive, domContentLoadedEventEnd, domComplete, loadEventStart, loadEventEnd

For the 0 values listed:

unloadEventStart and unloadEventStart show the times for the previous page load's unloading (but only if that page has the same origin as the current one.)

redirectEnd and redirectStart measure the latency added if there was an HTTP redirect in the page load chain.

secureConnectionStart appears to be a (non-standard?) measurement for measuring the SSL connection time.

"cannot implement interface member" error when interface and concrete are in different projects

41 votes

This compiles:

public interface IMyInterface
{
    event Action<dynamic> OnSomeEvent;
}

class MyInterface : IMyInterface
{
    public event Action<dynamic> OnSomeEvent;
}

But when i separate the interface and the implementation to different projects, i get:

Accessor 'TestProject2.MyInterface.OnSomeEvent.remove' cannot implement interface member 'InterfaceNamespace.IMyInterface.remove_OnSomeEvent(System.Action)' for type 'TestProject2.MyInterface'. Use an explicit interface implementation.

This occurs only with a dynamic parameter...

Good catch. This looks like it's possibly a bug in the C# compiler - I'll ping Eric Lippert to see what he thinks. (dynamic can be a bit tricksy; there may well be a perfectly good but non-obvious reason for this error.)

EDIT: The code below appears not to work after all. I could have sworn I had it working this morning... I'm very confused as to what's going on. As per Simon's comments, the code fails with a message saying it's not supported by the language.

Note that if you do use explicit interface implementation, it appears to compile just fine:

// Doesn't actually compile - see edit above
class MyInterface : IMyInterface
{
    private Action<dynamic> foo;

    event Action<dynamic> IMyInterface.OnSomeEvent
    {
        // TODO (potentially): thread safety
        add { foo += value; }
        remove { foo -= value; }
    }
}

EDIT: The rest of this answer still stands...

Note that you can't specify a field-like event as an explicitly-implemented event, i.e. this doesn't work:

event Action<dynamic> IMyInterface.OnSomeEvent;

It gives the following error message:

Test.cs(15,39): error CS0071: An explicit interface implementation of an event must use event accessor syntax

And if you just try to change to the event accessor syntax, you get the same error as the original code.

Note that changing the event to a property works fine with an auto-implemented property implementation.

LogLog algorithm for counting of large cardinalities

37 votes

Where can I find a valid implementation of LogLog algorithm? Have tried to implement it by myself but my draft implementation yields strange results.

Here it is:

function LogLog(max_error, max_count)
{
    function log2(x)
    {
         return Math.log(x) / Math.LN2;
    }

    var m = 1.30 / max_error;
    var k = Math.ceil(log2(m * m));
    m = Math.pow(2, k);

    var k_comp = 32 - k;

    var l = log2(log2(max_count / m));
    if (isNaN(l)) l = 1; else l = Math.ceil(l);
    var l_mask = ((1 << l) - 1) >>> 0;

    var M = [];
    for (var i = 0; i < m; ++i) M[i] = 0;

    function count(hash)
    {
          if (hash !== undefined)
          {
                var j = hash >>> k_comp;

                var rank = 0;
                for (var i = 0; i < k_comp; ++i)
                {
                     if ((hash >>> i) & 1)
                     {
                          rank = i + 1;
                          break;
                     }
                }

                M[j] = Math.max(M[j], rank & l_mask);
          }
          else
          {
                var c = 0;
                for (var i = 0; i < m; ++i) c += M[i];
                return 0.79402 * m * Math.pow(2, c / m);
          }
    }

    return {count: count};
}

function fnv1a(text)
{
     var hash = 2166136261;
     for (var i = 0; i < text.length; ++i)
     {
          hash ^= text.charCodeAt(i);
          hash += (hash << 1) + (hash << 4) + (hash << 7) +
            (hash << 8) + (hash << 24);
     }
    return hash >>> 0;
}

var words = ['aardvark', 'abyssinian', ... ,'zoology']; // about 2 300 words

var log_log = LogLog(0.01, 100000);
for (var i = 0; i < words.length; ++i) log_log.count(fnv1a(words[i]));
alert(log_log.count());

For unknown reason implementation is very sensitive to max_error parameter, it is the main factor that determines the magnitude of the result. I'm sure, there is some stupid mistake :)

UPDATE: This problem is solved in the newer version of algorithm. I will post its implementation later.

Here it is the updated version of the algorithm based on the newer paper:

var pow_2_32 = 0xFFFFFFFF + 1;

function HyperLogLog(std_error)
{
     function log2(x)
     {
          return Math.log(x) / Math.LN2;
     }

     function rank(hash, max)
     {
          var r = 1;
          while ((hash & 1) == 0 && r <= max) { ++r; hash >>>= 1; }
          return r;
     }

     var m = 1.04 / std_error;
     var k = Math.ceil(log2(m * m)), k_comp = 32 - k;
     m = Math.pow(2, k);

     var alpha_m = m == 16 ? 0.673
          : m == 32 ? 0.697
          : m == 64 ? 0.709
          : 0.7213 / (1 + 1.079 / m);

     var M = []; for (var i = 0; i < m; ++i) M[i] = 0;

     function count(hash)
     {
          if (hash !== undefined)
          {
                var j = hash >>> k_comp;
                M[j] = Math.max(M[j], rank(hash, k_comp));
          }
          else
          {
                var c = 0.0;
                for (var i = 0; i < m; ++i) c += 1 / Math.pow(2, M[i]);
                var E = alpha_m * m * m / c;

                // -- make corrections

                if (E <= 5/2 * m)
                {
                     var V = 0;
                     for (var i = 0; i < m; ++i) if (M[i] == 0) ++V;
                     if (V > 0) E = m * Math.log(m / V);
                }
                else if (E > 1/30 * pow_2_32)
                     E = -pow_2_32 * Math.log(1 - E / pow_2_32);

                // --

                return E;
          }
    }

    return {count: count};
}

function fnv1a(text)
{
     var hash = 2166136261;
     for (var i = 0; i < text.length; ++i)
     {
          hash ^= text.charCodeAt(i);
          hash += (hash << 1) + (hash << 4) + (hash << 7) +
            (hash << 8) + (hash << 24);
     }
     return hash >>> 0;
}

var words = ['aardvark', 'abyssinian', ..., 'zoology']; // 2336 words

var seed = Math.floor(Math.random() * pow_2_32); // make more fun

var log_log = HyperLogLog(0.065);
for (var i = 0; i < words.length; ++i) log_log.count(fnv1a(words[i]) ^ seed);
var count = log_log.count();
alert(count + ', error ' +
    (count - words.length) / (words.length / 100.0) + '%');

Why include when you can require in PHP?

36 votes

What is the point of attempting to include or include_once a library in PHP if you often have to use require or require_once instead (as they're usually critically important)?

The difference between include/include_once and require/require_once is that when a file is not found, the former triggers a warning whereas the latter triggers an error. Opinions may vary but IMO, if a needed file is not present or readable for some reason, this represents a very broken situation and I would rather see an error and halt execution, than have a warning and proceed execution in a clearly broken context. So I believe it to be best practice to use require/require_once exclusively and avoid include/include_once altogether.

java: "final" System.out?

34 votes

System.out is declared as public static final PrintStream out.

But you can call System.setOut() to reassign it.

Huh? How is this possible if it's final?

(same point applies to System.in and System.err)

And more importantly, if you can mutate the public static final fields, what does this mean as far as the guarantees (if any) that final gives you? (I never realized nor expected System.in/out/err behaved as final variables)

JLS 17.5.4 Write Protected Fields:

Normally, final static fields may not be modified. However System.in, System.out, and System.err are final static fields that, for legacy reasons, must be allowed to be changed by the methods System.setIn, System.setOut and System.setErr. We refer to these fields as being write-protected to distinguish them from ordinary final fields.

The compiler needs to treat these fields differently from other final fields. For example, a read of an ordinary final field is "immune" to synchronization: the barrier involved in a lock or volatile read does not have to affect what value is read from a final field. Since the value of write-protected fields may be seen to change, synchronization events should have an effect on them. Therefore, the semantics dictate that these fields be treated as normal fields that cannot be changed by user code, unless that user code is in the System class.

By the way, actually you can mutate final fields via reflection by calling setAccessible(true) on them (or by using Unsafe methods). Such techniques are used during deserialization, by Hibernate and other frameworks, etc, but they have one limitation: code that have seen value of final field before modification is not guaranteed to see the new value after modification. What's special about the fields in question is that they are free of this limitation since they are treated in special way by the compiler.

C# Method Resolution, long vs int

33 votes
class foo
{
  public void bar(int i) { ... };
  public void bar(long i) { ... };
}


foo.bar(10);

I would expect this code to give me some error, or at least an warning, but not so...

What version of bar() is called, and why?

The int version of bar is being called, because 10 is an int literal and the compiler will look for the method which closest matches the input variable(s). To call the long version, you'll need to specify a long literal like so: foo.bar(10L);

Here is a post by Eric Lippert on much more complicated versions of method overloading. I'd try and explain it, but he does a much better job and I ever could: http://blogs.msdn.com/b/ericlippert/archive/2006/04/05/odious-ambiguous-overloads-part-one.aspx

from the C# 4.0 Specification:

Method overloading permits multiple methods in the same class to have the same name as long as they have unique signatures. When compiling an invocation of an overloaded method, the compiler uses overload resolution to determine the specific method to invoke. Overload resolution finds the one method that best matches the arguments or reports an error if no single best match can be found. The following example shows overload resolution in effect. The comment for each invocation in the Main method shows which method is actually invoked.

 class Test {   
      static void F() {
        Console.WriteLine("F()");   
      }     
      static void F(object x) {
        Console.WriteLine("F(object)");     
      }
      static void F(int x) {
        Console.WriteLine("F(int)");    
      }
      static void F(double x) {
        Console.WriteLine("F(double)");     
      }
      static void F<T>(T x) {
        Console.WriteLine("F<T>(T)");   
      }
      static void F(double x, double y) {
        Console.WriteLine("F(double,double)");  
      }     

      static void Main() {
        F();                // Invokes F()
        F(1);           // Invokes F(int)
        F(1.0);         // Invokes F(double)
        F("abc");       // Invokes F(object)
        F((double)1);       // Invokes F(double)
        F((object)1);       // Invokes F(object)
        F<int>(1);      // Invokes F<T>(T)
        F(1, 1);        // Invokes F(double, double)
      } 
}

As shown by the example, a particular method can always be selected by explicitly casting the arguments to the exact parameter types and/or explicitly supplying type arguments.

Best practice to record large amount of hits into MySQL database

32 votes

Well, this is the thing. Let's say that my future PHP CMS need to drive 500k visitors daily and I need to record them all in MySQL database (referrer, ip address, time etc.). This way I need to insert 300-500 rows per minute and update 50 more. The main problem is that script would call database every time I want to insert new row, which is every time someone hits a page.

My question, is there any way to locally cache incoming hits first (and what is the best solution for that apc, csv...?) and periodically send them to database every 10 minutes for example? Is this good solution and what is the best practice for this situation?

500k daily it's just 5-7 queries per second. If each request will be served for 0.2 sec, then you will have almost 0 simultaneous queries, so there is nothing to worry about.
Even if you will have 5 times more users - all should work fine.
You can just use INSERT DELAYED and tune your mysql.
About tuning: http://www.day32.com/MySQL/ - there is very useful script (will change nothing, just show you the tips how to optimize settings).

You can use memcache or APC to write log there first, but with using INSERT DELAYED MySQL will do almost same work, and will do it better :)

Do not use files for this. DB will serve locks much better, than PHP. It's not so trivial to write effective mutexes, so let DB (or memcache, APC) do this work.

Does multithreading emphasize memory fragmentation ?

31 votes

Description

When allocating and deallocating randomly sized memory chunks with 4 or more threads using openmp's parallel for construct, the program seems to start leaking considerable amounts of memory in the second half of the test-program's runtime. Thus it increases its consumed memory from 1050 MB to 1500 MB or more without actually making use of the extra memory.

As valgrind shows no issues, I must assume that what appears to be a memory leak actually is an emphasized effect of memory fragmentation.

Interestingly, the effect does not show yet if 2 threads make 10000 allocations each, but it shows strongly if 4 threads make 5000 allocations each. Also, if the maximum size of allocated chunks is reduced to 256kb (from 1mb), the effect gets weaker.

Can heavy concurrency emphasize fragmentation that much ? Or is this more likely to be a bug in the heap ?

Test Program Description

The demo program is build to obtain a total of 256 MB of randomly sized memory chunks from the heap, doing 5000 allocations. If the memory limit is hit, the chunks allocated first will be deallocated until the memory consumption falls below the limit. Once 5000 allocations where performed, all memory is released and the loop ends. All this work is done for each thread generated by openmp.

This memory allocation scheme allows us to expect a memory consumption of ~260 MB per thread (including some bookkeeping data).

Demo Program

As this is really something you might want to test, you can download the sample program with a simple makefile from dropbox.

When running the program as is, you should have at least 1400 MB of RAM available. Feel free to adjust the constants in the code to suit your needs.

For completeness, the actual code follows:

#include <stdlib.h>
#include <stdio.h>
#include <iostream>
#include <vector>
#include <deque>

#include <omp.h>
#include <math.h>

typedef unsigned long long uint64_t;

void runParallelAllocTest()
{
    // constants
    const int  NUM_ALLOCATIONS = 5000; // alloc's per thread
    const int  NUM_THREADS = 4;       // how many threads?
    const int  NUM_ITERS = NUM_THREADS;// how many overall repetions

    const bool USE_NEW      = true;   // use new or malloc? , seems to make no difference (as it should)
    const bool DEBUG_ALLOCS = false;  // debug output

    // pre store allocation sizes
    const int  NUM_PRE_ALLOCS = 20000;
    const uint64_t MEM_LIMIT = (1024 * 1024) * 256;   // x MB per process
    const size_t MAX_CHUNK_SIZE = 1024 * 1024 * 1;

    srand(1);
    std::vector<size_t> allocations;
    allocations.resize(NUM_PRE_ALLOCS);
    for (int i = 0; i < NUM_PRE_ALLOCS; i++) {
        allocations[i] = rand() % MAX_CHUNK_SIZE;   // use up to x MB chunks
    }


    #pragma omp parallel num_threads(NUM_THREADS)
    #pragma omp for
    for (int i = 0; i < NUM_ITERS; ++i) {
        uint64_t long totalAllocBytes = 0;
        uint64_t currAllocBytes = 0;

        std::deque< std::pair<char*, uint64_t> > pointers;
        const int myId = omp_get_thread_num();

        for (int j = 0; j < NUM_ALLOCATIONS; ++j) {
            // new allocation
            const size_t allocSize = allocations[(myId * 100 + j) % NUM_PRE_ALLOCS ];

            char* pnt = NULL;
            if (USE_NEW) {
                pnt = new char[allocSize];
            } else {
                pnt = (char*) malloc(allocSize);
            }
            pointers.push_back(std::make_pair(pnt, allocSize));

            totalAllocBytes += allocSize;
            currAllocBytes  += allocSize;

            // fill with values to add "delay"
            for (int fill = 0; fill < (int) allocSize; ++fill) {
                pnt[fill] = (char)(j % 255);
            }


            if (DEBUG_ALLOCS) {
                std::cout << "Id " << myId << " New alloc " << pointers.size() << ", bytes:" << allocSize << " at " << (uint64_t) pnt << "\n";
            }

            // free all or just a bit
            if (((j % 5) == 0) || (j == (NUM_ALLOCATIONS - 1))) {
                int frees = 0;

                // keep this much allocated
                // last check, free all
                uint64_t memLimit = MEM_LIMIT;
                if (j == NUM_ALLOCATIONS - 1) {
                    std::cout << "Id " << myId << " about to release all memory: " << (currAllocBytes / (double)(1024 * 1024)) << " MB" << std::endl;
                    memLimit = 0;
                }
                //MEM_LIMIT = 0; // DEBUG

                while (pointers.size() > 0 && (currAllocBytes > memLimit)) {
                    // free one of the first entries to allow previously obtained resources to 'live' longer
                    currAllocBytes -= pointers.front().second;
                    char* pnt       = pointers.front().first;

                    // free memory
                    if (USE_NEW) {
                        delete[] pnt;
                    } else {
                        free(pnt);
                    }

                    // update array
                    pointers.pop_front();

                    if (DEBUG_ALLOCS) {
                        std::cout << "Id " << myId << " Free'd " << pointers.size() << " at " << (uint64_t) pnt << "\n";
                    }
                    frees++;
                }
                if (DEBUG_ALLOCS) {
                    std::cout << "Frees " << frees << ", " << currAllocBytes << "/" << MEM_LIMIT << ", " << totalAllocBytes << "\n";
                }
            }
        } // for each allocation

        if (currAllocBytes != 0) {
            std::cerr << "Not all free'd!\n";
        }

        std::cout << "Id " << myId << " done, total alloc'ed " << ((double) totalAllocBytes / (double)(1024 * 1024)) << "MB \n";
    } // for each iteration

    exit(1);
}

int main(int argc, char** argv)
{
    runParallelAllocTest();

    return 0;
}

The Test-System

From what I see so far, the hardware matters a lot. The test might need adjustments if run on a faster machine.

Intel(R) Core(TM)2 Duo CPU     T7300  @ 2.00GHz
Ubuntu 10.04 LTS 64 bit
gcc 4.3, 4.4, 4.6
3988.62 Bogomips

Testing

Once you have executed the makefile, you should get a file named ompmemtest. To query the memory usage over time, I used the following commands:

./ompmemtest &
top -b | grep ompmemtest

Which yields the quite impressive fragmentation or leaking behaviour. The expected memory consumption with 4 threads is 1090 MB, which became 1500 MB over time:

PID   USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
11626 byron     20   0  204m  99m 1000 R   27  2.5   0:00.81 ompmemtest                                                                              
11626 byron     20   0  992m 832m 1004 R  195 21.0   0:06.69 ompmemtest                                                                              
11626 byron     20   0 1118m 1.0g 1004 R  189 26.1   0:12.40 ompmemtest                                                                              
11626 byron     20   0 1218m 1.0g 1004 R  190 27.1   0:18.13 ompmemtest                                                                              
11626 byron     20   0 1282m 1.1g 1004 R  195 29.6   0:24.06 ompmemtest                                                                              
11626 byron     20   0 1471m 1.3g 1004 R  195 33.5   0:29.96 ompmemtest                                                                              
11626 byron     20   0 1469m 1.3g 1004 R  194 33.5   0:35.85 ompmemtest                                                                              
11626 byron     20   0 1469m 1.3g 1004 R  195 33.6   0:41.75 ompmemtest                                                                              
11626 byron     20   0 1636m 1.5g 1004 R  194 37.8   0:47.62 ompmemtest                                                                              
11626 byron     20   0 1660m 1.5g 1004 R  195 38.0   0:53.54 ompmemtest                                                                              
11626 byron     20   0 1669m 1.5g 1004 R  195 38.2   0:59.45 ompmemtest                                                                              
11626 byron     20   0 1664m 1.5g 1004 R  194 38.1   1:05.32 ompmemtest                                                                              
11626 byron     20   0 1724m 1.5g 1004 R  195 40.0   1:11.21 ompmemtest                                                                              
11626 byron     20   0 1724m 1.6g 1140 S  193 40.1   1:17.07 ompmemtest

Please Note: I could reproduce this issue when compiling with gcc 4.3, 4.4 and 4.6(trunk).

When linking the test program with google's tcmalloc library, the executable doesn't only run ~10% faster, but shows greatly reduced or insignificant memory fragmentation as well:

PID   USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
13441 byron     20   0  379m 334m 1220 R  187  8.4   0:02.63 ompmemtestgoogle                                                                        
13441 byron     20   0 1085m 1.0g 1220 R  194 26.2   0:08.52 ompmemtestgoogle                                                                        
13441 byron     20   0 1111m 1.0g 1220 R  195 26.9   0:14.42 ompmemtestgoogle                                                                        
13441 byron     20   0 1131m 1.1g 1220 R  195 27.4   0:20.30 ompmemtestgoogle                                                                        
13441 byron     20   0 1137m 1.1g 1220 R  195 27.6   0:26.19 ompmemtestgoogle                                                                        
13441 byron     20   0 1137m 1.1g 1220 R  195 27.6   0:32.05 ompmemtestgoogle                                                                        
13441 byron     20   0 1149m 1.1g 1220 R  191 27.9   0:37.81 ompmemtestgoogle                                                                        
13441 byron     20   0 1149m 1.1g 1220 R  194 27.9   0:43.66 ompmemtestgoogle                                                                        
13441 byron     20   0 1161m 1.1g 1220 R  188 28.2   0:49.32 ompmemtestgoogle                                                                        
13441 byron     20   0 1161m 1.1g 1220 R  194 28.2   0:55.15 ompmemtestgoogle                                                                        
13441 byron     20   0 1161m 1.1g 1220 R  191 28.2   1:00.90 ompmemtestgoogle                                                                        
13441 byron     20   0 1161m 1.1g 1220 R  191 28.2   1:06.64 ompmemtestgoogle                                                                        
13441 byron     20   0 1161m 1.1g 1356 R  192 28.2   1:12.42 ompmemtestgoogle

From the data I have, the answer appears to be:

Multithreaded access to the heap can emphasize fragmentation if the employed heap library does not deal well with concurrent access and if the processor fails to execute the threads truly concurrently.

The tcmalloc library shows no significant memory fragmentation running the same program that previously caused ~400MB to be lost in fragmentation.

But why does that happen ?

The best idea I have to offer here is some sort of locking artifact within the heap.

The test program will allocate randomly sized blocks of memory, freeing up blocks allocated early in the program to stay within its memory limit. When one thread is in the process of releasing old memory which is in a heap block on the 'left', it might actually be halted as another thread is scheduled to run, leaving a (soft) lock on that heap block. The newly scheduled thread wants to allocate memory, but may not even read that heap block on the 'left' side to check for free memory as it is currently being changed. Hence it might end up using a new heap block unnecessarily from the 'right'.

This process could look like a heap-block-shifting, where the the first blocks (on the left) remain only sparsely used and fragmented, forcing new blocks to be used on the right.

Lets restate that this fragmentation issue only occurs for me if I use 4 or more threads on a dual core system which can only handle two threads more or less concurrently. When only two threads are used, the (soft) locks on the heap will be held short enough not to block the other thread who wants to allocate memory.

Also, as a disclaimer, I didn't check the actual code of the glibc heap implementation, nor am I anything more than novice in the field of memory allocators - all I wrote is just how it appears to me which makes it pure speculation.

Another interesting read might be the tcmalloc documentation, which states common problems with heaps and multi-threaded access, some of which may have played their role in the test program too.

Its worth noting that it will never return memory to the system (see Caveats paragraph in tcmalloc documentation)

Why is this Java code 6x faster than the identical C# code?

30 votes

UPDATE 2:
Changing the target from x86 to anycpu has lowered the average execution time to 84ms per run, down from 282ms. Maybe I should split this off into a second thread?

UPDATE:
Thanks to Femaref below who pointed out some testing problems, and indeed, after following his suggestions, the times are lower, indicating that VM setup time was significant in Java, but probably not in C#. In C#, it was the debug symbols that were significant.

I updated my code to run each loop 10,000 times, and only output the average ms at the end. The only significant change I made was to the C# version where I switched to the StopWatch class for greater resolution. I stuck with milliseconds because it's good enough.

Results:
The testing changes don't explain why Java is (still) so much faster than C#. C# performance was better, but this can be explained entirely by removing the debug symbols. If you read Mike Two and I's exchange on the comments attached to this OP, you'll see I got ~280ms average in five runs of the C# code just by switching from Debug to Release.

Numbers:

  • A 10,000 count loop of the unmodified Java code gave me an average of 45ms (down from 55ms)
  • A 10,000 count loop of the C# code using the StopWatch class gave me an average of 282ms (down from 320ms)

All of this leaves the difference unexplained. In fact, the differential got worse. Java went from being ~5.8x faster to ~6.2x faster.

(Original post below)

I have a few different solutions to Project Euler problem 5, but the execution time difference between the two languages/platforms in this particular implementation intrigues me. I didn't do any optimization with compiler flags, just plain javac (via commandline) and csc (via Visual Studio).

Here's the Java code. It finishes in 55ms.

public class Problem005b
{
    public static void main(String[] args)
    {
        long begin = System.currentTimeMillis();
        int i = 20;
        while (true)
        {
            if (
                    (i % 19 == 0) &&
                    (i % 18 == 0) &&
                    (i % 17 == 0) &&
                    (i % 16 == 0) &&
                    (i % 15 == 0) &&
                    (i % 14 == 0) &&
                    (i % 13 == 0) &&
                    (i % 12 == 0) &&
                    (i % 11 == 0)
                )
            {
                break;
            }
            i += 20;
        }
        long end = System.currentTimeMillis();
        System.out.println(i);
        System.out.println(end-begin + "ms");
    }   
}
Here is the identical C# code. It finishes in 320ms
using System;

namespace ProjectEuler05
{
    class Problem005
    {
        static void Main(String[] args)
        {
            DateTime begin = DateTime.Now;
            int i = 20;
            while (true)
            {
                if (
                        (i % 19 == 0) &&
                        (i % 18 == 0) &&
                        (i % 17 == 0) &&
                        (i % 16 == 0) &&
                        (i % 15 == 0) &&
                        (i % 14 == 0) &&
                        (i % 13 == 0) &&
                        (i % 12 == 0) &&
                        (i % 11 == 0)
                    )
                    {
                        break;
                    }
                i += 20;
            }
            DateTime end = DateTime.Now;
            TimeSpan elapsed = end - begin;
            Console.WriteLine(i);
            Console.WriteLine(elapsed.TotalMilliseconds + "ms");
        }
    }
}

The key to making these two become closer is to ensure that the comparison is fair.

First of all ensuring that costs associated with running Debug builds, loading pdb symbols as you did.

Next you need to ensure that there are no init costs being counted. Obviously these are real costs, and may matter to some people, but in this instance we are interested in the loop itself.

Next you need to deal with the platform specific behaviour. If you are on a 64bit windows machine you may be running either in 32bit or 64bit mode. In 64bit mode the JIT is different in many respects, often altering the resulting code considerably. Specifically, and I would guess pertinently, you get access to twice as many general purpose registers.

In this case the inner section of the loop, when naively translated into machine code, would need to load into registers the constants used in the modulo tests. If there are insufficient to hold everything needed in the loop then it must push them in from memory. Even coming from level1 cache this would be a significant hit compared to keeping it all in registers.

In VS 2010 MS changed the default target from anycpu to x86. I have nothing like the resources or customer facing knowledge of MSFT so I won't try to second guess that. However anyone looking at anything like the performance analysis you are doing should certainly try both.

Once those disparities are ironed out the numbers seem far more reasonable. Any further differences likely require better than educated guesses, instead they would need investigation into the actual differences in the generated machine code.

There are several things about this I think would be interesting for an optimising compiler.

  • The ones finnw already mentioned:
    • The lcm option interesting but I can't see a compiler writer bothering.
    • the reduction of division to multiplication and masking.
      • I don't know enough about this, but other people have tried note that they call out the divider on the more recent intel chips significantly better.
      • Perhaps you could even arrange something complex, with SSE2.
      • Certainly the modulo 16 operation is ripe for conversion into a mask or shift.
    • A compiler could spot that none of the tests have side effects.
      • it could speculatively try to evaluate several of them at once, on a super scalar processor this could pump things along quite a bit faster, but would depend heavily on how well the compilers layout interacted with the OO execution engine.
    • If register pressure was tight you could implement the constants as a single variable, set at the start of each loop then increment as you go along.

These are all utter guesses, and should be viewed as the idle meanderings. If you want to know disassemble it.

C++ constructor question

30 votes

EDIT: This question came up and I think I aced it! Go StackOverflow!! :D

I have exams coming up, and one of the questions on last years exams was to spot the problem with implementation of the following constructor and to write a corrected one.

Rectangle::Rectangle(string col, int len, int br)
{
    setColour(col);
    length =len;
    breadth=br;
}

The class definitions are as follows:

class Polygon
{
public:
    Polygon(string col="red");
    void printDetails(); // prints colour only
    virtual double getArea()=0;
    void setColour(string col);
private:
    string colour;
};


class Rectangle : public Polygon
{
public:
    Rectangle(string, int, int);
    void printDetails(); // prints colour and area
    // for part 3, delete the line below
    double getArea();
private:
    int length;
    int breadth;
};

I've written the code into the compiler and it runs fine. I'm guessing the question is relating to inheritance, since string colour; is private, but setColour is public so I cant figure it out. Unless its Rectangle::Rectangle(string col, int len, int br):length(len), breadth(br) and then set the colour inside the construcor or something.

Its only worth 3 marks so its not that big a deal if nobody wants to answer, but I figure if I'm going to have a career as a programmer, its in my interest to know as much as possible. ;)

Thanks for any help.

PS, see getArea() in Rectangle. When I remove that it tells me it "cannot instantiate the abstract class". What does that mean?

EDIT: Here's the main:

void main (void)
{
    Rectangle rect1 ("blue",5,6);
    Rectangle *prect2 = new Rectangle("red",5,6);
    rect1.setColour("red");
    rect1.printDetails();
    prect2->printDetails();
}

I don't see anything wrong, though you could make it more efficient by using an initialization list (otherwise your private members of both classes get initialized twice):

Rectangle::Rectangle(string col, int len, int br) 
: Polygon(col), length(len), breadth(br)
{

}

Notice that the initialization list can call the constructor of Polygon as well.

See Why should I prefer to use member initialization list? for a complete description of the advantages of using initialization lists.

Avoiding 'instanceof' in Java

30 votes

I have the following (maybe common) problem and it absolutely puzzles me at the moment:

There are a couple of generated event objects which extends the abstract class Event and I want to divide them to Session Beans, like

public void divideEvent(Event event) {
    if (event instanceof DocumentEvent) {
        documentGenerator.gerenateDocument(event);
    } else if (event instanceof MailEvent) {
        deliveryManager.deliverMail(event);
        ...
    }
    ...

}

But there could be more than two event types in future, so the if-else will be long and maybe unreadable. Additionally I think instanceof is not really "best practice" in this case.

I could add an abstract method to the Event type and have them divide itself but then I have to inject the specific Session Beans within each entity.

Is there any hint to achieve a "pretty" solution for this problem?

Thanks for any help!

The simplest approach is to have the Event provide a method you can call so the Event knows what to do.

interface Event {
    public void onEvent(Context context);
}

class DocumentEvent implements Event {
    public void onEvent(Context context) {
         context.getDocumentGenerator().gerenateDocument(this);
    }
}

class MailEvent implements Event {
    public void onEvent(Context context) {
         context.getDeliveryManager().deliverMail(event);
    }
}


class Context {
    public void divideEvent(Event event) {
        event.onEvent(this);
    }
}

String going crazy if I don't give it a little extra room. Can anyone explain what is happening here?

29 votes

First, I'd like to say that I'm new to C / C++, I'm originally a PHP developer so I am bred to abuse variables any way I like 'em.

C is a strict country, compilers don't like me here very much, I am used to breaking the rules to get things done.

Anyway, this is my simple piece of code:

char IP[15] = "192.168.2.1";
char separator[2] = "||";   

puts( separator );

Output:

||192.168.2.1

But if I change the definition of separator to:

char separator[3] = "||";

I get the desired output:

||

So why did I need to give the man extra space, so he doesn't sleep with the man before him?

That's because you get a not null-terminated string when separator length is forced to 2.

Always remember to allocate an extra character for the null terminator. For a string of length N you need N+1 characters.

Once you violate this requirement any code that expects null-terminated strings (puts() function included) will run into undefined behavior.

Your best bet is to not force any specific length:

char separator[] = "||";

will allocate an array of exactly the right size.

FlagsAttribute Enum problems

29 votes

So I'm building an MSNP (windows live messenger) client. And I've got this list of capabilities

public enum UserCapabilities : long
{
    None = 0,
    MobileOnline = 1 << 0,
    MSN8User = 1 << 1,
    RendersGif = 1 << 2,
    ....
    MsgrVersion7 = 1 << 30,
    MsgrVersion8 = 1 << 31,
    MsgrVersion9 = 1 << 32,
}

full list here http://paste.pocoo.org/show/383240/

The server sends each users capabilities to the client as a long integer, which I take and cast it to UserCapabilities

capabilities = Int64.Parse(e.Command.Args[3]);
user._capabilities = (UserCapabilities)capabilities;

This is fine, and with atleast one user (with a capability value of 1879474220), I can do

Debug.WriteLine(_msgr.GetUser(usr).Capabilities);

and this will output

RendersGif, RendersIsf, SupportsChunking, IsBot, SupportsSChannel, SupportsSipInvite, MsgrVersion5, MsgrVersion6, MsgrVersion7

But with another user, who has the capability value of (3055849760), when I do the same, I just get the same number outputted

3055849760

What I would like to be seeing is a list of capabilities, as it is with the other user.

I'm sure there is a very valid reason for this happening, but no matter how hard I try to phrase the question to Google, I am not finding an answer.

Please help me :)

The definition of the shift operators means that only the 5 least significant bits are used for 32-bit numbers and only the first 6 bits for 64-bit; meaning:

1 << 5

is identical to

1 << 37

(both are 32)

By making it:

MsgrVersion9 = 1L << 32

you make it a 64-bit number, which is why @leppie's fix works; otherwise the << is considered first (and note that 1<<32 is identical to 1<<0, i.e. 1), and then the resulting 1 is converted to a long; so it is still 1.

From §14.8 in the ECMA spec:

For the predefined operators, the number of bits to shift is computed as follows:

  • When the type of x is int or uint, the shift count is given by the low-order five bits of count. In other words, the shift count is computed from count & 0x1F.
  • When the type of x is long or ulong, the shift count is given by the low-order six bits of count. In other words, the shift count is computed from count & 0x3F.

If the resulting shift count is zero, the shift operators simply return the value of x.

Shift operations never cause overflows and produce the same results in checked and unchecked context