Best questions in July 2011

Why is subtracting these two times (in 1927) giving a strange result?

220 votes

If I run the following program, which parses two date strings referencing times one second apart and compares them:

public static void main(String[] args) throws ParseException {
    SimpleDateFormat sf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");  
    String str3 = "1927-12-31 23:54:07";  
    String str4 = "1927-12-31 23:54:08";  
    Date sDt3 = sf.parse(str3);  
    Date sDt4 = sf.parse(str4);  
    long ld3 = sDt3.getTime() /1000;  
    long ld4 = sDt4.getTime() /1000; 
    System.out.println(ld3);  
    System.out.println(ld4);  
    System.out.println(ld4-ld3);
}

The output is:

-1325491905
-1325491552
353

Why is ld4-ld3 not 1 (as I would expect from the one-second difference in the times), but 353?

If I change the dates to times a second later:

String str3 = "1927-12-31 23:54:08";  
String str4 = "1927-12-31 23:54:09";  

Then ld4-ld3 will be 1.


UPDATE

Java version:

java version "1.6.0_22"
Java(TM) SE Runtime Environment (build 1.6.0_22-b04)
Dynamic Code Evolution Client VM (build 0.2-b02-internal, 19.0-b04-internal, mixed mode)

Timezone(TimeZone.getDefault()):

sun.util.calendar.ZoneInfo[id="Asia/Shanghai",
offset=28800000,dstSavings=0,
useDaylight=false,
transitions=19,
lastRule=null]

Locale(Locale.getDefault()): zh_CN

It's a time zone change on December 31st in Shanghai.

See this page for details of 1927 in Shanghai. Basically, at midnight at the end of 1927, the clocks went back 5 minutes and 52 seconds. So "1927-12-31 23:54:08" actually happened twice, and it looks like Java is parsing it as the later possible instant for that local date/time - hence the difference.

Just another episode in the often weird and wonderful world of time zones.
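The arithmetic behind the 353 can be checked without a time zone database. The sketch below assumes, following the explanation above, that Shanghai local time ran at UTC+08:05:52 before the change and at UTC+08:00:00 after it. The two fixed offsets and the helper name `epoch_seconds` are assumptions of this example, not a description of what SimpleDateFormat does internally.

```python
from datetime import datetime, timedelta, timezone

# Assumed offsets: 5 min 52 s ahead of UTC+8 before the change, UTC+8 after it.
BEFORE = timezone(timedelta(hours=8, minutes=5, seconds=52))
AFTER = timezone(timedelta(hours=8))

def epoch_seconds(text, tz):
    """Interpret a local date/time string at a fixed offset; return Unix seconds."""
    local = datetime.strptime(text, "%Y-%m-%d %H:%M:%S").replace(tzinfo=tz)
    return int(local.timestamp())

# 23:54:07 only ever happened under the old offset.
ld3 = epoch_seconds("1927-12-31 23:54:07", BEFORE)
# 23:54:08 happened twice; take the later instant, as Java appears to do.
ld4 = epoch_seconds("1927-12-31 23:54:08", AFTER)

print(ld3)        # -1325491905
print(ld4)        # -1325491552
print(ld4 - ld3)  # 353 = 1 second + the 352 seconds the clocks went back
```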

If strings are immutable in .NET, then why does Substring take O(n) time?

149 votes

Given that strings are immutable in .NET, I'm wondering why they have been designed such that string.Substring() takes O(substring.Length) time, instead of O(1)?

i.e. what were the tradeoffs, if any?

UPDATE: I liked this question so much, I just blogged it. See http://blogs.msdn.com/b/ericlippert/archive/2011/07/19/strings-immutability-and-persistence.aspx


The short answer is: O(n) is O(1) if n does not grow large. Most people extract tiny substrings from tiny strings, so how the complexity grows asymptotically is completely irrelevant.

The long answer is:

An immutable data structure built such that operations on an instance permit re-use of the memory of the original with only a small amount (typically O(1) or O(lg n)) of copying or new allocation is called a "persistent" immutable data structure. Strings in .NET are immutable; your question is essentially "why are they not persistent"?

Because when you look at operations that are typically done on strings in .NET programs, it is in every relevant way hardly worse at all to simply make an entirely new string. The expense and difficulty of building a complex persistent data structure doesn't pay for itself.

People typically use "substring" to extract a short string -- say, ten or twenty characters -- out of a somewhat longer string -- maybe a couple hundred characters. You have a line of text in a comma-separated file and you want to extract the third field, which is a last name. The line will be maybe a couple hundred characters long, the name will be a couple dozen. String allocation and memory copying of fifty bytes is astonishingly fast on modern hardware. That making a new data structure that consists of a pointer to the middle of an existing string plus a length is also astonishingly fast is irrelevant; "fast enough" is by definition fast enough.

The substrings extracted are typically small in size and short in lifetime; the garbage collector is going to reclaim them soon, and they didn't take up much room on the heap in the first place. So using a persistent strategy that encourages reuse of most of the memory is also not a win; all you've done is made your garbage collector get slower because now it has to worry about handling interior pointers.

If the substring operations people typically did on strings were completely different, then it would make sense to go with a persistent approach. If people typically had million-character strings, and were extracting thousands of overlapping substrings with sizes in the hundred-thousand-character range, and those substrings lived a long time on the heap, then it would make perfect sense to go with a persistent substring approach; it would be wasteful and foolish not to. But most line-of-business programmers do not do anything even vaguely like those sorts of things. .NET is not a platform that is tailored for the needs of the Human Genome Project; DNA analysis programmers have to solve problems with those string usage characteristics every day; odds are good that you do not. The few who do build their own persistent data structures that closely match their usage scenarios.

For example, my team writes programs that do on-the-fly analysis of C# and VB code as you type it. Some of those code files are enormous and thus we cannot be doing O(n) string manipulation to extract substrings or insert or delete characters. We have built a bunch of persistent immutable data structures for representing edits to a text buffer that permit us to quickly and efficiently re-use the bulk of the existing string data and the existing lexical and syntactic analyses upon a typical edit. This was a hard problem to solve and its solution was narrowly tailored to the specific domain of C# and VB code editing. It would be unrealistic to expect the built-in string type to solve this problem for us.
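The trade-off between copying and sharing can be seen in miniature in Python (an assumption of this example, since the answer is about .NET). Slicing a `bytes` object copies the selected bytes, while slicing a `memoryview` creates a small view that shares the original buffer. The CSV line is made up for illustration.

```python
line = b"1,Ada,Lovelace,London"  # hypothetical CSV line

copied = line[6:14]              # new bytes object: copies 8 bytes
view = memoryview(line)[6:14]    # view onto the same buffer: nothing is copied

print(copied)                    # b'Lovelace'
print(bytes(view))               # b'Lovelace'
print(view.obj is line)          # True: the view refers back to the whole line
```

The view is cheap to create, but it keeps the full original buffer alive for as long as the view exists. For a short-lived field pulled from a short line, the plain copy is the simpler choice.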

Why does Google +1 record my mouse movements?

129 votes

This is only on pages with a Google +1 box on my website:

It seems to be firing off an event on every mouse move. Anyone know what it is doing? I searched on Google (perhaps I should try Bing for once on this one!) but no one seems to have written about it. Is it recording information about my visitors' browsing habits? Is it some sort of CAPTCHA to detect human-like behaviour?

Example URL: press F12 in Chrome, go to Timeline and press Record, then move your mouse around this page (it plus-ones this question, don't worry):

https://plusone.google.com/u/0/_/+1/button?hl=en-US&jsh=r%3Bgc%2F22224365-adc8a19e#url=http://stackoverflow.com/questions/6667544/google-1-recording-mouse-move&size=tall&count=true&id=I1_1310488711647&parent=https://plusone.google.com/u/0/_/+1/button?hl=en-US&jsh=r%3Bgc%2F22224365-adc8a19e#url=http://stackoverflow.com/questions/6667544/google-1-recording-mouse-move&size=tall&count=true&id=I1_1310488711647

For what it's worth (I can see this is going to be a popular question), I don't think there is anything sinister behind it, it might even be a useless artifact/bug, but if it is doing some sort of tracking, well, it seems a little deceptive to me.

Google +1 privacy policy

http://www.google.com/intl/en/privacy/plusone/

Google +1 Button Privacy Policy

June 28, 2011

The Google Privacy Policy describes how we treat personal information when you use Google’s products and services, including information provided when you use the Google +1 button. In addition, the following describes our additional privacy practices specific to your use of the +1 button.

Information we collect and how it is shared

The Google +1 button is a way for you to share information publicly with the world. The Google +1 button helps you and others receive personalized content from Google and our partners. The fact that you +1’d something will be recorded by Google, along with information about the page you were viewing when you clicked on the +1 button. Your +1’s may appear to others as an annotation with your profile name and photo in Google services (such as in search results or on your Google Profile) or elsewhere on websites and ads on the Internet.

We will record information about your +1 activity in order to provide you and other users with a better experience on Google services.

In order to use the Google +1 button, you need to have a public Google Profile visible to the world, which at a minimum includes the name you chose for the profile. That name will be used across Google services and in some cases it may replace another name you’ve used when sharing content under your Google Account. We may display your Google Profile identity to people who have your email address or other identifying information.

Use of the collected information

In addition to the above-described uses, the information you provide to us is used subject to our main Google Privacy Policy.

We may share aggregate statistics related to users’ +1 activity with the public, our users, and partners, such as publishers, advertisers, or connected sites. For example, we may tell a publisher that “10% of the people who +1’d this page are in Tacoma, Washington.”

Your choices

You may view the list of items you have +1’d on the +1 tab on your Profile. You can remove individual items from that list.

You may opt out of seeing +1 recommendations on third-party websites (including on ads on third-party sites) from people you know.

We will store data (such as your recent +1’s) locally in your browser. You may be able to access and clear this information in your browser settings.

More information

Google adheres to the U.S. Safe Harbor privacy principles. For more information about the Safe Harbor framework or our registration, see the Department of Commerce’s website.

Edit

Adding a 500 rep bounty if anyone can work out why and/or what they are collecting.

It appears to be seeding a random number generator with your mouse movements.

The mouse move handler itself does something along the lines of the following:

var b = ((event.X << 16) + event.Y) * (new Date().getTime() % 1000000);
c = c * b % d;
if (previousMouseMoveHandler) previousMouseMoveHandler.call(arguments);

d is (screen.width * screen.width + screen.height) * 1000000, and c is a variable that starts out as 1.

All of this is wrapped in the scope of an anonymous function, which itself is immediately evaluated to return a function that is assigned to a property named "random". That returned function looks something like this:

var b = c;
b += parseInt(hash.substr(0,20), 16);
hash = MD5(hash);
return b / (d + Math.pow(16, 20));

hash, BTW, is a variable that starts out as the MD5 hash of the page's cookies, location, the new Date().getTime(), and Math.random().

(Note, of course, that Google may change the script returned at any time and hence invalidate this analysis)

Is 0 a decimal literal or an octal literal?

104 votes

Zero is always zero, so it doesn't matter. But in a recent discussion with my friend he said that octal literals are almost unused today. Then it dawned upon me that actually almost all integer literals in my code are octal, namely 0. Is 0 an octal literal according to the C++ grammar? I'm just curious what the standard says.

Yes, 0 is an octal literal in C++.

As per the C++ Standard:

2.14.2 Integer literals [lex.icon]

integer-literal:  
    decimal-literal integer-suffix_opt
    octal-literal integer-suffix_opt
    hexadecimal-literal integer-suffix_opt
decimal-literal:  
    nonzero-digit  
    decimal-literal digit  
octal-literal:  
    0                           <--------------------<Here>
    octal-literal octal-digit

Convert RGB-->RGBA

93 votes

I have a hex color, e.g. #F4F8FB (or rgb(244, 248, 251)) that I want converted into an as-transparent-as-possible rgba color (when displayed over white). Make sense? I'm looking for an algorithm, or at least idea of an algorithm for how to do so.

For Example:

rgb( 128, 128, 255 ) --> rgba(   0,   0, 255,  .5 )
rgb( 152, 177, 202 ) --> rgba(  50, 100, 150,  .5 ) // can be better(lower alpha)

Ideas?


FYI solution based on Guffa's answer:

function RGBtoRGBA(r, g, b){
    if((g==void 0) && (typeof r == 'string')){
        r = r.replace(/^\s*#|\s*$/g, '');
        if(r.length == 3){
            r = r.replace(/(.)/g, '$1$1');
        }
        g = parseInt(r.substr(2, 2), 16);
        b = parseInt(r.substr(4, 2), 16);
        r = parseInt(r.substr(0, 2), 16);
    }

    var min, a = ( 255 - (min = Math.min(r, g, b)) ) / 255;

    return {
        r    : r = 0|( r - min ) / a,
        g    : g = 0|( g - min ) / a,
        b    : b = 0|( b - min ) / a,
        a    : a = (0|1000*a)/1000,
        rgba : 'rgba(' + r + ', ' + g + ', ' + b + ', ' + a + ')'
    };
}

RGBtoRGBA(204, 153, 102) == RGBtoRGBA('#CC9966') == RGBtoRGBA('C96') == 
    {
       r    : 170,
       g    : 85 ,
       b    : 0  ,
       a    : 0.6,
       rgba : 'rgba(170, 85, 0, 0.6)' 
    }

Take the lowest color component, and convert that to an alpha value. Then scale the color components by subtracting the lowest, and dividing by the alpha value.

Example:

152 converts to an alpha value of (255 - 152) / 255 ~ 0.404

152 scales using (152 - 152) / 0.404 = 0
177 scales using (177 - 152) / 0.404 ~ 62
202 scales using (202 - 152) / 0.404 ~ 124

So, rgb(152, 177, 202) displays as rgba(0, 62, 124, .404).

I have verified in Photoshop that the colors actually match perfectly.
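The steps above can be sketched in Python as follows. The function names `rgb_to_rgba` and `over_white` are assumptions of this example. The second function composites the result back over a white background, which gives a quick check that the original colour is recovered.

```python
def rgb_to_rgba(r, g, b):
    """Most transparent RGBA colour that looks like (r, g, b) over white."""
    lowest = min(r, g, b)
    alpha = (255 - lowest) / 255          # lowest component becomes the alpha
    if alpha == 0:                        # pure white: fully transparent
        return (255, 255, 255, 0.0)
    # Subtract the lowest component, then divide by alpha.
    return tuple(round((c - lowest) / alpha) for c in (r, g, b)) + (alpha,)

def over_white(r, g, b, a):
    """Blend an RGBA colour over a white background."""
    return tuple(round(c * a + 255 * (1 - a)) for c in (r, g, b))

rgba = rgb_to_rgba(152, 177, 202)
print(rgba[:3], round(rgba[3], 3))
print(over_white(*rgba))                  # (152, 177, 202)
```

This sketch rounds to the nearest integer, whereas the JavaScript version above truncates with `0|`, so individual components can differ by one between the two.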

Are "while(true)" loops so bad?

83 votes

I've been programming in Java for several years now, but I just recently returned to school to get a formal degree. I was quite surprised to learn that, on my last assignment, I lost points for using a loop like the one below.

do{
     //get some input.
     //if the input meets my conditions, break;
     //Otherwise ask again.
} while(true);

Now for my test I'm just scanning for some console input, but I was told that this kind of loop is discouraged because using break is akin to goto, we just don't do it.

I understand fully the pitfalls of goto and its Java cousin break:label, and I have the good sense not to use them. I also realize that a more complete program would provide some other means of escape, say for instance to just end the program, but that wasn't a reason my professor cited, so...

What's wrong with do-while(true)?

AFAIK nothing, really. Teachers are just allergic to goto, because they heard somewhere it's really bad. Otherwise you would just write

bool guard = true;
do
{
   getInput();
   if (something)
     guard = false;
} while (guard);

which is actually exactly the same thing. (EDIT well, almost...)

Maybe cleaner (because all the looping info is contained at the top of the block) is

for (bool endLoop = false;!endLoop;)
{

}

Is 'switch' faster than 'if'?

73 votes

Is a switch statement actually faster than an if statement?

I ran the code below on Visual Studio 2010's x64 C++ compiler with the /Ox flag:

#include <stdlib.h>
#include <stdio.h>
#include <time.h>

#define MAX_COUNT (1 << 29)
size_t counter = 0;

size_t testSwitch()
{
    clock_t start = clock();
    size_t i;
    for (i = 0; i < MAX_COUNT; i++)
    {
        switch (counter % 4 + 1)
        {
            case 1: counter += 4; break;
            case 2: counter += 3; break;
            case 3: counter += 2; break;
            case 4: counter += 1; break;
        }
    }
    return 1000 * (clock() - start) / CLOCKS_PER_SEC;
}

size_t testIf()
{
    clock_t start = clock();
    size_t i;
    for (i = 0; i < MAX_COUNT; i++)
    {
        const size_t c = counter % 4 + 1;
        if (c == 1) { counter += 4; }
        else if (c == 2) { counter += 3; }
        else if (c == 3) { counter += 2; }
        else if (c == 4) { counter += 1; }
    }
    return 1000 * (clock() - start) / CLOCKS_PER_SEC;
}

int main()
{
    printf("Starting...\n");
    printf("Switch statement: %u ms\n", testSwitch());
    printf("If     statement: %u ms\n", testIf());
}

and got these results:

Switch statement: 5261 ms
If statement: 5196 ms

From what I've learned, switch statements apparently use jump tables to optimize the branching.

Questions:

  1. What would a basic jump table look like, in x86 or x64?

  2. Is this code using a jump table?

  3. Why is there no performance difference in this example? Is there any situation in which there is a significant performance difference?


Disassembly of the code:

testIf:

13FE81B10 sub  rsp,48h 
13FE81B14 call qword ptr [__imp_clock (13FE81128h)] 
13FE81B1A mov  dword ptr [start],eax 
13FE81B1E mov  qword ptr [i],0 
13FE81B27 jmp  testIf+26h (13FE81B36h) 
13FE81B29 mov  rax,qword ptr [i] 
13FE81B2E inc  rax  
13FE81B31 mov  qword ptr [i],rax 
13FE81B36 cmp  qword ptr [i],20000000h 
13FE81B3F jae  testIf+0C3h (13FE81BD3h) 
13FE81B45 xor  edx,edx 
13FE81B47 mov  rax,qword ptr [counter (13FE835D0h)] 
13FE81B4E mov  ecx,4 
13FE81B53 div  rax,rcx 
13FE81B56 mov  rax,rdx 
13FE81B59 inc  rax  
13FE81B5C mov  qword ptr [c],rax 
13FE81B61 cmp  qword ptr [c],1 
13FE81B67 jne  testIf+6Dh (13FE81B7Dh) 
13FE81B69 mov  rax,qword ptr [counter (13FE835D0h)] 
13FE81B70 add  rax,4 
13FE81B74 mov  qword ptr [counter (13FE835D0h)],rax 
13FE81B7B jmp  testIf+0BEh (13FE81BCEh) 
13FE81B7D cmp  qword ptr [c],2 
13FE81B83 jne  testIf+89h (13FE81B99h) 
13FE81B85 mov  rax,qword ptr [counter (13FE835D0h)] 
13FE81B8C add  rax,3 
13FE81B90 mov  qword ptr [counter (13FE835D0h)],rax 
13FE81B97 jmp  testIf+0BEh (13FE81BCEh) 
13FE81B99 cmp  qword ptr [c],3 
13FE81B9F jne  testIf+0A5h (13FE81BB5h) 
13FE81BA1 mov  rax,qword ptr [counter (13FE835D0h)] 
13FE81BA8 add  rax,2 
13FE81BAC mov  qword ptr [counter (13FE835D0h)],rax 
13FE81BB3 jmp  testIf+0BEh (13FE81BCEh) 
13FE81BB5 cmp  qword ptr [c],4 
13FE81BBB jne  testIf+0BEh (13FE81BCEh) 
13FE81BBD mov  rax,qword ptr [counter (13FE835D0h)] 
13FE81BC4 inc  rax  
13FE81BC7 mov  qword ptr [counter (13FE835D0h)],rax 
13FE81BCE jmp  testIf+19h (13FE81B29h) 
13FE81BD3 call qword ptr [__imp_clock (13FE81128h)] 
13FE81BD9 sub  eax,dword ptr [start] 
13FE81BDD imul eax,eax,3E8h 
13FE81BE3 cdq       
13FE81BE4 mov  ecx,3E8h 
13FE81BE9 idiv eax,ecx 
13FE81BEB cdqe      
13FE81BED add  rsp,48h 
13FE81BF1 ret       

testSwitch:

13FE81C00 sub  rsp,48h 
13FE81C04 call qword ptr [__imp_clock (13FE81128h)] 
13FE81C0A mov  dword ptr [start],eax 
13FE81C0E mov  qword ptr [i],0 
13FE81C17 jmp  testSwitch+26h (13FE81C26h) 
13FE81C19 mov  rax,qword ptr [i] 
13FE81C1E inc  rax  
13FE81C21 mov  qword ptr [i],rax 
13FE81C26 cmp  qword ptr [i],20000000h 
13FE81C2F jae  testSwitch+0C5h (13FE81CC5h) 
13FE81C35 xor  edx,edx 
13FE81C37 mov  rax,qword ptr [counter (13FE835D0h)] 
13FE81C3E mov  ecx,4 
13FE81C43 div  rax,rcx 
13FE81C46 mov  rax,rdx 
13FE81C49 inc  rax  
13FE81C4C mov  qword ptr [rsp+30h],rax 
13FE81C51 cmp  qword ptr [rsp+30h],1 
13FE81C57 je   testSwitch+73h (13FE81C73h) 
13FE81C59 cmp  qword ptr [rsp+30h],2 
13FE81C5F je   testSwitch+87h (13FE81C87h) 
13FE81C61 cmp  qword ptr [rsp+30h],3 
13FE81C67 je   testSwitch+9Bh (13FE81C9Bh) 
13FE81C69 cmp  qword ptr [rsp+30h],4 
13FE81C6F je   testSwitch+0AFh (13FE81CAFh) 
13FE81C71 jmp  testSwitch+0C0h (13FE81CC0h) 
13FE81C73 mov  rax,qword ptr [counter (13FE835D0h)] 
13FE81C7A add  rax,4 
13FE81C7E mov  qword ptr [counter (13FE835D0h)],rax 
13FE81C85 jmp  testSwitch+0C0h (13FE81CC0h) 
13FE81C87 mov  rax,qword ptr [counter (13FE835D0h)] 
13FE81C8E add  rax,3 
13FE81C92 mov  qword ptr [counter (13FE835D0h)],rax 
13FE81C99 jmp  testSwitch+0C0h (13FE81CC0h) 
13FE81C9B mov  rax,qword ptr [counter (13FE835D0h)] 
13FE81CA2 add  rax,2 
13FE81CA6 mov  qword ptr [counter (13FE835D0h)],rax 
13FE81CAD jmp  testSwitch+0C0h (13FE81CC0h) 
13FE81CAF mov  rax,qword ptr [counter (13FE835D0h)] 
13FE81CB6 inc  rax  
13FE81CB9 mov  qword ptr [counter (13FE835D0h)],rax 
13FE81CC0 jmp  testSwitch+19h (13FE81C19h) 
13FE81CC5 call qword ptr [__imp_clock (13FE81128h)] 
13FE81CCB sub  eax,dword ptr [start] 
13FE81CCF imul eax,eax,3E8h 
13FE81CD5 cdq       
13FE81CD6 mov  ecx,3E8h 
13FE81CDB idiv eax,ecx 
13FE81CDD cdqe      
13FE81CDF add  rsp,48h 
13FE81CE3 ret       

Update:

Interesting results here and here. Not sure why one is faster and one is slower, though.

There are several optimizations a compiler can make on a switch. I don't think the oft-mentioned "jump-table" is a very useful one though, as it only works when the input can be bounded some way.

In C pseudocode, a "jump table" would index an array of code addresses by the switch value and jump to the selected entry. Note that the compiler in practice would need to insert some form of if test around the table to ensure that the input is valid for the table. Note also that it only works in the specific case that the input is a run of consecutive numbers.
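The idea can be sketched in Python, with a list of functions standing in for the table of code addresses a compiler would emit. The names `JUMP_TABLE`, `switch` and `case_1` to `case_4` are assumptions of this example. The case bodies mirror the benchmark in the question.

```python
# Each "case" body becomes one entry in a table indexed by the switch value.
def case_1(counter):
    return counter + 4

def case_2(counter):
    return counter + 3

def case_3(counter):
    return counter + 2

def case_4(counter):
    return counter + 1

JUMP_TABLE = [case_1, case_2, case_3, case_4]  # slot 0 holds case 1

def switch(value, counter):
    index = value - 1
    if 0 <= index < len(JUMP_TABLE):       # the range test the compiler must add
        return JUMP_TABLE[index](counter)  # one indexed jump, no per-case compares
    return counter                         # no matching case: nothing happens

counter = 0
for _ in range(8):
    counter = switch(counter % 4 + 1, counter)
print(counter)  # 32
```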

Moreover, on modern CPUs, the cache locality cost of storing the jump table may often be greater than the elided IF tests.

If the number of branches in a switch is extremely large, a compiler can do things like using binary search on the values of the switch, which (in my mind) would be a much more useful optimization, as it does significantly increase performance in some scenarios, is as general as a switch is, and does not result in greater generated code size. But your test code would need a LOT more branches for that to make any difference.

To answer your specific questions:

  1. I don't know x86 assembler, sorry. :(
  2. I can say that it is not using a jump table -- 4 comparison instructions are clearly visible:

    13FE81C51 cmp  qword ptr [rsp+30h],1 
    13FE81C57 je   testSwitch+73h (13FE81C73h) 
    13FE81C59 cmp  qword ptr [rsp+30h],2 
    13FE81C5F je   testSwitch+87h (13FE81C87h) 
    13FE81C61 cmp  qword ptr [rsp+30h],3 
    13FE81C67 je   testSwitch+9Bh (13FE81C9Bh) 
    13FE81C69 cmp  qword ptr [rsp+30h],4 
    13FE81C6F je   testSwitch+0AFh (13FE81CAFh) 
    

    A jump table based solution does not compare against each case value in turn; at most it needs a single range check.

  3. Either not enough branches to cause the compiler to generate a jump table, or your compiler simply doesn't generate them. I'm not sure which.

When I `throw` something, where is it stored in memory?

66 votes

I understand that when something is thrown, the stack is 'unwound' to the point where it is caught, and the destructors of class instances on the stack in each function context are run (which is why you should not throw an exception from a destructor - you could end up throwing a second one)...but I wonder where in memory the object that I have thrown is stored while this happens?

Is it implementation dependent? If so, is there a particular method used by most popular compilers?

Yes, the answer is compiler-dependent.

A quick experiment with my compiler (g++ 4.4.3) reveals that its runtime library first tries to malloc memory for the exception and, failing that, attempts to allocate space within a process-wide "emergency buffer" that lives in the data segment. If that doesn't work out, it calls std::terminate().

It would appear that the main purpose of the emergency buffer is to be able to throw std::bad_alloc after the process has run out of heap space (in which case the malloc call would fail).

The relevant function is __cxa_allocate_exception:

extern "C" void *
__cxxabiv1::__cxa_allocate_exception(std::size_t thrown_size) throw()
{
  void *ret;

  thrown_size += sizeof (__cxa_refcounted_exception);
  ret = malloc (thrown_size);

  if (! ret)
    {
      __gnu_cxx::__scoped_lock sentry(emergency_mutex);

      bitmask_type used = emergency_used;
      unsigned int which = 0;

      if (thrown_size > EMERGENCY_OBJ_SIZE)
        goto failed;
      while (used & 1)
        {
          used >>= 1;
          if (++which >= EMERGENCY_OBJ_COUNT)
            goto failed;
        }

      emergency_used |= (bitmask_type)1 << which;
      ret = &emergency_buffer[which][0];

    failed:;

      if (!ret)
        std::terminate ();
    }

  // We have an uncaught exception as soon as we allocate memory.  This
  // yields uncaught_exception() true during the copy-constructor that
  // initializes the exception object.  See Issue 475.
  __cxa_eh_globals *globals = __cxa_get_globals ();
  globals->uncaughtExceptions += 1;

  memset (ret, 0, sizeof (__cxa_refcounted_exception));

  return (void *)((char *)ret + sizeof (__cxa_refcounted_exception));
}

I don't know how typical this scheme is.

Why is "int i = 2147483647 + 1;" OK, but "byte b = 127 + 1;" is not compilable?

66 votes

Why is int i = 2147483647 + 1; OK, but byte b = 127 + 1; is not compilable?

Constants are evaluated as ints, so 2147483647 + 1 overflows and wraps around to another int value (-2147483648), which is assignable to int. 127 + 1 is also evaluated as an int and equals 128, which is outside the range of byte (-128 to 127) and so is not assignable to byte.
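Python integers never overflow, so the sketch below emulates the two Java rules at play: constant expressions are computed in 32-bit int arithmetic that wraps around, and a constant may only be assigned to a byte if its value fits in -128..127. The helper names `wrap_int32` and `fits_in_byte` are hypothetical.

```python
def wrap_int32(value):
    """Reduce an integer to Java's 32-bit two's-complement int range."""
    return (value + 2**31) % 2**32 - 2**31

def fits_in_byte(value):
    """True if the value is within Java's byte range."""
    return -128 <= value <= 127

i = wrap_int32(2147483647 + 1)
print(i)                       # -2147483648: still a valid int, so it compiles
print(fits_in_byte(127 + 1))   # False: 128 is an int that does not fit in a byte
```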

What is a good choice of database for a small .NET application?

65 votes

I'm developing a small application with C# in .NET and I want a small local database alongside it where I can save and retrieve records with SQL queries. I don't need anything powerful, just something to use instead of keeping records in files like .txt. What would you suggest? Thanks. P.S. I already tried to use .mdf and .sdf, but with no success. Do you think they would work well too? How?

You have a couple of immediately recognisable and free options:

  • SQL Server Compact
  • SQLite

The SQL Server Compact download comes with the ADO.NET provider that you will need to reference in code. The SQLite download might not have it so here is a link:

http://sqlite.phxsoftware.com/

They both use SQL, though likely with a few limitations / quirks. Management Studio works with Compact, whereas with SQLite you will need another UI tool such as SQLite Administrator:

http://sqliteadmin.orbmu2k.de/

There are NoSQL alternatives, such as:

  • Sterling
  • RavenDb

Personally I would avoid using MS Access in the face of other free options. You cannot go wrong with either Compact or SQLite, they are both lovely small databases that run relatively quickly in little RAM - personal preference as to the religious aspects about liking a Microsoft product I suppose :-)

I use Sterling for Windows Phone programming as it is built to use Isolated Storage. I have only seen articles on RavenDb, but I can tell you that it is a JSON based document storage framework.

Not to confuse the situation (go with SQLite or SQL Server Compact), but there are other embedded / local databases out there, some are relational others are object-oriented:

Not all of these are free. SQL / LINQ / in-proc support differs across them all. This list is just for curiosity.

There is now also Karvonite, however the code gallery link is broken. When it's live again I'll be looking into this one for WP7 development.

62 votes

This was a question I was asked at my recent interview and I want to know (I don't actually remember the theory of the numerical analysis, so please help me :)

If we have some function, which accumulates floating-point numbers:

std::accumulate(v.begin(), v.end(), 0.0);

v is a std::vector<float>, for example.

  • Would it be better to sort these numbers before accumulating them?

  • In which order would the answer be the most precise?

I suspect that sorting the numbers in ascending order would actually reduce the numerical error, but unfortunately I can't prove it myself.

P.S. I do realize this probably has nothing to do with real world programming, just being curious.

You're right, sort in ascending order (of magnitude). Consider the case where we're adding single-precision (32 bit) floats, and there are 1 billion values equal to 1 / (1 billion), and one value equal to 1. If the 1 comes first, then the sum will come to 1, since 1 + (1 / 1 billion) is 1 due to loss of precision. Each addition has no effect at all on the total.

If the small values come first, they will at least sum to something, although even then I have 2^30 of them, whereas after 2^25 or so I'm back in the situation where each one individually isn't affecting the total any more. So I'm still going to need more tricks.

That's an extreme case, but in general adding two values of similar magnitude is more accurate than adding two values of very different magnitudes, since you "discard" fewer bits of precision in the smaller value that way. By sorting the numbers, you group values of similar magnitude together, and by adding them in ascending order you give the small values a "chance" of cumulatively reaching the magnitude of the bigger numbers.

In fact, you can play a lot more tricks than just adding them in sorted order. If you have lots of very small values, a middle number of middling values, and a small number of large values, then it might be most accurate to first add up all the small ones, then separately total the middling ones, add those two totals together then add the large ones. It's not at all trivial to find the most accurate combination of floating-point additions, but to cope with really bad cases you can keep a whole array of running totals at different magnitudes, add each new value to the total that best matches its magnitude, and when a running total starts to get too big for its magnitude, add it into the next total up and start a new one. Taken to its logical extreme, this process is equivalent to performing the sum in an arbitrary-precision type (so you'd do that). But given the simplistic choice of adding in ascending or descending order of magnitude, ascending is the better bet.

It does have some relation to real-world programming, since there are some cases where your calculation can go very badly wrong if you accidentally chop off a "heavy" tail consisting of a large number of values each of which is too small to individually affect the sum, or if you throw away too much precision from a lot of small values that individually only affect the last few bits of the sum. You probably don't care if the tail is negligible anyway, for example if you're only adding together a small number of values in the first place.
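A small experiment shows the effect of ordering. Python floats are double precision, so this sketch rounds every intermediate result to single precision through `struct`. The helpers `f32` and `sum_f32` are assumptions of this example, as are the values: one large number (2^24) and a thousand 1s.

```python
import struct

def f32(x):
    """Round a Python float to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def sum_f32(values):
    """Accumulate left to right, rounding to single precision after each add."""
    total = 0.0
    for v in values:
        total = f32(total + f32(v))
    return total

values = [2.0**24] + [1.0] * 1000

descending = sum_f32(sorted(values, reverse=True))
ascending = sum_f32(sorted(values))

print(descending)  # 16777216.0: each 1 is lost against the large total
print(ascending)   # 16778216.0: the 1s add up to 1000 first, then survive
```

With the large value first, every later addition of 1 rounds back to the same total. With the small values first, their combined sum is big enough to register.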

Standard use of 'Z' instead of NULL to represent missing data?

59 votes

Outside of the argument of whether or not NULLs should ever be used: I am responsible for an existing database that uses NULL to mean "missing or never entered" data. It is different from empty string, which means "a user set this value, and they selected 'empty'."

Another contractor on the project is firmly on the "NULLs do not exist for me; I never use NULL and nobody else should, either" side of the argument. However, what confuses me is that since the contractor's team DOES acknowledge the difference between "missing/never entered" and "intentionally empty or indicated by the user as unknown," they use a single character 'Z' throughout their code and stored procedures to represent "missing/never entered" with the same meaning as NULL throughout the rest of the database.

Although our shared customer has asked for this to be changed, and I have supported this request, the team cites this as "standard practice" among DBAs far more advanced than I; they are reluctant to change to use NULLs based on my ignorant request alone. So, can anyone help me overcome my ignorance? Is there any standard, or small group of individuals, or even a single loud voice among SQL experts which advocates the use of 'Z' in place of NULL?

Update

I have a response from the contractor to add. Here's what he said when the customer asked for the special values to be removed to allow NULL in columns with no data:

Basically, I designed the database to avoid NULLs whenever possible. Here is the rationale:

A NULL in a string [VARCHAR] field is never necessary because an empty (zero-length) string furnishes exactly the same information.

A NULL in an integer field (e.g., an ID value) can be handled by using a value that would never occur in the data (e.g, -1 for an integer IDENTITY field).

A NULL in a date field can easily cause complications in date calculations. For example, in logic that computes date differences, such as the difference in days between a [RecoveryDate] and an [OnsetDate], the logic will blow up if one or both dates are NULL -- unless an explicit allowance is made for both dates being NULL. That's extra work and extra handling. If "default" or "placeholder" dates are used for [RecoveryDate] and [OnsetDate] (e.g., "1/1/1900") , mathematical calculations might show "unusual" values -- but date logic will not blow up.

NULL handling has traditionally been an area where developers make mistakes in stored procedures.

In my 15 years as a DBA, I've found it best to avoid NULLs wherever possible.

This seems to validate the mostly negative reaction to this question. Instead of applying an accepted 6NF approach to designing out NULLs, special values are used to "avoid NULLs wherever possible." I posted this question with an open mind, and I am glad I learned more about the "NULLs are useful / NULLs are evil" debate, but I am now quite comfortable labeling the 'special values' approach as complete nonsense.

an empty (zero-length) string furnishes exactly the same information.

No, it doesn't; in the existing database we are modifying, NULL means "never entered" and empty string means "entered as empty".

NULL handling has traditionally been an area where developers make mistakes in stored procedures.

Yes, but those mistakes have been made thousands of times by thousands of developers, and the lessons and caveats for avoiding those mistakes are known and documented. As has been mentioned here: whether you accept or reject NULLs, representation of missing values is a solved problem. There is no need to invent a new solution just because developers continue to make easy-to-overcome (and easy-to-identify) mistakes.


As a footnote: I have been a DBE and developer for more than 20 years (which is certainly enough time for me to know the difference between a database engineer and a database administrator). Throughout my career I have always been in the "NULLs are useful" camp, though I was aware that several very smart people disagreed. I was extremely skeptical about the "special values" approach, but not well-versed enough in the academics of "How To Avoid NULL the Right Way" to make a firm stand. I always love learning new things, and I still have lots to learn after 20 years. Thanks to all who contributed to make this a useful discussion.

Sack your contractor.

Okay, seriously, this isn't standard practice. This can be seen simply because all RDBMS that I have ever worked with implement NULL, logic for NULL, take account of NULL in foreign keys, have different behaviour for NULL in COUNT, etc, etc.

I would actually contend that using 'Z' or any other placeholder is worse. You still require code to check for 'Z'. But you also need to document that 'Z' doesn't mean 'Z', it means something else. And you have to ensure that such documentation is read. And then what happens if 'Z' ever becomes a valid piece of data? (Such as a field for an initial?)
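To make that last point concrete, here is a minimal Java sketch. The class, the method names, and the "middle initial" field are hypothetical and are not taken from the database under discussion; `Optional` merely stands in for a nullable column. With 'Z' as the "never entered" marker, a person whose middle initial really is Z cannot be told apart from one whose initial is missing, whereas an explicit "no value" representation keeps the two cases separate.

```java
import java.util.Optional;

public class SentinelSketch {
    // Hypothetical convention: 'Z' means "never entered".
    public static final char MISSING = 'Z';

    // Sentinel style: a real 'Z' is indistinguishable from "missing".
    public static boolean isMissingSentinel(char middleInitial) {
        return middleInitial == MISSING;
    }

    // NULL-like style: absence is represented separately from any value.
    public static boolean isMissingOptional(Optional<Character> middleInitial) {
        return !middleInitial.isPresent();
    }

    public static void main(String[] args) {
        char realZ = 'Z'; // e.g. a middle name that starts with Z
        System.out.println(isMissingSentinel(realZ));              // true: real data misread as missing
        System.out.println(isMissingOptional(Optional.of(realZ))); // false: the value is present
        System.out.println(isMissingOptional(Optional.empty()));   // true: genuinely missing
    }
}
```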

At a basic level, even without debating the validity of NULL vs 'Z', I would insist that the contractor conform to the standard practices that exist within your company, not his. Instituting his standard practice in an environment with an alternative standard practice will cause confusion, maintenance overheads, misunderstanding, and in the end increased costs and mistakes.


EDIT

There are cases where using an alternative to NULL is valid in my opinion. But only where doing so reduces code, rather than creating special cases which require accounting for.

I've used that for date-bound data, for example. If data is valid between a start-date and an end-date, code can be simplified by not having NULL values. Instead a NULL start-date could be replaced with '01 Jan 1900' and a NULL end-date could be replaced with '31 Dec 2079'.

This still can change behaviour from what may be expected, and so should be used with care:

  • WHERE end-date IS NULL no longer gives data that is still valid
  • You just created your own millennium bug
  • etc.

This is equivalent to reforming abstractions such that all properties can always have valid values. It is markedly different from implicitly encoding specific meaning into arbitrarily chosen values.

Still, sack the contractor.

Is it OK not to handle returned value of a C# method? What is good practice in this example?

50 votes

Out of curiosity... what happens when we call a method that returns some value but we don't handle/use it? We also expect that this returned value could sometimes be really big. Where does that value go? Is it even created? If it is, are there any performance issues or other problems that can occur? (What is the best practice in this kind of situation?)

Let's say we have a method that does some database operations (insert, update) and returns some data in a DataTable object. I also know that this DataTable object could be really big sometimes:

public static DataTable InsertIntoDB(...) 
{
      // executing db command, getting values, creating & returning DataTable object...
      ...
      return myDataTable;
}

And then when this method is used, it is called like this:

DataTable myDataTable = InsertIntoDB(...);
// this DataTable object is handled in some way

But sometimes simply like this:

InsertIntoDB(...);
// returned value not handled; Problem???

My first thought is that the system is smart enough to see that the returned value is ignored, so this does not cause any problems (the value is simply released), but I want to be sure and hear a more detailed explanation from someone who is more experienced in this area than I am.

The returned value (or reference, if it's a reference type) is pushed onto the stack and then popped off again.

No biggy.

If the return value isn't relevant, you can safely do this.

But be sure that it isn't relevant, just in case.

Here's some code:

    static string GetSomething()
    {
        return "Hello";
    }

    static void Method1()
    {
        string result = GetSomething();
    }

    static void Method2()
    {
        GetSomething();
    }

If we look at the IL:

Method1:

.locals init ([0] string result)
IL_0000:  nop
IL_0001:  call       string ConsoleApplication3.Program::GetSomething()
IL_0006:  stloc.0
IL_0007:  ret

Method2:

IL_0000:  nop
IL_0001:  call       string ConsoleApplication3.Program::GetSomething()
IL_0006:  pop
IL_0007:  ret

Exactly the same number of instructions. In Method1, the value is stored in the local string result (stloc.0), which is deleted when it goes out of scope. In Method2, the pop operation simply removes it from the stack.

In your case of returning something 'really big', that data has already been created and the method returns a reference to it; not the data itself. In Method1(), the reference is assigned to the local variable and the garbage collector will tidy it up after the variable has gone out of scope (the end of the method in this case). In Method2(), the garbage collector can get to work, any time after the reference has been popped from the stack.

By ignoring the return value, if it really isn't needed, the garbage collector can potentially get to work sooner and release any memory that's been assigned. There's very little in it (certainly in this case), but with a long-running method, hanging onto that data could be an issue.

But far-and-away the most important thing is to be sure that the return value that you're ignoring isn't something that you should be acting on.

Is there an alternative to bastard injection? (AKA poor man's injection via default constructor)

47 votes

I most commonly am tempted to use "bastard injection" in a few cases. When I have a "proper" dependency-injection constructor:

public class ThingMaker {
    ...
    public ThingMaker(IThingSource source){
        _source = source;
    }

But then, for classes I am intending as public APIs (classes that other development teams will consume), I can never find a better option than to write a default "bastard" constructor with the most-likely needed dependency:

    public ThingMaker() : this(new DefaultThingSource()) {} 
    ...
}

The obvious drawback here is that this creates a static dependency on DefaultThingSource; ideally, there would be no such dependency, and the consumer would always inject whatever IThingSource they wanted. However, this is too hard to use; consumers want to new up a ThingMaker and get to work making Things, then months later inject something else when the need arises. This leaves just a few options in my opinion:

  1. Omit the bastard constructor; force the consumer of ThingMaker to understand IThingSource, understand how ThingMaker interacts with IThingSource, find or write a concrete class, and then inject an instance in their constructor call.
  2. Omit the bastard constructor and provide a separate factory, container, or other bootstrapping class/method; somehow make the consumer understand that they don't need to write their own IThingSource; force the consumer of ThingMaker to find and understand the factory or bootstrapper and use it.
  3. Keep the bastard constructor, enabling the consumer to "new up" an object and run with it, and coping with the optional static dependency on DefaultThingSource.

Boy, #3 sure seems attractive. Is there another, better option? #1 or #2 just don't seem worth it.

As far as I understand, this question relates to how to expose a loosely coupled API with some appropriate defaults. In this case, you may have a good Local Default, in which case the dependency can be regarded as optional. One way to deal with optional dependencies is to use Property Injection instead of Constructor Injection - in fact, this is sort of the poster scenario for Property Injection.

However, the real danger of Bastard Injection is when the default is a Foreign Default, because that would mean that the default constructor drags along an undesirable coupling to the assembly implementing the default. As I understand this question, however, the intended default would originate in the same assembly, in which case I don't see any particular danger.

In any case you might also consider a Facade as described in one of my earlier answers: Dependency Inject (DI) "friendly" library

BTW, the terminology used here is based on the pattern language from my book.

Difference between covariance and upcasting

43 votes

What is the difference between covariance and upcasting, or, more specifically, why are they given different names?

I've seen the following example referred to as 'upcasting':

string s = "hello";
object o = s;  // upcast from 'string' to 'object'

Whereas, the following I have seen called 'covariance':

string[] s = new string[100];
object[] o = s;

IEnumerable<string> ies = new List<string>();
IEnumerable<object> ieo = ies;

Now, to my untrained eye, covariance seems to be the same as upcasting, except that it refers to the casting of collections. (And a similar statement can be made regarding contravariance and downcasting.)

Is it really that simple?

Now, to my untrained eye, covariance seems to be the same as upcasting, except that it refers to the casting of collections. (And a similar statement can be made regarding contravariance and downcasting.)

Is it really that simple?

Covariance isn't about upcasting, although I can see why you think it's related.

Covariance is about the following very simple idea. Let's say you have a variable derivedSequence of type IEnumerable<Derived>. Let's say you have a variable baseSequence of type IEnumerable<Base>. Here, Derived derives from Base. Then, with covariance, the following is a legal assignment, and an implicit reference conversion occurs:

baseSequence = derivedSequence;

Note that this is not upcasting. It is not the case that IEnumerable<Derived> derives from IEnumerable<Base>. Rather, it is covariance that allows you to assign the value of the variable derivedSequence to the variable baseSequence. The idea is that variables of type Base can be assigned from objects of type Derived, and since IEnumerable<T> is covariant in its parameter, objects of type IEnumerable<Derived> can be assigned to variables of type IEnumerable<Base>.

Of course, I haven't yet really explained what covariance is. In general, covariance is about the following simple idea. Let's say you have a mapping F from types to types (I'll denote this mapping by F<T>; given a type T its image under the mapping F is F<T>.) Let's say that this mapping has the following very special property:

if X is assignment compatible with Y, then F<X> is assignment compatible with F<Y> as well.

In this case, we say that F is covariant in its parameter T. (Here, to say that "A is assignment compatible with B" where A and B are reference types means that instances of B can be stored in variables of type A.)

In our case, with IEnumerable<T> in C# 4.0, there is an implicit reference conversion from instances of IEnumerable<Derived> to IEnumerable<Base> if Derived is derived from Base. The direction of assignment compatibility is preserved, and this is why we say that IEnumerable<T> is covariant in its type parameter.
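As a rough illustration, here is a sketch in Java rather than C#. That choice is an assumption made to get a runnable example: Java expresses generic variance at the use site with wildcards, whereas C# 4.0 declares it on the interface. `Base`, `Derived`, and the method names are hypothetical. The array case shows why covariance is more than upcasting: the compiler accepts the assignment because the array type constructor preserves assignment compatibility, but the runtime must reject a write that would violate the array's real element type.

```java
import java.util.ArrayList;
import java.util.List;

public class VarianceSketch {
    public static class Base {}
    public static class Derived extends Base {}

    // Read-only use: a List<Derived> is accepted where List<? extends Base> is expected.
    public static int count(List<? extends Base> items) {
        int n = 0;
        for (Base ignored : items) {
            n++;
        }
        return n;
    }

    // Arrays are covariant as well, but element writes are checked at runtime.
    public static boolean arrayWriteRejected() {
        Object[] objects = new String[1]; // compiles: String[] converts to Object[]
        try {
            objects[0] = Integer.valueOf(1); // not a String, so the store is rejected
            return false;
        } catch (ArrayStoreException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        List<Derived> derived = new ArrayList<>();
        derived.add(new Derived());
        System.out.println(count(derived));
        System.out.println(arrayWriteRejected());
    }
}
```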

What is the better approach to convert primitive data type into String

39 votes

I can convert an integer into string using

String s = "" + 4; // correct, but poor style
or
String u = Integer.toString(4); // this is good

I can convert a double into string using

String s = "" + 4.5; // correct, but poor style
or
String u = Double.toString(4.5); // this is good

I can use the String s = "" + data approach to convert either an int or a double into a String, while if I want to use the other approach with toString(), I have to use the wrapper class of each data type. Why, then, do some books say that the first approach is poor and the second one is better? Which one is the better approach, and why?

I would use

String.valueOf(...)

You can use the same code for all types, but without the hideous and pointless string concatenation.

Note that it also says exactly what you want - the string value corresponding to the given primitive value. Compare that with the "" + x approach, where you're applying string concatenation even though you have no intention of concatenating anything, and the empty string is irrelevant to you. (It's probably more expensive, but it's the readability hit that I mind more than performance.)
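A short sketch of the "same code for all types" point. The class and method names are made up for the example; the `String.valueOf` overloads themselves are standard library methods.

```java
public class ValueOfSketch {
    // One method name covers every primitive type via overloading.
    public static String[] convertAll() {
        return new String[] {
            String.valueOf(4),     // int
            String.valueOf(4.5),   // double
            String.valueOf(4L),    // long
            String.valueOf(true),  // boolean
            String.valueOf('x')    // char
        };
    }

    public static void main(String[] args) {
        for (String s : convertAll()) {
            System.out.println(s);
        }
        // Same result as concatenation, without the throwaway empty string.
        System.out.println(String.valueOf(4).equals("" + 4));
    }
}
```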

GNU GCC (g++): Why does it generate multiple dtors?

35 votes

Developing environment: GNU GCC (g++) 4.1.2

While investigating how to increase code coverage (particularly function coverage) in unit testing, I've found that some class dtors seem to be generated multiple times. Does anyone have any idea why?

I observed this using the following code.

In "test.h"

class BaseClass
{
public:
    ~BaseClass();
    void someMethod();
};

class DerivedClass : public BaseClass
{
public:
    virtual ~DerivedClass();
    virtual void someMethod();
};

In "test.cpp"

#include <iostream>
#include "test.h"

BaseClass::~BaseClass()
{
    std::cout << "BaseClass dtor invoked" << std::endl;
}

void BaseClass::someMethod()
{
    std::cout << "Base class method" << std::endl;
}

DerivedClass::~DerivedClass()
{
    std::cout << "DerivedClass dtor invoked" << std::endl;
}

void DerivedClass::someMethod()
{
    std::cout << "Derived class method" << std::endl;
}

int main()
{
    BaseClass* b_ptr = new BaseClass;
    b_ptr->someMethod();
    delete b_ptr;
}

When I built the above code (g++ test.cpp -o test) and then looked at which symbols had been generated, as follows,

nm --demangle test

I could see the following output.

==== following is partial output ====
08048816 T DerivedClass::someMethod()
08048922 T DerivedClass::~DerivedClass()
080489aa T DerivedClass::~DerivedClass()
08048a32 T DerivedClass::~DerivedClass()
08048842 T BaseClass::someMethod()
0804886e T BaseClass::~BaseClass()
080488f6 T BaseClass::~BaseClass()

My questions are as follows.

1) Why have multiple dtors been generated (BaseClass - 2, DerivedClass - 3)?

2) What are the differences among these dtors? How will each of them be selected for use?

I now have a feeling that in order to achieve 100% function coverage for a C++ project, we would need to understand this so that I can invoke all those dtors in my unit tests.

I would greatly appreciate it if someone could answer the above.

First, the purposes of these functions are described in the Itanium C++ ABI; see definitions under "base object destructor", "complete object destructor", and "deleting destructor". The mapping to mangled names is given in 5.1.4.

Basically:

  • D2 is the "base object destructor". It destroys the object itself, as well as data members and non-virtual base classes.
  • D1 is the "complete object destructor". It additionally destroys virtual base classes.
  • D0 is the "deleting destructor". It does everything the complete object destructor does, plus it calls operator delete to actually free the memory.

If you have no virtual base classes, D2 and D1 are identical; GCC will, on sufficient optimization levels, actually alias the symbols to the same code for both.

Why not have all the functions as virtual in C++?

32 votes

I know that virtual functions have an overhead of dereferencing to call a method. But I guess with modern architectural speed it is almost negligible.

  1. Is there any particular reason why all functions in C++ are not virtual as in Java?
  2. As far as I know, declaring a function virtual in the base class is both necessary and sufficient. But when I write a parent class, I might not know which methods will get overridden. Does that mean that someone writing a child class would have to edit the parent class? That sounds inconvenient, and is sometimes not possible.

Update:
Summarizing from Jon Skeet's answer below:

It's a trade-off: explicitly making someone realize that they are inheriting functionality (which has potential risks in itself; check Jon's response), plus potential small performance gains, in exchange for less flexibility, more code changes, and a steeper learning curve.

Other reasons from different answers:

Virtual functions generally cannot be inlined, because the call target is only known at runtime while inlining happens at compile time. This has a performance impact when you expect your functions to benefit from inlining.

There might be potentially other reasons, and I would love to know and summarize them.

There are good reasons for controlling which methods are virtual beyond performance. While I don't actually make most of my methods final in Java, I probably should... unless a method is designed to be overridden, it probably shouldn't be virtual IMO.

Designing for inheritance can be tricky - in particular it means you need to document far more about what might call it and what it might call. Imagine if you have two virtual methods, and one calls the other - that must be documented, otherwise someone could override the "called" method with an implementation which calls the "calling" method, unwittingly creating a stack overflow (or infinite loop if there's tail call optimization). At that point you've then got less flexibility in your implementation - you can't switch it round at a later date.
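A minimal Java sketch of that scenario, with hypothetical class and method names: `Greeter.greet()` calls the overridable `name()`, and a subclass written without knowing this overrides `name()` in terms of `greet()`. Neither class looks wrong on its own, yet together they recurse until the stack is exhausted.

```java
public class SelfUseSketch {
    public static class Greeter {
        // Undocumented self-use: greet() depends on the overridable name().
        public String greet() {
            return "Hello, " + name();
        }

        public String name() {
            return "world";
        }
    }

    public static class ShoutingGreeter extends Greeter {
        // Looks harmless, but greet() already calls name(), so the two recurse forever.
        @Override
        public String name() {
            return greet().toUpperCase();
        }
    }

    // Returns true if calling greet() on the given object exhausts the stack.
    public static boolean overflows(Greeter greeter) {
        try {
            greeter.greet();
            return false;
        } catch (StackOverflowError e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(overflows(new Greeter()));
        System.out.println(overflows(new ShoutingGreeter()));
    }
}
```

Declaring `name()` final, or documenting that `greet()` calls it, would remove the trap; either way the base class author has to make a decision about inheritance rather than leave it open by default.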

Note that C# is a similar language to Java in various ways, but chose to make methods non-virtual by default. Some other people aren't keen on this, but I certainly welcome it - and I'd actually prefer that classes were uninheritable by default too.

Basically, it comes down to this advice from Josh Bloch: design for inheritance or prohibit it.

Should programmers use STL or write their own code?

32 votes

I don't know much about C++ data structures, but I am wondering: do you (programmers) use STL or write your own code? After all, STL is designed for tasks like searching, replacing, and much more over a collection of data.

It seems one doesn't really need to learn much about linked lists, binary search, and so on, because one could just use STL. What would you suggest?

You should use STL, because it is well tested and optimized.

That doesn't mean you shouldn't know how to write these data structures yourself. With that ability under your belt, you will be able to choose the best STL data structure for your application.

With "int i = 1;", why is (i >= 60 * 60 * 1000 / 1 * 1000) true?

31 votes

First, defining two constant expressions without parentheses is my fault:

#define BIG_INTERVAL 60 * 60 * 1000
#define SMALL_INTERVAL 1 * 1000

int i = 1;

if (i >= BIG_INTERVAL / SMALL_INTERVAL - 1)
{
    printf("Oops!\n");
}

The if statement after the macro expansion is if(i >= 60 * 60 * 1000 / 1 * 1000 - 1).

That was not my intention. But I found something strange: if I write if (i >= 3600000000 - 1), it is false.

What type is 60 * 60 * 1000 / 1 * 1000 - 1 ? int?

All operators on ints return int. So yes, 60 * 60 * 1000 / 1 * 1000 - 1 is an int. But the expected result of 3599999999 is too big for an int, so the expression overflows. Signed overflow is formally undefined behavior in C, but on typical implementations it wraps around and evaluates to -694967297 (assuming 32-bit int and two's complement).

This doesn't happen with a literal 3600000000 because integer literals larger than INT_MAX are of a type that can hold the full value.
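The arithmetic can be reproduced in Java, which is used here only because its `int` is defined to be 32 bits with wrap-around on overflow, so the result is deterministic. The C program itself relies on what typical two's-complement implementations do. The class and method names are hypothetical.

```java
public class IntervalSketch {
    // The expression as the unparenthesized macros expand it, in 32-bit int arithmetic.
    public static int expandedAsInt() {
        return 60 * 60 * 1000 / 1 * 1000 - 1;
    }

    // The same expression in 64-bit arithmetic: no wrap-around.
    public static long expandedAsLong() {
        return 60L * 60 * 1000 / 1 * 1000 - 1;
    }

    // What the macros were meant to compute, with each one parenthesized.
    public static int intended() {
        return (60 * 60 * 1000) / (1 * 1000) - 1;
    }

    public static void main(String[] args) {
        int i = 1;
        System.out.println(expandedAsInt());      // -694967297
        System.out.println(expandedAsLong());     // 3599999999
        System.out.println(intended());           // 3599
        System.out.println(i >= expandedAsInt()); // true: the comparison from the question
        System.out.println(i >= intended());      // false: the comparison that was intended
    }
}
```

Wrapping each macro body in parentheses, as in `intended()`, is the usual fix on the C side.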