Best questions in August 2011

How does this CSS triangle shape work?

372 votes

There are plenty of different CSS shapes over at http://css-tricks.com/examples/ShapesOfCSS/ and I'm particularly puzzled by a triangle:

Triangle

#triangle-up {
    width: 0;
    height: 0;
    border-left: 50px solid transparent;
    border-right: 50px solid transparent;
    border-bottom: 100px solid red;
}

So, how and why does it work?

CSS Triangles: A Tragedy in Five Acts

As alex said, borders of equal width butt up against each other at 45 degree angles:

borders meet at 45 degree angles, content in middle

When you have no top border, it looks like this:

no top border

Then you give it a width of 0...

no width

...and a height of 0...

no height either

...and finally, you make the two side borders transparent:

transparent side borders

That results in a triangle.

The End
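
The same geometry can be checked numerically. Below is a minimal Python sketch (not CSS, and not how any browser is implemented) that decides which border paints a given point of the #triangle-up box. It assumes the joint between two borders is a straight line from the outer corner to the collapsed content box, which is how browsers typically draw solid borders; the names border_owner and render are made up for this illustration.

```python
def border_owner(x, y, left=50, right=50, bottom=100):
    """Return which border paints the point (x, y) of a zero-size box with
    no top border. Origin is the top-left corner; y grows downward."""
    width, height = left + right, bottom
    # The joints run from the outer bottom corners up to the collapsed
    # content box at (left, 0).
    if x * height < left * (height - y):
        return "left"
    if (width - x) * height < right * (height - y):
        return "right"
    return "bottom"


def render(cols=20, rows=10):
    """Sample the 100x100 border box on a coarse grid: '#' marks the red
    bottom border, '.' marks the transparent side borders."""
    lines = []
    for r in range(rows):
        y = (r + 0.5) * 100 / rows
        line = ""
        for c in range(cols):
            x = (c + 0.5) * 100 / cols
            line += "#" if border_owner(x, y) == "bottom" else "."
        lines.append(line)
    return "\n".join(lines)


print(render())
```

Printing the grid shows a block of '#' characters that narrows toward the top, which is the upward-pointing triangle left visible once the side borders are transparent.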

Why must we define both == and != in C#?

171 votes

The C# compiler requires that whenever a custom type defines operator ==, it must also define != (see here).

Why?

I'm curious to know why the designers thought it necessary and why can't the compiler default to a reasonable implementation for either of the operators when only the other is present. For example, Lua lets you define only the equality operator and you get the other for free. C# could do the same by asking you to define either == or both == and != and then automatically compile the missing != operator as !(left == right).

I understand that there are weird corner cases where some entities may neither be equal nor unequal, (like IEEE-754 NaN's), but those seem like the exception, not the rule. So this doesn't explain why the C# compiler designers made the exception the rule.

I've seen cases of poor workmanship where the equality operator is defined, then the inequality operator is a copy-paste with each and every comparison reversed and every && switched to a || (you get the point... basically !(a==b) expanded through De Morgan's rules). That's poor practice that the compiler could eliminate by design, as is the case with Lua.

Note: The same holds for operators < > <= >=. I can't imagine cases where you'll need to define these in unnatural ways. Lua lets you define only < and <= and defines >= and > naturally through the formers' negation. Why doesn't C# do the same (at least 'by default')?
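
For comparison, Python 3 is another language that behaves the way the question describes for Lua: if a class defines only equality, inequality is derived by inverting the result. The sketch below only illustrates that design and says nothing about C#; Point is a made-up class.

```python
class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y

    # Only equality is defined; there is no __ne__ anywhere in this class.
    def __eq__(self, other):
        if not isinstance(other, Point):
            return NotImplemented
        return (self.x, self.y) == (other.x, other.y)


a, b, c = Point(1, 2), Point(1, 2), Point(3, 4)
print(a == b, a != b)  # True False
print(a == c, a != c)  # False True
```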

EDIT

Apparently there are valid reasons to allow the programmer to implement checks for equality and inequality however they like. Some of the answers point to cases where that may be nice.

The kernel of my question, however, is why this is forcibly required in C# when usually it's not logically necessary?

It is also in striking contrast to design choices for .NET interfaces like Object.Equals, IEquatable.Equals and IEqualityComparer.Equals, where the lack of a NotEquals counterpart shows that the framework considers !Equals() objects as unequal and that's that. Furthermore, classes like Dictionary and methods like .Contains() depend exclusively on the aforementioned interfaces and do not use the operators directly even if they are defined. In fact, when ReSharper generates equality members, it defines both == and != in terms of Equals() and even then only if the user chooses to generate operators at all. The equality operators aren't needed by the framework to understand object equality.

Basically, the .NET framework doesn't care about these operators, it only cares about a few Equals methods. The decision to require both == and != operators to be defined in tandem by the user is related purely to the language design and not object semantics as far as .NET is concerned.

I can't speak for the language designers, but from what I can reason out, it seems like it was an intentional, proper design decision.

Looking at this basic F# code, you can compile this into a working library. This is legal code for F#, and only overloads the equality operator, not the inequality:

module Module1

type Foo() =
    let mutable myInternalValue = 0
    member this.Prop
        with get () = myInternalValue
        and set (value) = myInternalValue <- value

    static member op_Equality (left : Foo, right : Foo) = left.Prop = right.Prop
    //static member op_Inequality (left : Foo, right : Foo) = left.Prop <> right.Prop

This does exactly what it looks like. It creates an equality comparer on == only, and checks to see if the internal values of the class are equal.

While you can't create a class like this in C#, you can use one that was compiled for .NET. It's obvious it will use our overloaded operator for ==. So, what does the runtime use for !=?

The C# ECMA standard has a whole bunch of rules (section 14.9) explaining how to determine which operator to use when evaluating equality. To put it in an overly simplified and thus not perfectly accurate way: if the operands being compared are of the same type and there is an overloaded equality operator present, it will use that overload and not the standard reference equality operator inherited from Object. It is no surprise, then, that if only one of the operators is present, the other one falls back to the default reference equality operator that all objects have, because there is no overload for it.[1]

Knowing that this is the case, the real question is: Why was this designed in this way and why doesn't the compiler figure it out on its own? A lot of people are saying this wasn't a design decision, but I like to think it was thought out this way, especially regarding the fact that all objects have a default equality operator.

So, why doesn't the compiler automagically create the != operator? I can't know for sure unless someone from Microsoft confirms this, but this is what I can determine from reasoning on the facts.


To prevent unexpected behavior

Perhaps I want to do a value comparison on == to test equality. However, when it comes to !=, I don't care at all whether the values are equal unless the references are equal, because for my program to consider them equal, I only care whether the references match. After all, this is actually outlined as the default behavior of C# (if both operators were not overloaded, as would be the case with some .NET libraries written in another language). If the compiler added code automatically, I could no longer rely on the compiler to output code that is compliant. The compiler should not write hidden code that changes the behavior of yours, especially when the code you've written is within the standards of both C# and the CLI.

In terms of it forcing you to overload it, instead of going to the default behavior, I can only firmly say that it is in the standard (ECMA-334 17.9.2)[2]. The standard does not specify why. I believe this is due to the fact that C# borrows much behavior from C++. See below for more on this.


When you override != and ==, you do not have to return bool.

This is another likely reason. In C#, this function:

public static int operator ==(MyClass a, MyClass b) { return 0; }

is as valid as this one:

public static bool operator ==(MyClass a, MyClass b) { return true; }

If you're returning something other than bool, the compiler cannot automatically infer an opposite type. Furthermore, in the case where your operator does return bool, it just doesn't make sense for them to generate code that would only exist in that one specific case or, as I said above, code that hides the default behavior of the CLR.
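
To see what goes wrong when a language does infer the opposite operator from a non-bool result, here is a small Python sketch. Python derives != from == by negating the truthiness of the result, so an == that returns an expression object quietly degrades to a plain boolean. This illustrates the general problem only and is not C# behavior; Column is a hypothetical query-builder class invented for the example.

```python
class Column:
    """Hypothetical query-builder column: == builds an expression, not a bool."""

    def __init__(self, name):
        self.name = name

    def __eq__(self, other):
        return ("eq", self.name, other)   # deliberately not a bool


age = Column("age")
print(age == 30)   # ('eq', 'age', 30)
print(age != 30)   # False
```

The derived != cannot produce a matching "not equal" expression; it can only negate, and negating a non-empty tuple yields False.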


C# borrows much from C++[3]

When C# was introduced, an article in MSDN Magazine said this about C#:

Many developers wish there was a language that was easy to write, read, and maintain like Visual Basic, but that still provided the power and flexibility of C++.

Yes, the design goal for C# was to give nearly the same amount of power as C++, sacrificing only a little for conveniences like rigid type-safety and garbage-collection. C# was strongly modeled after C++.

You may not be surprised to learn that in C++, the equality operators do not have to return bool, as shown in this example program.

Now, C++ does not directly require you to overload the complementary operator. If you compiled the code in the example program, you will see it runs with no errors. However, if you tried adding the line:

cout << (a != b);

you will get

compiler error C2678 (MSVC) : binary '!=' : no operator found which takes a left-hand operand of type 'Test' (or there is no acceptable conversion).

So, while C++ itself doesn't require you to overload in pairs, it will not let you use an equality operator that you haven't overloaded on a custom class. It's valid in .NET, because all objects have a default one; C++ does not.


1. As a side note, the C# standard still requires you to overload the pair of operators if you want to overload either one. This is a part of the standard and not simply the compiler. However, the same rules regarding the determination of which operator to call apply when you're accessing a .net library written in another language that doesn't have the same requirements.

2. ECMA-334 (pdf) (http://www.ecma-international.org/publications/files/ECMA-ST/Ecma-334.pdf)

3. And Java, but that's really not the point here

Why does [1,2] + [3,4] = "1,23,4" in JavaScript?

150 votes

I wanted to add the elements of an array into another, so I tried this simple sentence in our beloved Firebug:

[1,2] + [3,4]

It responded with:

"1,23,4"

What is going on?

The + operator is not defined for arrays.

What happens is that JavaScript converts arrays into strings and concatenates those.
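
The conversion can be modeled in a few lines. The Python sketch below is a rough emulation, not the ECMAScript algorithm itself: it assumes the array elements are integers, strings, None (standing in for null or undefined) or nested lists, and the function names are made up for this illustration.

```python
def js_array_to_string(value):
    """Rough model of how JavaScript stringifies an array: the elements
    are joined with ','. Handles ints, strings, None and nested lists."""
    if isinstance(value, list):
        return ",".join(js_array_to_string(v) for v in value)
    if value is None:
        return ""          # join() renders null and undefined elements as ''
    return str(value)


def js_array_plus(left, right):
    """Model of `left + right` when both operands are arrays."""
    return js_array_to_string(left) + js_array_to_string(right)


print(js_array_plus([1, 2], [3, 4]))  # 1,23,4
```

The "2" and "3" end up adjacent because each array is stringified separately ("1,2" and "3,4") and nothing is inserted between the two strings.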

 

Update

Since this question and consequently my answer is getting a lot of attention I felt that in addition to the insightful stuff posted by Jeremy Banks it would be useful to have an overview about how the + operator behaves in general.

So, here it goes.

Excluding E4X and implementation-specific stuff, JavaScript has 6 built-in data types:

  1. undefined
  2. boolean
  3. number
  4. string
  5. function
  6. object

Note that neither null nor [] is a separate type - both return object when fed to typeof. However + works differently in either case.

That's right - JavaScript has no primitive arrays as such; only instances of a class called Array with some syntactic sugar to ease the pain.

Adding more to the confusion, wrapper entities such as new Number(5), new Boolean(true) and new String("abc") are all of object type, not numbers, booleans or strings as one might expect. Nevertheless for arithmetic operators Number and Boolean behave as numbers.

Easy, huh? With all that out of the way, we can move on to the overview itself.

Different result types of + by operand types

-------------------------------------------------------------------------------------------
            | undefined | boolean | number | string | function | object | null   | array  |
-------------------------------------------------------------------------------------------
undefined   | number    | number  | number | string | string   | string | number | string |
boolean     | number    | number  | number | string | string   | string | number | string |
number      | number    | number  | number | string | string   | string | number | string |
string      | string    | string  | string | string | string   | string | string | string |
function    | string    | string  | string | string | string   | string | string | string |
object      | string    | string  | string | string | string   | string | string | string |
null        | number    | number  | number | string | string   | string | number | string |
array       | string    | string  | string | string | string   | string | string | string |
-------------------------------------------------------------------------------------------

* this applies to Chrome 13, Firefox 6, Opera 11 and IE9. Checking other browsers and versions is left as an exercise for the reader.

Note: As pointed out by CMS, for certain cases of objects such as Number, Boolean and custom ones the + operator doesn't necessarily produce a string result. It can vary depending on the implementation of object to primitive conversion. For example var o = { valueOf:function () { return 4; } }; evaluating o + 2; produces 6, a number, evaluating o + '2' produces '42', a string.

To see how the overview table was generated visit http://jsfiddle.net/4EjXd/

int a[] = {1,2,}; Weird comma allowed. Any particular reason?

126 votes

Maybe I am not from this planet, but it would seem to me that the following should be a syntax error:

int a[] = {1,2,}; //extra comma in the end

But it's not. I was surprised when this code compiled on Visual Studio, but I have learnt not to trust MSVC compiler as far as C++ rules are concerned, so I checked the standard and it is allowed by the standard as well. You can see 8.5.1 for the grammar rules if you don't believe me.

Why is this allowed? This may be a stupid useless question but I want you to understand why I am asking. If it were a sub-case of a general grammar rule, I would understand - they decided not to make the general grammar any more difficult just to disallow a redundant comma at the end of an initializer list. But no, the additional comma is explicitly allowed. For example, it isn't allowed to have a redundant comma in the end of a function-call argument list (when the function takes ...), which is normal.

So, again, is there any particular reason this redundant comma is explicitly allowed?

It makes it easier to generate source code, and also to write code which can be easily extended at a later date. Consider what's required to add an extra entry to:

int a[] = {
   1,
   2,
   3
};

... you have to add the comma to the existing line and add a new line. Compare that with the case where the three already has a comma after it, where you just have to add a line. Likewise if you want to remove a line you can do so without worrying about whether it's the last line or not, and you can reorder lines without fiddling about with commas. Basically it means there's a uniformity in how you treat the lines.

Now think about generating code. Something like (pseudo-code):

output("int a[] = {");
for (int i = 0; i < items.length; i++) {
    output("%s, ", items[i]);
}
output("};");

No need to worry about whether the current item you're writing out is the first or the last. Much simpler.
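
A runnable version of the same idea, as a minimal Python sketch; emit_initializer is a hypothetical helper, not part of any real code generator. Every item is written the same way, and the trailing comma keeps the generated C valid for a non-empty list.

```python
def emit_initializer(name, items):
    """Build a C array initializer, writing every item the same way."""
    parts = ["int %s[] = {" % name]
    for item in items:
        parts.append("%s, " % item)   # no first/last special case
    parts.append("};")
    return "".join(parts)


print(emit_initializer("a", [1, 2, 3]))  # int a[] = {1, 2, 3, };
```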

Making your .NET language step correctly in the debugger

103 votes

Firstly, I apologize for the length of this question.

I am the author of IronScheme. Recently I have been working hard on emitting decent debug info, so that I can use the 'native' .NET debugger.

While this has been partly successful, I am running into some teething problems.

The first problem is related to stepping.

Due to Scheme being an expression language, everything tends to be wrapped in parentheses, unlike the major .NET languages, which seem to be statement (or line) based.

The original code (Scheme) looks like:

(define (baz x)
  (cond
    [(null? x) 
      x]
    [(pair? x) 
      (car x)]
    [else
      (assertion-violation #f "nooo" x)]))

I have on purpose laid out each expression on a new line.

The emitted code, transformed to C# (via ILSpy), looks like:

public static object ::baz(object x)
{
  if (x == null)
  {
    return x;
  }
  if (x is Cons)
  {
    return Builtins.Car(x);
  }
  return #.ironscheme.exceptions::assertion-violation+(
     RuntimeHelpers.False, "nooo", Builtins.List(x));
}

As you can see, pretty simple.

Note: If the code was transformed into a conditional expression (?:) in C#, the whole thing would just be one debug step, keep that in mind.

Here is IL output with source and line numbers:

  .method public static object  '::baz'(object x) cil managed
  {
    // Code size       56 (0x38)
    .maxstack  6
    .line 15,15 : 1,2 ''
//000014: 
//000015: (define (baz x)
    IL_0000:  nop
    .line 17,17 : 6,15 ''
//000016:   (cond
//000017:     [(null? x) 
    IL_0001:  ldarg.0
    IL_0002:  brtrue     IL_0009

    .line 18,18 : 7,8 ''
//000018:       x]
    IL_0007:  ldarg.0
    IL_0008:  ret

    .line 19,19 : 6,15 ''
//000019:     [(pair? x) 
    .line 19,19 : 6,15 ''
    IL_0009:  ldarg.0
    IL_000a:  isinst [IronScheme]IronScheme.Runtime.Cons
    IL_000f:  ldnull
    IL_0010:  cgt.un
    IL_0012:  brfalse    IL_0020

    IL_0017:  ldarg.0
    .line 20,20 : 7,14 ''
//000020:       (car x)]
    IL_0018:  tail.
    IL_001a:  call object [IronScheme]IronScheme.Runtime.Builtins::Car(object)
    IL_001f:  ret

    IL_0020:  ldsfld object 
         [Microsoft.Scripting]Microsoft.Scripting.RuntimeHelpers::False
    IL_0025:  ldstr      "nooo"
    IL_002a:  ldarg.0
    IL_002b:  call object [IronScheme]IronScheme.Runtime.Builtins::List(object)
    .line 22,22 : 7,40 ''
//000021:     [else
//000022:       (assertion-violation #f "nooo" x)]))
    IL_0030:  tail.
    IL_0032:  call object [ironscheme.boot]#::
       'ironscheme.exceptions::assertion-violation+'(object,object,object)
    IL_0037:  ret
  } // end of method 'eval-core(033)'::'::baz'

Note: To prevent the debugger from simply highlighting the entire method, I make the method entry point just 1 column wide.

As you can see, each expression maps correctly to a line.

Now the problem with stepping (tested on VS2010, but same/similar issue on VS2008):

These are with IgnoreSymbolStoreSequencePoints not applied.

  1. Call baz with null arg, it works correctly. (null? x) followed by x.
  2. Call baz with Cons arg, it works correctly. (null? x) then (pair? x) then (car x).
  3. Call baz with other arg, it fails. (null? x) then (pair? x) then (car x) then (assertion-violation ...).

When applying IgnoreSymbolStoreSequencePoints (as recommended):

  1. Call baz with null arg, it works correctly. (null? x) followed by x.
  2. Call baz with Cons arg, it fails. (null? x) then (pair? x).
  3. Call baz with other arg, it fails. (null? x) then (pair? x) then (car x) then (assertion-violation ...).

I also find in this mode that some lines (not shown here) are incorrectly highlighted, they are off by 1.

Here are some ideas what could be the causes:

  • Tail calls confuse the debugger
  • Overlapping locations (not shown here) confuse the debugger (it does so very well when setting a breakpoint)
  • ????

The second, but also serious, issue is the debugger failing to break/hit breakpoints in some cases.

The only place where I can get the debugger to break correctly (and consistently) is at the method entry point.

The situation gets a bit better when IgnoreSymbolStoreSequencePoints is not applied.

Conclusion

It might be that the VS debugger is just plain buggy :(

References:

  1. Making a CLR/.NET Language Debuggable

Update 1:

Mdbg does not work for 64-bit assemblies. So that is out. I have no more 32-bit machines to test it on. Update: I am sure this is no big problem, does anyone have a fix? Edit: Yes, silly me, just start mdbg under the x64 command prompt :)

Update 2:

I have created a C# app, and tried to dissect the line info.

My findings:

  • After any brXXX instruction you need to have a sequence point (if not valid aka '#line hidden', emit a nop).
  • Before any brXXX instruction, emit a '#line hidden' and a nop.

Applying this does not, however, fix the issues (alone?).

But adding the following, gives the desired result :)

  • After ret, emit a '#line hidden' and a nop.

This is using the mode where IgnoreSymbolStoreSequencePoints is not applied. When applied, some steps are still skipped :(

Here is the IL output when above has been applied:

  .method public static object  '::baz'(object x) cil managed
  {
    // Code size       63 (0x3f)
    .maxstack  6
    .line 15,15 : 1,2 ''
    IL_0000:  nop
    .line 17,17 : 6,15 ''
    IL_0001:  ldarg.0
    .line 16707566,16707566 : 0,0 ''
    IL_0002:  nop
    IL_0003:  brtrue     IL_000c

    .line 16707566,16707566 : 0,0 ''
    IL_0008:  nop
    .line 18,18 : 7,8 ''
    IL_0009:  ldarg.0
    IL_000a:  ret

    .line 16707566,16707566 : 0,0 ''
    IL_000b:  nop
    .line 19,19 : 6,15 ''
    .line 19,19 : 6,15 ''
    IL_000c:  ldarg.0
    IL_000d:  isinst     [IronScheme]IronScheme.Runtime.Cons
    IL_0012:  ldnull
    IL_0013:  cgt.un
    .line 16707566,16707566 : 0,0 ''
    IL_0015:  nop
    IL_0016:  brfalse    IL_0026

    .line 16707566,16707566 : 0,0 ''
    IL_001b:  nop
    IL_001c:  ldarg.0
    .line 20,20 : 7,14 ''
    IL_001d:  tail.
    IL_001f:  call object [IronScheme]IronScheme.Runtime.Builtins::Car(object)
    IL_0024:  ret

    .line 16707566,16707566 : 0,0 ''
    IL_0025:  nop
    IL_0026:  ldsfld object 
      [Microsoft.Scripting]Microsoft.Scripting.RuntimeHelpers::False
    IL_002b:  ldstr      "nooo"
    IL_0030:  ldarg.0
    IL_0031:  call object [IronScheme]IronScheme.Runtime.Builtins::List(object)
    .line 22,22 : 7,40 ''
    IL_0036:  tail.
    IL_0038:  call object [ironscheme.boot]#::
      'ironscheme.exceptions::assertion-violation+'(object,object,object)
    IL_003d:  ret

    .line 16707566,16707566 : 0,0 ''
    IL_003e:  nop
  } // end of method 'eval-core(033)'::'::baz'
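
The padding rules from this update can be sketched as a small rewriting pass. This is a Python illustration only, not IronScheme's actual emitter: the (line, opcode) tuple format and the name pad_for_stepping are assumptions made for the example. The line value 16707566 in the listing is 0xFEEFEE, the value that marks a hidden sequence point ('#line hidden').

```python
HIDDEN_LINE = 0xFEEFEE  # 16707566, the '#line hidden' value seen in the listing


def pad_for_stepping(instructions):
    """Apply the three padding rules to a list of (line, opcode) pairs.

    `line` is a source line number, or None when the instruction has no
    sequence point of its own. This tuple format is an assumption made
    for this sketch, not IronScheme's real data model.
    """
    hidden_nop = (HIDDEN_LINE, "nop")
    padded = []
    for line, opcode in instructions:
        # Assumption: "brXXX" means opcodes such as br, brtrue, brfalse.
        is_branch = opcode.startswith("br") and opcode != "break"
        if is_branch:
            padded.append(hidden_nop)      # hidden nop before a branch
        padded.append((line, opcode))
        if is_branch or opcode == "ret":
            padded.append(hidden_nop)      # hidden nop after a branch or ret
    return padded


method = [(17, "ldarg.0"), (None, "brtrue"), (18, "ldarg.0"), (None, "ret")]
for line, opcode in pad_for_stepping(method):
    print(line, opcode)
```

For the four-instruction input, the output has the same shape as IL_0001 through IL_000b in the listing: a hidden nop on each side of the branch and one after the ret.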

Update 3:

Problem with the above 'semi-fix': Peverify reports errors on all methods due to the nop after ret. I don't really understand the problem. How can a nop after a ret break verification? It is like dead code (except that it is NOT even code) ... Oh well, experimentation continues.

Update 4:

Back at home now, removed the 'unverifiable' code, running on VS2008 and things are a lot worse. Perhaps running unverifiable code for the sake of proper debugging might be the answer. In 'release' mode, all output would still be verifiable.

Update 5:

I have now decided my above idea is the only viable option for now. Although the generated code is unverifiable, I have yet to find any VerificationExceptions. I don't know what the impact will be on the end user with this scenario.

As a bonus, my second issue has also been solved. :)

Here is a little screencast of what I ended up with. It hits breakpoints, does proper stepping (in/out/over), etc. All in all, the desired effect.

I, however, am still not accepting this as the way to do it. It feels overly hacky to me. Having a confirmation on the real issue would be nice.

Update 6:

Just had the chance to test the code on VS2010, and there seem to be some problems:

  1. The first call now does not step correctly. (assertion-violation ...) is hit. Other cases work fine. Some old code emitted unnecessary positions. Removed the code, works as expected. :)
  2. More seriously, breakpoints fail on the second invocation of the program (using in-memory compilation, dumping assembly to file seems to make breakpoints happy again).

Both these cases work correctly under VS2008. The main difference is that under VS2010, the entire application is compiled for .NET 4 and under VS2008, compiles to .NET 2. Both running 64-bit.

Update 7:

As mentioned, I got mdbg running under 64-bit. Unfortunately, it also has the breakpoint issue where it fails to break if I rerun the program (this implies it gets recompiled, so not using the same assembly, but still using the same source).

Update 8:

I have filed a bug at the MS Connect site regarding the breakpoint issue.

Rant mode on

Let's hope this gets fixed by MS, unlike previous bugs they claim to have fixed, but are yet to be seen in public.

Rant mode off

I am an engineer on the Visual Studio Debugger team.

Correct me if I am wrong, but it sounds like the only issue left is that when switching from PDBs to the .NET 4 dynamic compile symbol format some breakpoints are being missed.

We would probably need a repro to exactly diagnose the issue, however here are some notes that might help.

  1. VS (2008+) can-to run as a non-admin
  2. Do any symbols load at all the second time around? You might test by breaking in (through exception or call System.Diagnostics.Debugger.Break())
  3. Assuming that symbols load, is there a repro that you could send us?
  4. The likely difference is that the symbol format for dynamic-compiled code is 100% different between .NET 2 (PDB stream) and .NET 4 (IL DB I think they called it?)
  5. The 'nop's sound about right. See rules for generating implicit sequence points below.
  6. You don't actually need to emit things on different lines. By default, VS will step 'symbol-statements' where, as the compiler writer you get to define what 'symbol-statement' means. So if you want each expression to be a separate thing in the symbol file, that will work just fine.

The JIT creates an implicit sequence point based on the following rules:

  1. IL nop instructions
  2. IL stack empty points
  3. The IL instruction immediately following a call instruction

If it turns out we do need a repro to solve your issue, you can file a connect bug and upload files securely through that medium.

Luke

Speed comparison with Project Euler: C vs Python vs Erlang vs Haskell

84 votes

I have taken Problem #12 from Project Euler as a programming exercise and to compare my (surely not optimal) implementations in C, Python, Erlang and Haskell. In order to get some higher execution times, I search for the first triangle number with more than 1000 divisors instead of 500 as stated in the original problem.

The result is the following:

C:

lorenzo@enzo:~/erlang$ gcc -lm -o euler12.bin euler12.c
lorenzo@enzo:~/erlang$ time ./euler12.bin
842161320

real    0m11.074s
user    0m11.070s
sys 0m0.000s

python:

lorenzo@enzo:~/erlang$ time ./euler12.py 
842161320

real    1m16.632s
user    1m16.370s
sys 0m0.250s

python with pypy:

lorenzo@enzo:~/Downloads/pypy-c-jit-43780-b590cf6de419-linux64/bin$ time ./pypy /home/lorenzo/erlang/euler12.py 
842161320

real    0m13.082s
user    0m13.050s
sys 0m0.020s

erlang:

lorenzo@enzo:~/erlang$ erlc euler12.erl 
lorenzo@enzo:~/erlang$ time erl -s euler12 solve
Erlang R13B03 (erts-5.7.4) [source] [64-bit] [smp:4:4] [rq:4] [async-threads:0] [hipe] [kernel-poll:false]

Eshell V5.7.4  (abort with ^G)
1> 842161320

real    0m48.259s
user    0m48.070s
sys 0m0.020s

haskell:

lorenzo@enzo:~/erlang$ ghc euler12.hs -o euler12.hsx
[1 of 1] Compiling Main             ( euler12.hs, euler12.o )
Linking euler12.hsx ...
lorenzo@enzo:~/erlang$ time ./euler12.hsx 
842161320

real    2m37.326s
user    2m37.240s
sys 0m0.080s

Summary:

  • C: 100%
  • python: 692% (118% with pypy)
  • erlang: 436% (135% thanks to RichardC)
  • haskell: 1421%

I suppose that C has a big advantage as it uses long for the calculations and not arbitrary length integers as the other three. Also it doesn't need to load a runtime first (Do the others?).

Question 1: Do Erlang, Python and Haskell lose speed due to using arbitrary length integers or don't they as long as the values are less than MAXINT?

Question 2: Why is Haskell so slow? Is there a compiler flag that turns off the brakes or is it my implementation? (The latter is quite probable as Haskell is a book with seven seals to me.)

Question 3: Can you offer me some hints how to optimize these implementations without changing the way I determine the factors? Optimization in any way: nicer, faster, more "native" to the language.

EDIT:

Question 4: Do my functional implementations permit LCO (last call optimization, a.k.a tail recursion elimination) and hence avoid adding unnecessary frames onto the call stack?

I really tried to implement the same algorithm as similarly as possible in the four languages, although I have to admit that my Haskell and Erlang knowledge is very limited.
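
The shared algorithm can be checked on a small input before timing the full run. The following Python sketch uses the same trial division up to the square root, with two deliberate changes: math.isqrt (Python 3.8+) replaces the floating-point math.sqrt, and the divisor threshold is a parameter. The function names are made up for this illustration. With a threshold of 5 it returns 28, the example given in the Project Euler problem statement.

```python
import math


def factor_count(n):
    """Count the divisors of n by trial division up to sqrt(n)."""
    root = math.isqrt(n)                   # exact integer square root (Python 3.8+)
    count = -1 if root * root == n else 0  # a perfect square's root pairs with itself
    for candidate in range(1, root + 1):
        if n % candidate == 0:
            count += 2                     # candidate and n // candidate
    return count


def first_triangle_with_more_divisors_than(limit):
    """Return the first triangle number with more than `limit` divisors."""
    index, triangle = 1, 1
    while factor_count(triangle) <= limit:
        index += 1
        triangle += index
    return triangle


print(factor_count(28))                           # 6
print(first_triangle_with_more_divisors_than(5))  # 28
```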


Source code used:

#include <stdio.h>
#include <math.h>

int factorCount (long n)
{
    double square = sqrt (n);
    int isquare = (int) square;
    int count = isquare == square ? -1 : 0;
    long candidate;
    for (candidate = 1; candidate <= isquare; candidate ++)
        if (0 == n % candidate) count += 2;
    return count;
}

int main ()
{
    long triangle = 1;
    int index = 1;
    while (factorCount (triangle) < 1001)
    {
        index ++;
        triangle += index;
    }
    printf ("%ld\n", triangle);
}

#! /usr/bin/env python3.2

import math

def factorCount (n):
    square = math.sqrt (n)
    isquare = int (square)
    count = -1 if isquare == square else 0
    for candidate in range (1, isquare + 1):
        if not n % candidate: count += 2
    return count

triangle = 1
index = 1
while factorCount (triangle) < 1001:
    index += 1
    triangle += index

print (triangle)

-module (euler12).
-compile (export_all).

factorCount (Number) -> factorCount (Number, math:sqrt (Number), 1, 0).

factorCount (_, Sqrt, Candidate, Count) when Candidate > Sqrt -> Count;

factorCount (_, Sqrt, Candidate, Count) when Candidate == Sqrt -> Count + 1;

factorCount (Number, Sqrt, Candidate, Count) ->
    case Number rem Candidate of
        0 -> factorCount (Number, Sqrt, Candidate + 1, Count + 2);
        _ -> factorCount (Number, Sqrt, Candidate + 1, Count)
    end.

nextTriangle (Index, Triangle) ->
    Count = factorCount (Triangle),
    if
        Count > 1000 -> Triangle;
        true -> nextTriangle (Index + 1, Triangle + Index + 1)  
    end.

solve () ->
    io:format ("~p~n", [nextTriangle (1, 1) ] ),
    halt (0).

factorCount number = factorCount' number isquare 1 0 - (fromEnum $ square == fromIntegral isquare)
    where square = sqrt $ fromIntegral number
          isquare = floor square

factorCount' number sqrt candidate count
    | fromIntegral candidate > sqrt = count
    | number `mod` candidate == 0 = factorCount' number sqrt (candidate + 1) (count + 2)
    | otherwise = factorCount' number sqrt (candidate + 1) count

nextTriangle index triangle
    | factorCount triangle > 1000 = triangle
    | otherwise = nextTriangle (index + 1) (triangle + index + 1)

main = print $ nextTriangle 1 1

Using GHC 7.0.3, gcc 4.4.6, Linux 2.6.29 on an x86_64 Core2 Duo (2.5GHz) machine, compiling using ghc -O2 -fllvm -fforce-recomp for Haskell and gcc -O3 -lm for C.

  • Your C routine runs in 8.4 seconds (faster than your run probably because of -O3)
  • The Haskell solution runs in 36 seconds (due to the -O2 flag)
  • Your factorCount' code is polymorphic and doesn't seem to be getting specialized for some reason. Adding an explicit type signature (which is standard practice anyway) changes the time to 11.1 seconds
  • in factorCount' you have needlessly called fromIntegral. A fix results in no change though (the compiler is smart, lucky for you).
  • You used mod where rem is faster and sufficient. This changes the time to 8.5 seconds.
  • factorCount' is constantly applying two extra arguments that never change (number, sqrt). A worker/wrapper transformation gives us:

    $ time ./so
    842161320

    real    0m7.954s
    user    0m7.944s
    sys     0m0.004s

That's right, 7.95 seconds. Consistently half a second faster than the C solution. Without the -fllvm flag I'm still getting 8.182 seconds, so the NCG backend is doing well in this case too.

Conclusion: Haskell is awesome.

Resulting Code

factorCount number = factorCount' number isquare 1 0 - (fromEnum $ square == fromIntegral isquare)
    where square = sqrt $ fromIntegral number
          isquare = floor square

factorCount' :: Int -> Int -> Int -> Int -> Int
factorCount' number sqrt candidate0 count0 = go candidate0 count0
  where
  go candidate count
    | candidate > sqrt = count
    | number `rem` candidate == 0 = go (candidate + 1) (count + 2)
    | otherwise = go (candidate + 1) count

nextTriangle index triangle
    | factorCount triangle > 1000 = triangle
    | otherwise = nextTriangle (index + 1) (triangle + index + 1)

main = print $ nextTriangle 1 1

EDIT: So now that we've explored that, let's address the questions

Question 1: Do erlang, python and haskell lose speed due to using arbitrary length integers or don't they as long as the values are less than MAXINT?

In Haskell, using Integer is slower than Int but how much slower depends on the computations performed. Luckily (for 64 bit machines) Int is sufficient. For portability's sake you should probably rewrite my code to use Int64 or Word64 (C isn't the only language with a long).

Question 2: Why is haskell so slow? Is there a compiler flag that turns off the brakes or is it my implementation? (The latter is quite probable as haskell is a book with seven seals to me.)

Question 3: Can you offer me some hints how to optimize these implementations without changing the way I determine the factors? Optimization in any way: nicer, faster, more "native" to the language.

That was what I answered above. The answer was: 0) use optimization via -O2; 1) specialization (avoid unneeded polymorphism); 2) rem, not mod (a frequently forgotten optimization); and 3) the worker/wrapper transformation (perhaps the most common optimization).
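On point 2, rem is "sufficient" because rem truncates toward zero while mod rounds toward negative infinity, and the two only differ when the operands have different signs; number and candidate are both positive here. The sketch below is in Java rather than Haskell (Java's % behaves like rem and Math.floorMod like mod); the class and method names are made up for illustration.

```java
// Illustrative only: RemVsModDemo is a made-up name, not part of the answer.
class RemVsModDemo {
    // Java's % truncates toward zero, like Haskell's rem.
    static int rem(int a, int b) {
        return a % b;
    }

    // Math.floorMod rounds toward negative infinity, like Haskell's mod.
    static int mod(int a, int b) {
        return Math.floorMod(a, b);
    }

    public static void main(String[] args) {
        // Same result for non-negative operands.
        System.out.println(rem(7, 2) + " " + mod(7, 2));   // 1 1
        // Different results once the signs differ.
        System.out.println(rem(-7, 2) + " " + mod(-7, 2)); // -1 1
    }
}
```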

Question 4: Do my functional implementations permit LCO and hence avoid adding unnecessary frames onto the call stack?

Yes, that wasn't the issue. Good work and glad you considered this.

Why are static variables considered evil?

82 votes

I am a Java programmer who is new to the corporate world. Recently I've developed an application using Groovy and Java. All through the code I've used quite a good number of statics. I was asked by the senior technical lot to cut down on the number of statics used. I've googled this, and I find that many programmers are fairly against using static variables.

I find static variables more convenient to use. And I presume that they are efficient too (please correct me if I am wrong), because if I had to make 10,000 calls to a function within a class, I would be glad to make the method static and use a straightforward class.methodCall() on it instead of cluttering the memory with 10,000 instances of the class, right?

Moreover, statics reduce the inter-dependencies on the other parts of the code. They can act as perfect state holders. Adding to this, I find that statics are widely implemented in some languages like Smalltalk and Scala. So why is this opposition to statics so prevalent among programmers (especially in the world of Java)?

PS: please do correct me if my assumptions about statics are wrong.

Static variables represent global state. That's hard to reason about and hard to test: if I create a new instance of an object, I can reason about its new state within tests. If I use code which is using static variables, it could be in any state - and anything could be modifying it.

I could go on for quite a while, but the bigger concept to think about is that the tighter the scope of something, the easier it is to reason about. We're good at thinking about small things, but it's hard to reason about the state of a million line system if there's no modularity. This applies to all sorts of things, by the way - not just static variables.
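A small Java sketch of the testing problem, assuming two made-up counter classes (GlobalCounter and InstanceCounter are illustrative names, not from the question). The value seen through the static depends on everything that ran earlier; the value seen through a fresh instance does not.

```java
// Hypothetical classes for illustration only; not from the question.
class GlobalCounter {
    // Static field: one copy shared by every caller in the JVM.
    static int count = 0;

    static int next() {
        return ++count;
    }
}

class InstanceCounter {
    // Instance field: each object carries its own state.
    private int count = 0;

    int next() {
        return ++count;
    }
}

class StaticStateDemo {
    // The result depends on who called GlobalCounter.next() before us.
    static int firstValueFromStatic() {
        return GlobalCounter.next();
    }

    // The result is always 1: a fresh instance has a known starting state.
    static int firstValueFromInstance() {
        return new InstanceCounter().next();
    }

    public static void main(String[] args) {
        System.out.println(firstValueFromInstance());
        System.out.println(firstValueFromInstance());
        System.out.println(firstValueFromStatic());
        System.out.println(firstValueFromStatic());
    }
}
```

A test that asserts on firstValueFromStatic() has to know (or reset) whatever state earlier tests left behind; a test on firstValueFromInstance() does not.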

Why are private fields private to the type, not the instance?

80 votes

In C# (and many other languages) it's perfectly legitimate to access private fields of other instances of the same type. For example:

public class Foo
{
    private bool aBool;

    public void DoBar(Foo anotherFoo)
    {
        if(anotherFoo.aBool) ...
    }
}

As the C# specification (sections 3.5.1, 3.5.2) states, access to private fields is determined by type, not by instance. I've been discussing this with a colleague and we're trying to come up with a reason why it works like this (rather than restricting access to the same instance).

The best argument we could come up with is for equality checks where the class may want to access private fields to determine equality with another instance. Are there any other reasons? Or some golden reason that absolutely means it must work like this or something would be completely impossible?

I think one reason it works this way is because access modifiers work at compile time. As such, determining whether or not a given object is also the current object isn't easy to do. For example, consider this code:

public class Foo
{
    private int bar;

    public void Baz(Foo other)
    {
        other.bar = 2;
    }

    public void Boo()
    {
        Baz(this);
    }
}

Can the compiler necessarily figure out that other is actually this? Not in all cases. One could argue that this just shouldn't compile then, but that means we have a code path where a private instance member of the correct instance isn't accessible, which I think is even worse.

Only requiring type-level rather than object-level visibility ensures that the problem is tractable, as well as making a situation that seems like it should work actually work.

EDIT: Daniel Hilgarth's point that this reasoning is backwards does have merit. Language designers can create the language they want, and compiler writers must conform to it. That being said, language designers do have some incentive to make it easier for compiler writers to do their job. (Though in this case, it's easy enough to argue that private members could then only be accessed via this (either implicitly or explicitly)).

However, I believe that makes the issue more confusing than it needs to be. Most users (myself included) would find it unnecessarily limiting if the above code didn't work: after all, that's my data I'm trying to access! Why should I have to go through this?

In short, I think I may have overstated the case for it being "difficult" for the compiler. What I really meant to get across is that the above situation seems like one that the designers would like to have work.
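The equality-check motivation from the question can be sketched as follows. This is Java rather than C# (Java applies the same type-level rule for private), and Money is a hypothetical class used only for illustration.

```java
// Hypothetical value class, for illustration only.
class Money {
    private final long cents;

    Money(long cents) {
        this.cents = cents;
    }

    @Override
    public boolean equals(Object other) {
        if (!(other instanceof Money)) {
            return false;
        }
        // Reads a private field of a *different* instance: legal, because
        // private access is granted to the type, not to the object.
        return this.cents == ((Money) other).cents;
    }

    @Override
    public int hashCode() {
        return Long.hashCode(cents);
    }

    public static void main(String[] args) {
        System.out.println(new Money(500).equals(new Money(500))); // true
        System.out.println(new Money(500).equals(new Money(501))); // false
    }
}
```

With instance-level privacy, equals would need a public or internal accessor just to compare two objects of the same class.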

Is there a better waiting pattern for C#?

64 votes

I've found myself coding this type of thing a few times.

for (int i = 0; i < 10; i++)
{
   if (Thing.WaitingFor())
   {
      break;
   }
   Thread.Sleep(sleep_time);
}
if(!Thing.WaitingFor())
{
   throw new ItDidntHappenException();
}

It just looks like bad code, is there a better way of doing this / is it a symptom of bad design?

A much better way to implement this pattern is to have your Thing object expose an event on which the consumer can wait, for example a ManualResetEvent or AutoResetEvent. This greatly simplifies your consumer code to the following:

if (!Thing.ManualResetEvent.WaitOne(sleep_time)) {
  throw new ItDidntHappenException();
}

// It happened

The code on the Thing side is also not really any more complex.

public sealed class Thing {
  public readonly ManualResetEvent ManualResetEvent = new ManualResetEvent(false);

  private void TheAction() {
    ...
    // Done.  Signal the listeners
    ManualResetEvent.Set();
  }
}

Why do we usually use `||` not `|`?

63 votes

Possible Duplicate:
What's the difference between | and || in Java?

I'm just wondering why we usually use logical OR || between two booleans rather than bitwise OR |, though both work well.

I mean, look at the following:

if(true  | true)  // pass
if(true  | false) // pass
if(false | true)  // pass
if(false | false) // no pass
if(true  || true)  // pass
if(true  || false) // pass
if(false || true)  // pass
if(false || false) // no pass

Can we use | instead of ||? Same thing with & and &&.

If you use the || and && forms, rather than the | and & forms of these operators, Java will not bother to evaluate the right-hand operand when the left-hand operand alone determines the result.

It's a matter of whether you want to short-circuit the evaluation or not -- most of the time you want to.

A good way to illustrate the benefits of short-circuiting would be to consider the following example.

Boolean b = true;
if(b || foo.timeConsumingCall())
{
   //we entered without calling timeConsumingCall()
}

Another benefit, as Jeremy and Peter mentioned, for short-circuiting is the null reference check:

if(string != null && string.isEmpty())
{
    //we check for string being null before calling isEmpty()
}
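A rough way to see the difference for yourself, assuming a stand-in expensive() method in place of foo.timeConsumingCall() (all names here are illustrative): count how often the right-hand operand actually runs.

```java
// Illustrative sketch; ShortCircuitDemo and its methods are made-up names.
class ShortCircuitDemo {
    // Counts how many times the right-hand operand was evaluated.
    static int calls = 0;

    // Stand-in for an expensive call such as foo.timeConsumingCall().
    static boolean expensive() {
        calls++;
        return true;
    }

    // Returns how many times expensive() ran when combined with ||.
    static int callsWithLogicalOr(boolean left) {
        calls = 0;
        boolean ignored = left || expensive();
        return calls;
    }

    // Returns how many times expensive() ran when combined with |.
    static int callsWithBitwiseOr(boolean left) {
        calls = 0;
        boolean ignored = left | expensive();
        return calls;
    }

    // With &&, s.isEmpty() is never reached when s is null.
    static boolean isNonNullAndEmpty(String s) {
        return s != null && s.isEmpty();
    }

    public static void main(String[] args) {
        System.out.println(callsWithLogicalOr(true)); // 0: right side skipped
        System.out.println(callsWithBitwiseOr(true)); // 1: right side always runs
        System.out.println(isNonNullAndEmpty(null));  // false, no exception
    }
}
```

With & in place of && in isNonNullAndEmpty, passing null would throw a NullPointerException, because both operands are always evaluated.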


True-way solution in Java: parse 2 numbers from 2 strings and then return their sum

63 votes

Quite a stupid question. Given the code:

public static int sum(String a, String b) /* throws? WHAT? */ {
  int x = Integer.parseInt(a); // throws NumberFormatException
  int y = Integer.parseInt(b); // throws NumberFormatException
  return x + y;
}

Could you tell if it's good Java or not? What I'm talking about is, NumberFormatException is an unchecked exception. You don't have to specify it as part of sum() signature. Moreover, as far as I understand, the idea of unchecked exceptions is just to signal that program's implementation is incorrect, and even more, catching unchecked exceptions is a bad idea, since it's like fixing bad program at runtime.

Would somebody please clarify whether:

  1. I should specify NumberFormatException as a part of method's signature.
  2. I should define my own checked exception (BadDataException), handle NumberFormatException inside the method and re-throw it as BadDataException.
  3. I should define my own checked exception (BadDataException), validate both strings some way like regular expressions and throw my BadDataException if it doesn't match.
  4. Your idea?

Update:

Imagine it's not an open-source framework, but one that you have to use for some reason. You look at the method's signature and think, "OK, it never throws". Then, some day, you get an exception. Is that normal?

Update 2:

There are some comments saying my sum(String, String) is a bad design. I do absolutely agree, but for those who believe that original problem would just never appear if we had good design, here's an extra question:

The problem definition is like this: you have a data source where numbers are stored as Strings. This source may be XML file, web page, desktop window with 2 edit boxes, whatever.

Your goal is to implement the logic that takes these 2 Strings, converts them to ints and displays message box saying "the sum is xxx".

No matter what's the approach you use to design/implement this, you'll have these 2 points of inner functionality:

  1. A place where you convert String to int
  2. A place where you add 2 ints

The primary question of my original post is:

Integer.parseInt() expects correct string to be passed. Whenever you pass a bad string, it means that your program is incorrect (not "your user is an idiot"). You need to implement the piece of code where on one hand you have Integer.parseInt() with MUST semantics and on the other hand you need to be OK with the cases when input is incorrect - SHOULD semantics.

So, briefly: how do I implement SHOULD semantics if I only have MUST libraries.

This is a good question. I wish more people would think about such things.

IMHO, throwing unchecked exceptions is acceptable if you've been passed rubbish parameters.

Generally speaking, you shouldn't throw BadDataException because you shouldn't use Exceptions to control program flow. Exceptions are for the exceptional. Callers to your method can know before they call it if their strings are numbers or not, so passing rubbish in is avoidable and therefore can be considered a programming error, which means it's OK to throw unchecked exceptions.

Regarding declaring throws NumberFormatException: this is not that useful, because few will notice it, NumberFormatException being unchecked. However, IDEs can make use of it and offer to wrap the call in try/catch correctly. A good option is to use javadoc as well, e.g.:

/**
 * Adds two string numbers
 * @param a
 * @param b
 * @return
 * @throws NumberFormatException if either of a or b is not an integer
 */
public static int sum(String a, String b) throws NumberFormatException {
    int x = Integer.parseInt(a); 
    int y = Integer.parseInt(b); 
    return x + y;
}

EDITED:
The commenters have made valid points. You need to consider how this will be used and the overall design of your app.

If the method will be used all over the place, and it's important that all callers handle problems, then declare the method as throwing a checked exception (forcing callers to deal with problems), at the cost of cluttering the code with try/catch blocks.

If on the other hand we are using this method with data we trust, then declare it as above, because it is not expected to ever explode and you avoid the code clutter of essentially unnecessary try/catch blocks.
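For the checked-exception option, a minimal sketch might look like this. BadDataException is the name proposed in the question (it is not a JDK class), and describeSum is a hypothetical caller added to show the try/catch that callers are then forced to write.

```java
// BadDataException is the name suggested in the question; not a JDK class.
class BadDataException extends Exception {
    BadDataException(String message, Throwable cause) {
        super(message, cause);
    }
}

class SumDemo {
    // Checked variant: callers are forced to handle bad input.
    static int sum(String a, String b) throws BadDataException {
        try {
            return Integer.parseInt(a) + Integer.parseInt(b);
        } catch (NumberFormatException e) {
            // Translate the unchecked exception into the checked one.
            throw new BadDataException("not an integer: " + e.getMessage(), e);
        }
    }

    // Hypothetical caller showing the try/catch the checked exception forces.
    static String describeSum(String a, String b) {
        try {
            return "the sum is " + sum(a, b);
        } catch (BadDataException e) {
            return "bad input";
        }
    }

    public static void main(String[] args) {
        System.out.println(describeSum("2", "3"));   // the sum is 5
        System.out.println(describeSum("2", "abc")); // bad input
    }
}
```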

Why does this code crash?

59 votes

I went to a job interview today and was given this interesting question.

Besides the memory leak and the fact there is no virtual dtor, why does this code crash?

#include <iostream>

//besides the obvious mem leak, why does this code crash?

class Shape
{
public:
    virtual void draw() const = 0;
};

class Circle : public Shape
{
public:
    virtual void draw() const { }

    int radius;
};

class Rectangle : public Shape
{
public:
    virtual void draw() const { }

    int height;
    int width;
};

int main()
{
    Shape * shapes = new Rectangle[10];
    for (int i = 0; i < 10; ++i)
        shapes[i].draw();
}

You cannot index like that. You have allocated an array of Rectangles and stored a pointer to the first in shapes. When you do shapes[1] you're dereferencing (shapes + 1). Pointer arithmetic on a Shape * advances by sizeof(Shape), not sizeof(Rectangle), so this will not give you a pointer to the next Rectangle, but a pointer to what would be the next Shape in a presumed array of Shape. Of course, this is undefined behaviour. In your case, you're lucky and get a crash.

Using a pointer to Rectangle makes the indexing work correctly.

int main()
{
   Rectangle * shapes = new Rectangle[10];
   for (int i = 0; i < 10; ++i) shapes[i].draw();
}

If you want to have different kinds of Shapes in the array and use them polymorphically you need an array of pointers to Shape.

Can you explain why ++[[]][+[]]+[+[]] = 10

55 votes

Possible Duplicate:
(![]+[])[+[]]… Explain why this works

++[[]][+[]]+[+[]]

is valid and returns "10" in JavaScript (more examples here: http://sla.ckers.org/forum/read.php?24,33349,33405).

Can you explain why? I don't understand what's happening here.

If we split it up, the mess is equal to:

++[[]][+[]]
+
[+[]]

In JavaScript, it is true that +[] === 0.

Therefore, we can simplify the mess:

++[[]][0]
+
[0]

Because [[]][0] means: get the first element from [[]], it is true that:

  • [[]][0] === [] (this is not exactly true due to references, but it's what it comes down to)
  • ++[[]][0] === +[] + 1, since ++ first converts its operand to a number and then increments it by one.

Again, we can simplify the mess into something more legible:

+[] + 1
+
[0]

We already know that +[] === 0, so:

  • +[] + 1 === 0 + 1, and
  • 0 + 1 === 1 (a number, not the string "1")

Simplify it even more:

1
+
[0]

Also, this is true in JavaScript: [0] == "0", because it's joining an array with 1 element. Joining will concatenate the elements separated by ,. With one element, you can deduce that this logic will result in the first element itself.

The binary + converts [0] to the primitive "0" and, because one operand is now a string, concatenates. So, in the end we obtain:

1
+
"0"

=== "10" // Yay!

Specification details for +[]:

This is quite a maze, but to do +[], first it is being converted to a string because that's what + says:

11.4.6 Unary + Operator

The unary + operator converts its operand to Number type.

The production UnaryExpression : + UnaryExpression is evaluated as follows:

  1. Let expr be the result of evaluating UnaryExpression.

  2. Return ToNumber(GetValue(expr)).

ToNumber() says:

Object

Apply the following steps:

  1. Let primValue be ToPrimitive(input argument, hint Number).

  2. Return ToNumber(primValue).

ToPrimitive() says:

Object

Return a default value for the Object. The default value of an object is retrieved by calling the [[DefaultValue]] internal method of the object, passing the optional hint PreferredType. The behaviour of the [[DefaultValue]] internal method is defined by this specification for all native ECMAScript objects in 8.12.8.

[[DefaultValue]] says (with hint Number it first tries valueOf, but an array's valueOf returns the array itself, which is not a primitive, so it falls through to the same toString steps that are used for hint String):

8.12.8 [[DefaultValue]] (hint)

When the [[DefaultValue]] internal method of O is called with hint String, the following steps are taken:

  1. Let toString be the result of calling the [[Get]] internal method of object O with argument "toString".

  2. If IsCallable(toString) is true then,

a. Let str be the result of calling the [[Call]] internal method of toString, with O as the this value and an empty argument list.

b. If str is a primitive value, return str.

The .toString of an array says:

15.4.4.2 Array.prototype.toString ( )

When the toString method is called, the following steps are taken:

  1. Let array be the result of calling ToObject on the this value.

  2. Let func be the result of calling the [[Get]] internal method of array with argument "join".

  3. If IsCallable(func) is false, then let func be the standard built-in method Object.prototype.toString (15.2.4.2).

  4. Return the result of calling the [[Call]] internal method of func providing array as the this value and an empty arguments list.

So +[] comes down to +"", because [].join() === "".

Again, the + is defined as:

11.4.6 Unary + Operator

The unary + operator converts its operand to Number type.

The production UnaryExpression : + UnaryExpression is evaluated as follows:

  1. Let expr be the result of evaluating UnaryExpression.

  2. Return ToNumber(GetValue(expr)).

ToNumber is defined for "" as:

The MV of StringNumericLiteral ::: [empty] is 0.

So +"" === 0, and thus +[] === 0.

When is an integer<->pointer cast actually correct?

51 votes

The common folklore says that:

  • The type system exists for a reason. Integers and pointers are distinct types, casting between them is a malpractice in the majority of cases, may indicate a design error and should be avoided.

  • Even when such a cast is performed, no assumptions shall be made about the size of integers and pointers (casting void* to int is the simplest way to make the code fail on x64), and instead of int one should use intptr_t or uintptr_t from stdint.h.

Knowing that, when is it actually useful to perform such casts?

(Note: having a bit shorter code for the price of portability doesn't count as "actually useful".)


One case I know:

  • Some lock-free multiprocessor algorithms exploit the fact that a 2+-byte-aligned pointer has some redundancy. They then use the lowest bits of the pointer as boolean flags, for instance. With a processor having an appropriate instruction set, this may eliminate the need for a locking mechanism (which would be necessary if the pointer and the boolean flag were separate).
    (Note: This practice is even possible to do safely in Java via java.util.concurrent.atomic.AtomicMarkableReference)

Anything more?

I sometimes cast pointers to integers when they somehow need to be part of a hashsum. Also I cast them to integers to do some bit fiddling with them on certain implementations where it is guaranteed that pointers always have one or two spare bits left, where I can encode AVL or RB Tree information in the left/right pointers instead of having an additional member. But this is all so implementation-specific that I recommend never thinking of it as any kind of common solution. Also, I heard that hazard pointers can sometimes be implemented with such a thing.

In some situations I need a unique ID per object that I pass along to e.g. servers as my request id. Depending on the context when I need to save some memory, and it is worth it, I use the address of my object as such an id, and usually have to cast it to an integer.

When working with embedded systems (such as in Canon cameras, see CHDK) there are often magic addresses, so a (void*)0xFFBC5235 or similar is often found there too.

edit:

Just stumbled (in my mind) over pthread_self(), which returns a pthread_t, which is usually a typedef to an unsigned integer. Internally, though, it is a pointer to some thread struct representing the thread in question. In general, it might be used elsewhere as an opaque handle.

What is 'print' in Python?

42 votes

I understand what print does, but of what "type" is the operator? I think it's a function, but why does this fail?

>>>print print
SyntaxError: invalid syntax

Isn't print a function? Shouldn't it print something like this?

>>>print print
<function print at ...>

In Python 2.7 and earlier, print is a statement. In Python 3, print is a function. To use the print function in Python 2.6 or 2.7, you can do

[~/repo/py]
|4>from __future__ import print_function

[~/repo/py]
|5>print print
-->print(print)
<built-in function print>

See this section from the Python Language Reference, as well as PEP 3105 for why it changed.

For loop instead of while

42 votes

I'm reading through the jQuery source and I stumbled upon the following piece of code (available here):

for (; i < length;) {
    if (callback.apply(object[i++], args) === false) {
        break;
    }
}

Why is a for loop used here instead of a while loop?

I vote for someone having an affinity for bad coding style. That's the only explanation I can see for both the for loop and the i++ being placed inside of an array subscript. I think this would be better:

while (i < length && callback.apply(object[i], args)) {
    i++;
}

Or if you happen to feel that the === operator in the original example has value, then:

while (i < length && callback.apply(object[i], args) !== false) {
    i++;
}

Another possible reason for doing this may have been as a performance optimization, although this quick benchmark that I put together seems to disprove that theory. On Windows the while loop above is 20% faster than the original for loop in Chrome, while in IE and Firefox both loops perform the same. On OS X the for loop has a 10% advantage in Firefox, there is no difference between the two in Chrome, and Safari prefers the while loop by 6%.

So from a performance standpoint it's a wash. If anything then judging by market share you would probably want to optimize for Chrome on Windows before optimizing for Firefox on Mac, in which case the while loop would be preferred.

Which says to me that it's unlikely that performance optimization played a factor with this code. And I return to my original theory that it's just an example of poor coding style making it past the code-review process.

Fastest way to strip all non-printable characters from a Java String

41 votes

What is the fastest way to strip all non-printable characters from a String in Java?

So far I've tried and measured on a 138-byte, 131-character String:

  • String's replaceAll() - slowest method
    • 517009 results / sec
  • Precompile a Pattern, then use Matcher's replaceAll()
    • 637836 results / sec
  • Use StringBuffer, get codepoints using codePointAt() one-by-one and append to StringBuffer
    • 711946 results / sec
  • Use StringBuffer, get chars using charAt() one-by-one and append to StringBuffer
    • 1052964 results / sec
  • Preallocate a char[] buffer, get chars using charAt() one-by-one and fill this buffer, then convert back to String
    • 2022653 results / sec
  • Preallocate 2 char[] buffers - old and new, get all chars for existing String at once using getChars(), iterate over old buffer one-by-one and fill new buffer, then convert new buffer to String - my own fastest version
    • 2502502 results / sec
  • Same stuff with 2 buffers - only using byte[], getBytes() and specifying encoding as "utf-8"
    • 857485 results / sec
  • Same stuff with 2 byte[] buffers, but specifying encoding as a constant Charset.forName("utf-8")
    • 791076 results / sec
  • Same stuff with 2 byte[] buffers, but specifying encoding as 1-byte local encoding (barely a sane thing to do)
    • 370164 results / sec

My best try was the following:

    char[] oldChars = new char[s.length()];
    s.getChars(0, s.length(), oldChars, 0);
    char[] newChars = new char[s.length()];
    int newLen = 0;
    for (int j = 0; j < s.length(); j++) {
        char ch = oldChars[j];
        if (ch >= ' ') {
            newChars[newLen] = ch;
            newLen++;
        }
    }
    s = new String(newChars, 0, newLen);

Any thoughts on how to make it even faster?

Bonus points for answering a very strange question: why does using the "utf-8" charset name directly yield better performance than using a pre-allocated static const Charset.forName("utf-8")?

Update

  • Suggestion from ratchet freak yields impressive 3105590 results / sec performance, a +24% improvement!
  • Suggestion from Ed Staub yields yet another improvement - 3471017 results / sec, a +12% over previous best.

Update 2

I've tried my best to collect all the proposed solutions and their cross-mutations and published them as a small benchmarking framework on GitHub. Currently it sports 17 algorithms. One of them is "special": the Voo1 algorithm (provided by SO user Voo) employs intricate reflection tricks, thus achieving stellar speeds, but it messes up the JVM strings' state, so it's benchmarked separately.

You're welcome to check it out and run it to determine results on your box. Here's a summary of the results I got on mine. Its specs:

  • Debian sid
  • Linux 2.6.39-2-amd64 (x86_64)
  • Java installed from a package sun-java6-jdk-6.24-1, JVM identifies itself as
    • Java(TM) SE Runtime Environment (build 1.6.0_24-b07)
    • Java HotSpot(TM) 64-Bit Server VM (build 19.1-b02, mixed mode)

Different algorithms show quite different results given different sets of input data. I've run the benchmark in 3 modes:

Same single string

This mode works on the same single string, provided by the StringSource class as a constant. The showdown is:

 Ops / s  │ Algorithm
──────────┼──────────────────────────────
6 535 947 │ Voo1
──────────┼──────────────────────────────
5 350 454 │ RatchetFreak2EdStaub1GreyCat1
5 249 343 │ EdStaub1
5 002 501 │ EdStaub1GreyCat1
4 859 086 │ ArrayOfCharFromStringCharAt
4 295 532 │ RatchetFreak1
4 045 307 │ ArrayOfCharFromArrayOfChar
2 790 178 │ RatchetFreak2EdStaub1GreyCat2
2 583 311 │ RatchetFreak2
1 274 859 │ StringBuilderChar
1 138 174 │ StringBuilderCodePoint
  994 727 │ ArrayOfByteUTF8String
  918 611 │ ArrayOfByteUTF8Const
  756 086 │ MatcherReplace
  598 945 │ StringReplaceAll
  460 045 │ ArrayOfByteWindows1251

In charted form: Same single string chart

Multiple strings, 100% of strings contain control characters

Source string provider pre-generated lots of random strings using (0..127) character set - thus almost all strings contained at least one control character. Algorithms received strings from this pre-generated array in round-robin fashion.

 Ops / s  │ Algorithm
──────────┼──────────────────────────────
2 123 142 │ Voo1
──────────┼──────────────────────────────
1 782 214 │ EdStaub1
1 776 199 │ EdStaub1GreyCat1
1 694 628 │ ArrayOfCharFromStringCharAt
1 481 481 │ ArrayOfCharFromArrayOfChar
1 460 067 │ RatchetFreak2EdStaub1GreyCat1
1 438 435 │ RatchetFreak2EdStaub1GreyCat2
1 366 494 │ RatchetFreak2
1 349 710 │ RatchetFreak1
  893 176 │ ArrayOfByteUTF8String
  817 127 │ ArrayOfByteUTF8Const
  778 089 │ StringBuilderChar
  734 754 │ StringBuilderCodePoint
  377 829 │ ArrayOfByteWindows1251
  224 140 │ MatcherReplace
  211 104 │ StringReplaceAll

In charted form: Multiple strings, 100% concentration

Multiple strings, 1% of strings contain control characters

Same as the previous mode, but only 1% of the strings were generated with control characters; the other 99% were generated using the [32..127] character set, so they couldn't contain control characters at all. This synthetic load comes closest to the real-world application of this algorithm at my place.

 Ops / s  │ Algorithm
──────────┼──────────────────────────────
3 711 952 │ Voo1
──────────┼──────────────────────────────
2 851 440 │ EdStaub1GreyCat1
2 455 796 │ EdStaub1
2 426 007 │ ArrayOfCharFromStringCharAt
2 347 969 │ RatchetFreak2EdStaub1GreyCat2
2 242 152 │ RatchetFreak1
2 171 553 │ ArrayOfCharFromArrayOfChar
1 922 707 │ RatchetFreak2EdStaub1GreyCat1
1 857 010 │ RatchetFreak2
1 023 751 │ ArrayOfByteUTF8String
  939 055 │ StringBuilderChar
  907 194 │ ArrayOfByteUTF8Const
  841 963 │ StringBuilderCodePoint
  606 465 │ MatcherReplace
  501 555 │ StringReplaceAll
  381 185 │ ArrayOfByteWindows1251

In charted form: Multiple strings, 1% concentration

It's very hard for me to decide who provided the best answer, but given that the best solution for the real-world application was given/inspired by Ed Staub, I guess it would be fair to mark his answer. Thanks to all who took part in this; your input was very helpful and invaluable. Feel free to run the test suite on your box and propose even better solutions (working JNI solution, anyone?).


If it is reasonable to embed this method in a class which is not shared across threads, then you can reuse the buffer:

char [] oldChars = new char[5];

String stripControlChars(String s)
{
    final int inputLen = s.length();
    if ( oldChars.length < inputLen )
    {
        oldChars = new char[inputLen];
    }
    s.getChars(0, inputLen, oldChars, 0);

etc...

This is a big win - 20% or so, as I understand the current best case.

If this is to be used on potentially large strings and the memory "leak" is a concern, a weak reference can be used.
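One way the elided part could be completed, as a sketch: it borrows the ch >= ' ' filter from the question and compacts the reused buffer in place. The class name ControlCharStripper is made up, and this is not necessarily the code that was benchmarked.

```java
// Sketch only: not thread-safe, because the buffer is shared between calls.
class ControlCharStripper {
    // Reused across calls; grows when a longer input arrives.
    private char[] oldChars = new char[5];

    String stripControlChars(String s) {
        final int inputLen = s.length();
        if (oldChars.length < inputLen) {
            oldChars = new char[inputLen];
        }
        s.getChars(0, inputLen, oldChars, 0);

        // Compact in place: keep only chars >= ' ', as in the question's loop.
        int newLen = 0;
        for (int j = 0; j < inputLen; j++) {
            char ch = oldChars[j];
            if (ch >= ' ') {
                oldChars[newLen++] = ch;
            }
        }
        // new String(...) copies the chars, so the buffer can be reused.
        return new String(oldChars, 0, newLen);
    }

    public static void main(String[] args) {
        ControlCharStripper stripper = new ControlCharStripper();
        System.out.println(stripper.stripControlChars("a\tb\nc")); // abc
    }
}
```

Writing back into the same buffer is safe because the write index never runs ahead of the read index.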

Automatically pick a variable type big enough to hold a specified number

38 votes

Is there any way in C++ to define a type that is big enough to hold at most a specific number, presumably using some clever template code? For example, I want to be able to write:

Integer<10000>::type dataItem;

And have that type resolve to the smallest type that is big enough to hold the specified value?

Background: I need to generate some variable definitions using a script from an external data file. I guess I could make the script look at the values and then use uint8_t, uint16_t, uint32_t, etc. depending on the value, but it seems more elegant to build the size into the generated C++ code.

I can't see any way to make a template that can do this, but knowing C++ templates, I'm sure there is a way. Any ideas?

Boost.Integer already has facilities for Integer Type Selection:

boost::int_max_value_t<V>::least

The smallest, built-in, signed integral type that can hold all the values in the inclusive range 0 - V. The parameter should be a positive number.

boost::uint_value_t<V>::least

The smallest, built-in, unsigned integral type that can hold all positive values up to and including V. The parameter should be a positive number.

Why does new String("") compile while char c = '' does not?

38 votes

Why are empty Strings valid and empty chars are not? I would have thought an empty String is not a string but just a placeholder. The same for a char, but creating an empty char does not even compile.

What I'm wondering is why the following occurs - Compiles -

String s = "";

Does not compile -

char c = '';

Because char represents a single character, which '' isn't. A String can contain zero or more characters, but a character cannot be anything other than a single character.
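A short Java sketch of the same point; EmptyCharDemo and its method names are illustrative only. The closest a char gets to "nothing" is the NUL character, which is still exactly one character.

```java
// Illustrative only: EmptyCharDemo is a made-up name.
class EmptyCharDemo {
    // An empty String is a real object that holds zero characters.
    static int emptyStringLength() {
        String s = "";
        return s.length();
    }

    // A char always holds exactly one UTF-16 code unit, even for NUL.
    static int nulCharAsStringLength() {
        char c = '\0';
        return String.valueOf(c).length();
    }

    public static void main(String[] args) {
        System.out.println(emptyStringLength());     // 0
        System.out.println(nulCharAsStringLength()); // 1
        // char c = '';  // does not compile: a char literal needs one character
    }
}
```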

Should I Return None or (None, None)?

36 votes

We have an object method that returns a city/state tuple, i.e. ('Boston', 'MA'). Under some valid circumstances, there is no valid city/state to return. Stylistically, does it make more sense to return None, or a two element tuple containing (None, None) in that case? Thoughts, considerations, meditations?

I would return None. If there is no result, why return something that looks like a result?

It is also easier to test:

result = getCity()
if result:
   # do something

I would only return (None, None) if it were possible that only one of the two values is None (i.e. ('Boston', None)). It would be more consistent in this case.