Best floating-point questions in June 2012

26 votes

I have been reading a lot about floats and computer-processed floating-point operations. The biggest question I see when reading about them is why are they so inaccurate? I understand this is because binary cannot accurately represent all real numbers, so the numbers are rounded to the 'best' approximation.

My question is, knowing this, why do we still use binary as the base for computer operations? Surely using a larger base number than 2 would increase the accuracy of floating-point operations exponentially, would it not?

What are the advantages of using a binary number system for computers as opposed to another base, and has another base ever been tried? Or is it even possible?
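To make the rounding described above concrete, here is a minimal Python sketch (Python is only an illustration choice; the question itself is language-agnostic). It uses the standard `fractions` and `decimal` modules to show the exact value that a binary double stores for the literal 0.1.

```python
from decimal import Decimal
from fractions import Fraction

# The double nearest to 0.1, as an exact ratio of integers.
stored = Fraction(0.1)
print(stored)          # numerator over a power of two
print(Decimal(0.1))    # the same value written out in decimal

# The denominator is a power of two, so the stored value cannot equal 1/10.
print(stored == Fraction(1, 10))          # False
# 0.5 is a dyadic fraction, so binary stores it exactly.
print(Fraction(0.5) == Fraction(1, 2))    # True
```

Every base has this limitation for fractions whose denominators contain prime factors the base lacks (1/3 in base 10, for instance), so a larger base would not by itself make floating point exact.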

Computers are built on transistors, which have a "switched on" state, and a "switched off" state. This corresponds to high and low voltage. Pretty much all digital integrated circuits work in this binary fashion.

Ignoring the fact that transistors just simply work this way, using a different base (e.g. base 3) would require these circuits to operate at an intermediate voltage state (or several) as well as 0V and their highest operating voltage. This is more complicated, and can result in problems at high frequencies - how can you tell whether a signal is just transitioning between 2V and 0V, or actually at 1V?

When we get down to the floating point level, we are (as nhahtdh mentioned in their answer) mapping an infinite space of numbers down to a finite storage space. It's an absolute guarantee that we'll lose some precision. One advantage of IEEE floats, though, is that the precision is relative to the magnitude of the value.
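The point about precision being relative to magnitude can be checked with a short Python sketch. The helper names `next_up` and `spacing` are made up for this illustration, and the bit-pattern trick assumes a positive, finite IEEE 754 double.

```python
import struct

def next_up(x: float) -> float:
    """Next representable double above a positive, finite x (bit-pattern increment)."""
    bits = struct.unpack("<q", struct.pack("<d", x))[0]
    return struct.unpack("<d", struct.pack("<q", bits + 1))[0]

def spacing(x: float) -> float:
    """Gap between x and the next double above it."""
    return next_up(x) - x

# The absolute gap grows with x, while gap / x stays within a factor of two.
for x in (1.0, 1000.0, 1e6, 1e12):
    gap = spacing(x)
    print(f"x = {x:g}  gap = {gap:.3e}  gap/x = {gap / x:.3e}")
```

For doubles the ratio gap/x always lies between 2^-53 and 2^-52, which is what "relative precision" means here.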

Update: You should also check out Tunguska, a ternary computer emulator. It uses base-3 instead of base-2, which makes for some interesting (albeit mind-bending) concepts.

printing float, preserving precision

14 votes

I am writing a program that prints floating point literals to be used inside another program.

How many digits do I need to print in order to preserve the precision of the original float?

Since a float has 24 * (log(2) / log(10)) = 7.2247199 decimal digits of precision, my initial thought was that printing 8 digits should be enough. But if I'm unlucky, those 0.2247199 get distributed to the left and to the right of the 7 significant digits, so I should probably print 9 decimal digits.

Is my analysis correct? Is 9 decimal digits enough for all cases? Like printf("%.9g", x);?

Is there a standard function that converts a float to a string with the minimum number of decimal digits required for that value, in the cases where 7 or 8 are enough, so I don't print unnecessary digits?

Note: I cannot use hexadecimal floating point literals, because standard C++ does not support them.

In order to guarantee that a binary->decimal->binary round trip recovers the original binary value, IEEE 754 requires the following:

The original binary value will be preserved by converting to decimal and back again using:

    5 decimal digits for binary16
    9 decimal digits for binary32
    17 decimal digits for binary64
    36 decimal digits for binary128

For other binary formats the required number of decimal digits is

    1 + ceiling(p*log10(2)) 

where p is the number of significant bits in the binary format, e.g. 24 bits for binary32.
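As a quick sanity check of the formula, the following Python sketch evaluates it for the four formats listed above. The function name `roundtrip_digits` is invented for this example; the significand widths 11, 53 and 113 (implicit leading bit included) are the standard IEEE 754 values and are not stated in the text above.

```python
import math

def roundtrip_digits(p: int) -> int:
    """Decimal digits sufficient for a binary -> decimal -> binary round trip
    of a format with p significant bits: 1 + ceil(p * log10(2))."""
    return 1 + math.ceil(p * math.log10(2))

formats = (("binary16", 11), ("binary32", 24), ("binary64", 53), ("binary128", 113))
for name, p in formats:
    print(name, roundtrip_digits(p))   # 5, 9, 17, 36
```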

In C, the functions you can use for these conversions are snprintf() and strtof/strtod/strtold().
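The same round trip can be sketched in Python, with some assumptions: Python's `float` is binary64, so binary32 is emulated through `struct`, and the text is parsed to a double before being narrowed, which is a simplification compared with calling strtof() directly. The helper names are invented for this example.

```python
import struct

def f32_from_bits(bits: int) -> float:
    """Interpret a 32-bit pattern as an IEEE 754 binary32 value."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def f32_bits(x: float) -> int:
    """Round x to binary32 and return its bit pattern."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def survives(bits: int, digits: int) -> bool:
    """True if printing with `digits` significant digits and parsing back
    yields the same binary32 value."""
    text = "%.*g" % (digits, f32_from_bits(bits))
    return f32_bits(float(text)) == bits

# 1000 consecutive binary32 values starting at 1000.0.
start = f32_bits(1000.0)
sample = range(start, start + 1000)

failures_8 = [b for b in sample if not survives(b, 8)]
failures_9 = [b for b in sample if not survives(b, 9)]
print(len(failures_8), len(failures_9))
```

Just above 1000 the binary32 values are spaced 2^-14 apart (about 0.00006), while 8 significant digits can only distinguish steps of 0.0001, so some neighbours must collapse to the same string. With 9 digits no value in the sample is lost.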

Of course, in some cases even more digits can be useful (no, they are not always "noise", depending on the implementation of the decimal conversion routines such as snprintf() ). Consider e.g. printing dyadic fractions.
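A small Python illustration of the dyadic-fraction remark, assuming a Python build whose `%` formatting is correctly rounded (current CPython builds normally are): the extra digits of 2^-30 are exact, not noise.

```python
from decimal import Decimal

x = 2.0 ** -30           # a dyadic fraction, stored exactly

print("%.9g" % x)        # rounded to 9 significant digits
print("%.21g" % x)       # all 21 significant digits of the exact value
print(Decimal(x))        # exact value held by the float, for comparison
```

The 21-digit string is the exact value of x, whereas the 9-digit string is only an approximation of it.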

Why does C# allow an *implicit* conversion from Long to Float, when this could lose precision?

13 votes

A similar question, "Long in Float, why?", does not answer what I am looking for.

The C# standard allows an implicit conversion from long to float. But a long greater than 2^24 cannot, in general, be represented exactly as a float, so it is liable to lose its 'value'. The C# standard clearly states that a long to float conversion may lose 'precision' but will never lose 'magnitude'.

My Questions are
  1. In reference to integral types, what is meant by 'precision' and 'magnitude'? Isn't the number n totally different from the number n+1, unlike real numbers, where 3.333333 and 3.333329 may be considered close enough for a calculation (depending on what precision the programmer wants)?
  2. Isn't allowing an implicit conversion from long to float an invitation to subtle bugs, as it can lead a long to 'silently' lose value? (As a C# programmer I am accustomed to the compiler doing an excellent job of guarding me against such issues.)

So what could have been the rationale of C# language design team in allowing this conversion as implicit? What is it that I am missing here that justifies implicit conversion from long to float?

In general, floating point numbers don't represent many numbers exactly. By their nature they are inexact and subject to precision errors. It really doesn't add value to warn you about what is always the case with floating point.
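To illustrate the "loses precision, keeps magnitude" distinction, here is a Python sketch that emulates the long to float conversion. The helper `to_float32` is made up for this example and assumes the integer is exactly representable as a double (at most 2^53 in absolute value), so that no intermediate rounding occurs.

```python
import struct

def to_float32(n: int) -> float:
    """Round an integer to the nearest IEEE 754 binary32 value.
    Assumes |n| <= 2**53 so that float(n) is exact."""
    return struct.unpack("<f", struct.pack("<f", float(n)))[0]

for n in (2**24, 2**24 + 1, 123456789, 2**53):
    f = to_float32(n)
    rel_err = abs(f - n) / n
    print(n, "->", int(f), f"(relative error {rel_err:.2e})")
```

The low digits can change (123456789 becomes 123456792), but the relative error stays below 2^-24, so the order of magnitude is always preserved.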

Round-twice error in .NET's Double.ToString method

12 votes

Mathematically, consider for this question the rational number

8725724278030350 / 2**48

where ** in the denominator denotes exponentiation, i.e. the denominator is 2 to the 48th power. (The fraction is not in lowest terms, reducible by 2.) This number is exactly representable as a System.Double. Its decimal expansion is

31.0000000000000'49'73799150320701301097869873046875 (exact)

where the apostrophes do not represent missing digits but merely mark the boundaries where rounding to 15 and 17 digits, respectively, is to be performed.

Note the following: If this number is rounded to 15 digits, the result will be 31 (followed by thirteen 0s) because the next digits (49...) begin with a 4 (meaning round down). But if the number is first rounded to 17 digits and then rounded to 15 digits, the result could be 31.0000000000001. This is because the first rounding rounds up, turning the digits 49... into exactly 50 (the next digits were 73...), and the second rounding might then round up again (when the midpoint-rounding rule says "round away from zero").
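The double rounding described above can be reproduced with exact decimal arithmetic. The following is a sketch using Python's `decimal` module, with `ROUND_HALF_UP` standing in for the "round away from zero" midpoint rule. It only models the arithmetic and makes no claim about how .NET is implemented internally.

```python
from decimal import Context, Decimal, ROUND_HALF_UP

# Exact decimal expansion of the double 8725724278030350 / 2**48.
x = Decimal(8725724278030350 / 2**48)

# Round-half-away-from-zero contexts with 15 and 17 significant digits.
ctx15 = Context(prec=15, rounding=ROUND_HALF_UP)
ctx17 = Context(prec=17, rounding=ROUND_HALF_UP)

once = ctx15.plus(x)                # exact value -> 15 digits
twice = ctx15.plus(ctx17.plus(x))   # exact value -> 17 digits -> 15 digits

print(x)       # 31.00000000000004973799150320701301097869873046875
print(once)    # 31.0000000000000
print(twice)   # 31.0000000000001
```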

(There are many more numbers with the above characteristics, of course.)

Now, it turns out that .NET's standard string representation of this number is "31.0000000000001". The question: Isn't this a bug? By standard string representation we mean the String produced by the parameterless Double.ToString() instance method, which is of course identical to what is produced by ToString("G").

An interesting thing to note is that if you cast the above number to System.Decimal then you get a decimal that is 31 exactly! See this Stack Overflow question for a discussion of the surprising fact that casting a Double to Decimal involves first rounding to 15 digits. This means that casting to Decimal makes a correct round to 15 digits, whereas calling ToString() makes an incorrect one.

To sum up, we have a floating-point number that, when output to the user, is 31.0000000000001, but when converted to Decimal (where 29 digits are available), becomes 31 exactly. This is unfortunate.

Here's some C# code for you to verify the problem:

using System;
using System.Globalization;
using System.Linq;

static class Program
{
  static void Main()
  {
    // The double nearest to this literal is exactly 8725724278030350 / 2^48.
    const double evil = 31.0000000000000497;
    string exactString = DoubleConverter.ToExactString(evil); // Jon Skeet, http://csharpindepth.com/Articles/General/FloatingPoint.aspx

    Console.WriteLine("Exact value (Jon Skeet): {0}", exactString);   // writes 31.00000000000004973799150320701301097869873046875
    Console.WriteLine("General format (G): {0}", evil);               // writes 31.0000000000001
    Console.WriteLine("Round-trip format (R): {0:R}", evil);          // writes 31.00000000000005

    // Raw bytes of the double, in the machine's byte order.
    Console.WriteLine();
    Console.WriteLine("Binary repr.: {0}", String.Join(", ", BitConverter.GetBytes(evil).Select(b => "0x" + b.ToString("X2"))));

    Console.WriteLine();
    decimal converted = (decimal)evil;
    Console.WriteLine("Decimal version: {0}", converted);             // writes 31
    decimal preciseDecimal = decimal.Parse(exactString, CultureInfo.InvariantCulture);
    Console.WriteLine("Better decimal: {0}", preciseDecimal);         // writes 31.000000000000049737991503207
  }
}

The above code uses Skeet's ToExactString method. If you don't want to use his stuff (can be found through the URL), just delete the code lines above dependent on exactString. You can still see how the Double in question (evil) is rounded and cast.

ADDITION:

OK, so I tested some more numbers, and here's a table:

  exact value (truncated)       "R" format         "G" format     decimal cast
 -------------------------  ------------------  ----------------  ------------
 6.00000000000000'53'29...  6.0000000000000053  6.00000000000001  6
 9.00000000000000'53'29...  9.0000000000000053  9.00000000000001  9
 30.0000000000000'49'73...  30.00000000000005   30.0000000000001  30
 50.0000000000000'49'73...  50.00000000000005   50.0000000000001  50
 200.000000000000'51'15...  200.00000000000051  200.000000000001  200
 500.000000000000'51'15...  500.00000000000051  500.000000000001  500
 1020.00000000000'50'02...  1020.000000000005   1020.00000000001  1020
 2000.00000000000'50'02...  2000.000000000005   2000.00000000001  2000
 3000.00000000000'50'02...  3000.000000000005   3000.00000000001  3000
 9000.00000000000'54'56...  9000.0000000000055  9000.00000000001  9000
 20000.0000000000'50'93...  20000.000000000051  20000.0000000001  20000
 50000.0000000000'50'93...  50000.000000000051  50000.0000000001  50000
 500000.000000000'52'38...  500000.00000000052  500000.000000001  500000
 1020000.00000000'50'05...  1020000.000000005   1020000.00000001  1020000

The first column gives the exact (though truncated) value that the Double represents. The second column gives the string representation from the "R" format string. The third column gives the usual string representation. And finally the fourth column gives the System.Decimal that results from converting this Double.

We conclude the following:

  • Round to 15 digits by ToString() and round to 15 digits by conversion to Decimal disagree in very many cases
  • Conversion to Decimal also rounds incorrectly in many cases, and the errors in these cases cannot be described as "round-twice" errors
  • In my cases, ToString() seems to yield a bigger number than Decimal conversion when they disagree (no matter which of the two rounds correctly)

I only experimented with cases like the above. I haven't checked if there are rounding errors with numbers of other "forms".

So from your experiments, it appears that Double.ToString doesn't do correct rounding.

That's rather unfortunate, but not particularly surprising: doing correct rounding for binary to decimal conversions is nontrivial, and also potentially quite slow, requiring multiprecision arithmetic in corner cases. See David Gay's dtoa.c code for one example of what's involved in correctly-rounded double-to-string and string-to-double conversion. (Python currently uses a variant of this code for its float-to-string and string-to-float conversions.)
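For comparison, here is what a correctly rounding conversion produces for a few of the values from the question. This is a sketch that assumes a Python build using the dtoa-based conversion mentioned above (the default in current CPython); the `samples` tuple is just a name chosen for this example.

```python
from decimal import Decimal

samples = (31.0000000000000497, 30.00000000000005, 6.0000000000000053)

for x in samples:
    print(Decimal(x))     # exact value of the double
    print("%.15g" % x)    # rounded once to 15 significant digits
    print("%.17g" % x)    # rounded once to 17 significant digits
    print(repr(x))        # shortest string that round-trips
    print()
```

With a single correct rounding, the first two values print as 31 and 30 at 15 digits, while the third prints as 6.00000000000001.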

Even the current IEEE 754 standard for floating-point arithmetic recommends, but doesn't require, that conversions from binary floating-point types to decimal strings are always correctly rounded. Here's a snippet from section 5.12.2, "External decimal character sequences representing finite numbers".

There might be an implementation-defined limit on the number of significant digits that can be converted with correct rounding to and from supported binary formats. That limit, H, shall be such that H ≥ M+3 and it should be that H is unbounded.

Here M is defined as the maximum of Pmin(bf) over all supported binary formats bf, and since Pmin(float64) is defined as 17 and .NET supports the float64 format via the Double type, M should be at least 17 on .NET. In short, this means that if .NET were to follow the standard, it would be providing correctly rounded string conversions up to at least 20 significant digits. So it looks as though the .NET Double doesn't meet this standard.

In answer to the 'Is this a bug' question, much as I'd like it to be a bug, there really doesn't seem to be any claim of accuracy or IEEE 754 conformance anywhere that I can find in the number formatting documentation for .NET. So it might be considered undesirable, but I'd have a hard time calling it an actual bug.


EDIT: Jeppe Stig Nielsen points out that the System.Double page on MSDN states that

Double complies with the IEC 60559:1989 (IEEE 754) standard for binary floating-point arithmetic.

It's not clear to me exactly what this statement of compliance is supposed to cover, but even for the older 1985 version of IEEE 754, the string conversion described seems to violate the binary-to-decimal requirements of that standard.

Given that, I'll happily upgrade my assessment to 'possible bug'.