John Fremlin's blog: What's wrong with using double floats as the only numerical type

Posted 2010-10-15 22:00:00 GMT

The double-precision (64-bit) IEEE floating point representation can represent integers between -2^53 and to 2^53 exactly (2^53 is about 10^16). Jamie Zawinski recently posted an article describing his outrage on discovering that JavaScript specifies that all numerical calculations use doubles. But what's really wrong with doing that? There are legitimate views on both sides (if you're using doubles for most things, creating another class of numbers may over-complicate the system).

Firstly, the normal machine representation of integers (modulo 2^64 for example) is a ring. This means for example, that c*(a + b) is always precisely identical to ca + cb. This is not true in floating point — also (a+b)+c may not be a+(b+c). Additionally, if b is not zero, then (a*b)/b = a. This is also not true in floating point.

As an aside, and as Jamie Zawinski points out, the Common Lisp standard mandates arbitrary sized integers as the default, giving the programmer the option to specify the range a variable can take so that the program can be optimized for specific machines. The result of a division operation is an arbitrarily sized rational. This ensures that (a/b)*b = a (if b is not zero). The fact that these identities hold removes the need to consider whole classes of problems. However, the practical consequence is sometimes that the memory required to store the exact results becomes vast causing the program to crash or run slowly — at least it never gives false results.

Secondly, the twos-complement representation of integers is very natural; it is simple to specify arithmetic and bitwise operations on it. IEEE floating point numbers have all sorts of NaNs and other special values. The bitpattern of all zeros is not zero for a IEEE float but a NaN. With integers modulo 2^64, the only operation that can raise an error is dividing by zero; wrap-around is silent (and sometimes convenient) — with floating point, errors can be signalled on underflow, overflow, and so on. The problem that basic axioms like associativity of addition and multiplication do not hold is compounded by the possibility that the control flow is affected. This makes programs harder to reason about.

Thirdly, the fact that twos-complement arithmetic is natural and efficient (and can be efficiently used to implement arbitrary precision integer arithmetic for applications like RSA), means that machines have built-in support for it. How to index into an array by double? JavaScript implementations use machine-width integers where possible because of this. The SSE and FPU units in modern x86 processors are slightly separate from the main processor using different register files; why mandate that your program should run twice as slowly as necessary?

In the old days before large floating-point units became commonplace, fixed point arithmetic was popular — as I've said before you need to figure out which scale you data is on anyway if you're using floating point, so the up-front cost of fixed point (deciding what to divide by) is not as onerous as it appears.

Very nice post.

I think a more natural way to think about floating points is to treat them as intervals instead of a number. Thinking about how many bits of precision you can get after the computation is a similar way.

Though floating points are tricky, I like many of its features like NaN, +0 and -0, and exceptions.


Posted 2010-10-19 07:29:06 GMT by Anonymous from

The interval idea is nice. However, suppose you consider a number a, that is the range [a-e,a+e] and a number b, [b-d,b+d] then when you add them, you get a+b, which has a natural range in terms of the precision of the representation (the value of the ULP), but an uncertainty of + or - e+d, which is probably greater. Keeping track of the possible maximum error is also not very sensible, as in real life the errors will (probably) cancel out.

Posted 2010-11-23 11:57:32 GMT by John Fremlin

All zeros *is* zero, and not a Nan.

e.g., in python:

>>> import struct

>>> struct.unpack('d', '\x00'*8)


Posted 2011-01-27 23:21:06 GMT by Anonymous from

Post a comment