Three Myths About Floating-Point Numbers

A single-precision floating-point number is represented by 32 bits and hides various wonderful encoding techniques. However, some of those tricks can make calculations imprecise, so it’s crucial to know how to work with those numbers.

Let’s have a look at three common misconceptions.

This is a guest post from Adam Sawicki

Adam is a software engineer interested in graphics, media, and game development. He has broad experience, low level to high level, from developing graphics drivers, software for digital satellite television, making video games and 3D game engines, to photography, video recording and editing. You can find Adam on LinkedIn, Twitter, and his personal blog asawicki.info.

Intro

Floating-point numbers are a great invention in computer science, but they can also be tricky and troublesome to use correctly. I’ve written about them already by publishing the Floating-Point Formats Cheatsheet and the presentation “Pitfalls of floating-point numbers” (“Pułapki liczb zmiennoprzecinkowych”; the slides are in Polish). Last year I was preparing for a more extensive talk about this topic, but it got cancelled, like pretty much everything in these hard times of the COVID-19 pandemic. So in this post, I would like to approach this topic from a different angle.

A programmer can use floating-point numbers on different levels of understanding. A beginner uses them trusting that they are infinitely capable and precise, which leads to problems. An intermediate programmer knows that they have some limitations and avoids the problems by following good practices. An advanced programmer understands what is really going on inside these numbers and can use them with full awareness of what to expect from them. This post may help you jump from the second level to the third. Commonly adopted good practices are called “myths” here, but they are actually just generalizations and simplifications. They can be useful for avoiding errors until you understand what is true and what is false about them on a deeper level.

1. They are not exact

It is not true that 2.0 + 2.0 can give 3.99999. It will always be 4.0. Floating-point numbers are exact to the extent of their limited range and precision. If you assign some constant value to a floating-point variable, you can safely compare it with the same value later, even using the discouraged operator ==, as long as the variable is not a result of some calculations. Imprecision doesn’t come out of nowhere.
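As a minimal sketch of this point, the following C++ helpers (the function names are made up for illustration) contrast values that stay exact with a value that has gone through rounding in a calculation. It assumes a typical platform where float expressions are evaluated in 32-bit precision.

```cpp
// Hypothetical helpers for illustration; not part of the original post.

// Small integers are exactly representable, so their sum is exact too.
bool TwoPlusTwoIsFour()
{
    float sum = 2.0f + 2.0f;
    return sum == 4.0f;
}

// A stored constant compares equal to the same constant later,
// because no calculation has touched it in between.
bool ConstantSurvivesStorage()
{
    float stored = 0.1f;    // rounded once, when the literal is converted
    return stored == 0.1f;  // same literal, same rounding, same bits
}

// Once calculations are involved, each step may round,
// and the accumulated result may no longer match the expected constant.
bool SumOfTenTenthsIsOne()
{
    float sum = 0.0f;
    for (int i = 0; i < 10; ++i)
        sum += 0.1f;
    return sum == 1.0f;
}
```

The first two functions return true. The third returns false under the stated assumption, because ten rounded additions of 0.1f do not land exactly on 1.0f.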

Instead of using an integer loop iterator and converting it to float every time:

for(size_t i = 0; i < count; ++i)
{
    float f = (float)i;
    // Use f
}

You can do this, which avoids the conversion in every iteration and can result in more efficient code:

for(float f = 0.f; f < (float)count; f += 1.f)
{
    // Use f
}

It is true, however, that your numbers may not look exactly as expected because:

Some fractions cannot be represented exactly, even some simple ones like decimal 0.1, which is binary 0.000110011… This is because we humans normally use a decimal system, while floating-point numbers, like other numbers inside computers, use a binary system, which is a different base.
There is a limited range of integer numbers that can be represented exactly. For 32-bit floats, it extends only up to 16,777,216 (2^24). Above that, numbers start “jumping” every 2, then every 4, etc. So it is not a good idea to use floating-point numbers to represent file sizes if your files are bigger than 16 MB. If the count in the example above were greater than about 16M, it would cause an infinite loop.
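Both limitations can be checked directly. The sketch below uses hypothetical helper names; the volatile qualifier is only there to force the intermediate result to be rounded to a 32-bit float.

```cpp
// Hypothetical helpers for illustration; not part of the original post.

// 0.1 is rounded when stored, and float and double round it differently,
// so the float value widened to double is not the double value of 0.1.
bool TenthDiffersBetweenFloatAndDouble()
{
    return static_cast<double>(0.1f) != 0.1;
}

// Returns true if adding 1 to the given float leaves it unchanged,
// which is what makes a loop counter like "f += 1.f" stop advancing.
bool FloatCounterGetsStuck(float start)
{
    volatile float next = start + 1.0f;  // forced to 32-bit float precision
    return next == start;
}

// Converting a large integer to float may silently change its value.
bool IntegerSurvivesFloatConversion(unsigned int n)
{
    volatile float f = static_cast<float>(n);
    return static_cast<double>(f) == static_cast<double>(n);
}
```

FloatCounterGetsStuck(16777215.0f) returns false, while FloatCounterGetsStuck(16777216.0f) returns true, because 16,777,217 is not representable and the sum rounds back down. Likewise, IntegerSurvivesFloatConversion(16777217u) returns false.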

64-bit “double”, however, represents integers exactly up to 9,007,199,254,740,992, so it should be enough for most applications. No wonder that some scripting languages do just fine while supporting only “double” floating-point numbers and no integers at all.
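The same kind of check works for double. This is a small sketch with a made-up helper name that round-trips a 64-bit integer through a double.

```cpp
#include <cstdint>

// Hypothetical helper for illustration; not part of the original post.
// Returns true if the integer n can be stored in a double without change.
bool DoubleHoldsExactly(std::int64_t n)
{
    volatile double d = static_cast<double>(n);  // may round above 2^53
    return static_cast<std::int64_t>(d) == n;
}
```

DoubleHoldsExactly(9007199254740992) returns true, and DoubleHoldsExactly(9007199254740993) returns false, since the first odd integer above 2^53 has to be rounded to a neighboring even one.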

2. They are non-deterministic

It is not true that cosmic radiation will flip the least significant bit at random. Random number generators are also not involved. If you call the same function containing floating-point calculations with the same input, you will get the same output. It is fully deterministic, like other computing. (Note: When old x87 FPU instructions are generated rather than newer SSE ones, this can be really non-deterministic, and even a task switch may alter your numbers. See this tweet.)

It is true, however, that you may observe different results because:

Compiler optimizations can influence the result. If you implement two versions of your formula, similar but not exactly the same, the compiler may, for example, optimize (a * b + c) from doing MUL + ADD to FMA (fused multiply-add) instruction, which does the 3-argument operation in one step. FMA has higher precision but can then give a different result than two separate instructions.
You may observe different results on different platforms, e.g. AMD vs Intel CPU or AMD vs NVIDIA GPU. This is because the floating-point standard (IEEE 754) requires correctly rounded results only for basic operations like addition, multiplication, division, and square root. For functions like sin, cos, etc., implementations are free to differ, so the exact result may vary in the least significant bits.
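The FMA effect can be reproduced by hand. In this sketch (the function names are hypothetical), std::fma from <cmath> stands in for what the compiler might emit, and a volatile temporary forces the separate multiplication to be rounded before the addition.

```cpp
#include <cmath>

// Hypothetical helpers for illustration; not part of the original post.

// Multiply, round the product to float, then add: two rounding steps.
float MulThenAdd(float a, float b, float c)
{
    volatile float product = a * b;  // volatile prevents fusing with the add
    return product + c;
}

// Fused multiply-add: the product is kept exact, only the final sum is rounded.
float FusedMulAdd(float a, float b, float c)
{
    return std::fma(a, b, c);
}
```

For a = b = 1 + 2^-12 and c = -(1 + 2^-11), the exact product is 1 + 2^-11 + 2^-24. MulThenAdd rounds away the last term and returns 0, while FusedMulAdd returns 2^-24 (about 5.96e-8). Both results are deterministic; they just come from different instruction sequences.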

I heard a story of a developer who tried to calculate hashes from the results of his floating-point calculations in a distributed system and discovered that records with what was supposed to be the same data had different hashes on different machines.

I once had to investigate a user complaint about the following piece of shader code (in GLSL). The user said that on AMD graphics cards, for uv.x higher than 306, it always returns black (zero).

vec4 fragColor = vec4(vec3(fract(sin(uv.x * 2300.0 * 12000.0))), 1.0);

I noticed that the value passed to the sine function is very high. The constant factor 2300.0 * 12000.0 alone is 27,600,000, so for uv.x = 306 the argument is about 8.4 billion. If we recall from math classes that sine oscillates between -1 and 1 with a period of 2*PI ≈ 6.283185, and we take into consideration that above 16,777,216 a 32-bit float cannot represent all integer numbers exactly but starts jumping every 2, then every 4, etc., we can conclude that we do not have enough precision to know whether our result should be -1, 1, or anything in between. It is effectively undefined.
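To put numbers on this, the distance between adjacent 32-bit floats at a given magnitude can be measured with std::nextafter. The helper name below is made up for illustration.

```cpp
#include <cmath>
#include <limits>

// Hypothetical helper for illustration; not part of the original post.
// Returns the gap between x and the next representable float above it.
float SpacingAt(float x)
{
    float next = std::nextafter(x, std::numeric_limits<float>::infinity());
    return next - x;
}
```

SpacingAt(27600000.0f) returns 2, and SpacingAt(306.0f * 2300.0f * 12000.0f), the argument produced by the shader expression for uv.x = 306, returns 512. With a sine period of about 6.28, neighboring representable inputs at that magnitude are many full periods apart, so the result carries no usable information about the intended angle.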

I then asked the user what he was trying to achieve with this code, as the result is totally random. He said it is indeed supposed to be… a random number generator. The problem is that the output being always 0 is as valid as any other. Random-looking numbers are generated on NVIDIA cards and not on AMD because the sine instruction on AMD GPU architectures actually has a period of 1, not 2*PI. But it is still fully deterministic with regard to the input value. It just returns different results on different platforms.

3. NaN and INF are an indication of an error

It is true that if you don’t expect them, their appearance may indicate an error, either in your formulas or in input data (e.g. numbers that are very large, very small and close to zero, or just garbage binary data). It is also true that they can cause trouble as they propagate through calculations, e.g. almost every arithmetic operation with NaN returns NaN.
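A short sketch of this propagation follows, with hypothetical function names. It assumes IEEE 754 semantics with default compiler settings; options such as -ffast-math may change the behavior.

```cpp
#include <cmath>
#include <limits>

// Hypothetical helpers for illustration; not part of the original post.

// Arithmetic on NaN yields NaN, even when multiplying by zero afterwards.
bool NanPropagatesThroughArithmetic()
{
    volatile float nan = std::numeric_limits<float>::quiet_NaN();
    float result = (nan + 1.0f) * 0.0f;
    return std::isnan(result);
}

// NaN compares unequal to everything, including itself.
bool NanEqualsItself()
{
    volatile float nan = std::numeric_limits<float>::quiet_NaN();
    float copy = nan;
    return copy == copy;
}

// Some operations on infinities have no meaningful result and produce NaN.
bool InfinityMinusInfinityIsNan()
{
    volatile float inf = std::numeric_limits<float>::infinity();
    float difference = inf - inf;
    return std::isnan(difference);
}
```

Under these assumptions, NanPropagatesThroughArithmetic and InfinityMinusInfinityIsNan return true, and NanEqualsItself returns false, which is why std::isnan rather than == should be used to detect NaN.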

However, it is not true that these special values are just a means of returning error or that they are not useful. They are perfectly valid special cases of the floating-point representation and have clearly defined behavior. For example, -INF is smaller and +INF is larger than any finite number. You can use this property to implement the following function with a clearly documented interface:

#include <cstddef> // size_t
#include <limits>

// Finds and returns the maximum number from the given array.
// For an empty array returns -INF.
float CalculateMax(const float* a, size_t count)
{
    float max = -std::numeric_limits<float>::infinity();
    for(size_t i = 0; i < count; ++i)
        if(a[i] > max)
            max = a[i];
    return max;
}
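One benefit of this interface is that -INF acts as a neutral element when partial results are combined. The sketch below repeats CalculateMax so that it compiles on its own and adds a hypothetical CalculateMaxOfTwoRanges helper that is not part of the original post.

```cpp
#include <algorithm>
#include <cstddef>
#include <limits>

// Same function as in the post, repeated to keep the snippet self-contained.
float CalculateMax(const float* a, std::size_t count)
{
    float max = -std::numeric_limits<float>::infinity();
    for (std::size_t i = 0; i < count; ++i)
        if (a[i] > max)
            max = a[i];
    return max;
}

// Hypothetical helper: maximum over two separate ranges.
// An empty range contributes -INF, which never wins against a finite value,
// so no special case for empty input is needed.
float CalculateMaxOfTwoRanges(const float* a, std::size_t countA,
                              const float* b, std::size_t countB)
{
    return std::max(CalculateMax(a, countA), CalculateMax(b, countB));
}
```

For an array holding 1, 5, and 3 combined with an empty second range, the result is 5. If both ranges are empty, the result is -INF, consistent with the documented behavior of CalculateMax.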

Summary

As you can see, common beliefs about floating-point numbers (that they are not exact, that they are non-deterministic, or that NaN and INF are an indication of an error) are generalizations and simplifications that can help to avoid errors, but they don’t tell the full story. To really understand what’s going on at a deeper level:

Keep in mind which values in your program are just input data or constants and which are results of some calculations.
Know the capabilities and limitations of floating-point types: their maximum range, minimum possible number, precision in terms of binary or decimal places, and maximum integer represented precisely.
Learn about how floating-point numbers are stored, bit by bit.
Learn about special values - INF, NaN, positive and negative zero, denormals. Understand how they behave in computations.
Take a look at the assembly generated by the compiler to see how the CPU or GPU operates on your numbers.
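As a starting point for looking at the stored bits, here is a minimal sketch that splits a 32-bit float into its sign, exponent, and mantissa fields. The type and function names are made up for illustration, and the code assumes that float is a 32-bit IEEE 754 value.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical names for illustration; not part of the original post.
struct FloatFields
{
    std::uint32_t sign;      // 1 bit
    std::uint32_t exponent;  // 8 bits, biased by 127
    std::uint32_t mantissa;  // 23 bits, without the implicit leading 1
};

// Copies the raw bits of a float into an integer without undefined behavior.
std::uint32_t FloatBits(float f)
{
    static_assert(sizeof(float) == sizeof(std::uint32_t), "expects 32-bit float");
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);
    return bits;
}

// Splits the raw bits into the three fields of the format.
FloatFields Decompose(float f)
{
    const std::uint32_t bits = FloatBits(f);
    return { bits >> 31, (bits >> 23) & 0xFFu, bits & 0x7FFFFFu };
}
```

For example, FloatBits(1.0f) is 0x3F800000 (sign 0, exponent 127, mantissa 0), and FloatBits(-0.0f) is 0x80000000, even though -0.0f == 0.0f evaluates to true. Decompose(0.1f) gives exponent 123 and mantissa 0x4CCCCD, which shows the repeating binary pattern of 0.1 cut off and rounded.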