What I Wish Everyone Knew About Floating Point Numbers


Floating point numbers, or floats for short, are a number system used to represent very small and very large numbers, including fractions and negative values. Unlike small integers like 1 and 2, most of these real-world quantities cannot be represented exactly. The most common way to deal with this is through rounding: the computer rounds each result to the nearest value it can store, so the answer usually looks correct, but once you care about exactly what is going on in your calculations (or exactly what you want from them) it can get tricky. Numeric variables that can hold a fractional part are known as floating point variables. They are frequently used in games, scientific work and engineering fields (where you need well-controlled accuracy, such as in models for aircraft). Floating point variables are also often regarded as being slow compared with integer arithmetic.

However, this is not necessarily true: most modern processors have dedicated floating point hardware, so slow float arithmetic is in fact quite uncommon. This article will explain how floating point numbers work: why they cannot represent most values exactly, what you can do to make them faster, and what special features they have that make them a useful tool for certain purposes. We will also explore the so-called “NaN” or “Not-a-Number” value before discussing the catch: there are some serious pitfalls when dealing with floating point numbers, which is why some systems avoid them altogether. Despite these difficulties and complexity, floating point numbers are still widely used for many applications where bit-for-bit exactness is not important (e.g. games).


1. Why can’t floating point numbers represent values exactly?

Floating point numbers cannot represent every value exactly in hardware (i.e., on the chip). This is because a computer has only a finite amount of “storage” for each number, and it is desirable to use as little of this storage as possible to keep calculations fast. How much precision do you need? If you’re happy with two decimal places of accuracy, almost any float format will do; for applications in engineering and scientific fields, much greater accuracy is often required.

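The “storage” in question is the fixed number of bits given to each value, whether it sits in RAM or in a processor register. Common formats use 32 bits (single precision) or 64 bits (double precision). Because only a finite number of these bits hold the “precision” of the number, there is a limit on how many significant digits you can put into your floating point numbers: roughly 7 decimal digits for 32-bit floats and 15 to 17 for 64-bit floats.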
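A small Python sketch can make the finite-precision limit visible. It assumes Python's float is a 64-bit IEEE 754 double, which is the case on common platforms; the variable names are only illustrative.

```python
from decimal import Decimal
import math

# Decimal(float) reveals the exact value the 64-bit float actually stores.
stored_tenth = Decimal(0.1)
print(stored_tenth)  # slightly more than 0.1, not 0.1 itself

# Each operand is already rounded, so the sum is not exactly 0.3.
total = 0.1 + 0.2
print(total == 0.3)              # False
print(math.isclose(total, 0.3))  # True: compare with a tolerance instead
```

Comparing with a tolerance, as math.isclose does, is usually a safer habit than testing floats for exact equality.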

2. Why can’t we always get the exact answer?

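Floating point numbers are not meant to be exact. When a value is converted to a float, or when two floats are combined, the result is rounded to the nearest representable value, so there can be a small loss of precision. What does this mean? Some values survive untouched: the number 245.5 is 11110101.1 in binary, which fits easily into the available bits, so the computer stores it exactly. A value like 0.1, on the other hand, has an infinitely repeating binary expansion and must be rounded. Our old friend NaN (Not-a-Number) is a different matter. It is not a rounding artifact but a special value that represents an undefined result, such as 0/0 or infinity minus infinity. A NaN is stored with every exponent bit set to 1 and a non-zero fraction.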
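The difference between an exactly stored value and NaN can be checked with a short Python sketch, again assuming 64-bit IEEE 754 floats. The names below are illustrative only.

```python
import math

# 245.5 has a short binary expansion, so it is stored exactly as 491/2.
exact = 245.5
print(exact.as_integer_ratio())  # (491, 2)

# Infinity minus infinity is undefined, so the result is NaN.
nan_result = math.inf - math.inf
print(math.isnan(nan_result))    # True

# NaN compares unequal to everything, including itself.
print(nan_result == nan_result)  # False
```

Because NaN never equals itself, math.isnan is the reliable way to test for it.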

3. What are the special features of floating-point numbers?

Floating point numbers have some special features which make them useful for certain applications. They use base 2 instead of base 10, just like the integers inside the computer. This means the number 5 is stored as the 3 binary digits 101, or 1.01 × 2^2 in binary scientific notation. Why does this matter? Because the hardware itself works in binary, so base-2 calculations are fast. The trade-off is that many simple decimal fractions, such as 0.1, have no finite base-2 expansion and must be rounded.
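As a rough illustration of the base-2 representation, Python's math.frexp splits a float into a mantissa and a power of 2. Note that frexp normalizes the mantissa to the range 0.5 to 1, which differs by one exponent step from the 1.xxx form used in the stored format; the variable names are illustrative.

```python
import math

# 5.0 == mantissa * 2**exponent, with 0.5 <= mantissa < 1
mantissa, exponent = math.frexp(5.0)
print(mantissa, exponent)  # 0.625 3

# The integer 5 written in base 2
print(bin(5))  # 0b101

# Every finite float is an integer divided by a power of 2,
# which is why 0.1 can only be approximated.
numerator, denominator = (0.1).as_integer_ratio()
print(denominator & (denominator - 1) == 0)  # True: denominator is a power of 2
```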

They also have a hidden “1” bit which is not stored. Think about it this way: in binary scientific notation, the leading digit of any non-zero normalized number is always 1, so there is no need to spend a bit on it. The format simply assumes it is there, which buys one extra bit of precision for free. This makes floating point numbers a little more complicated, but more accurate. Another special feature of floating point numbers is that they can represent infinities, both positive and negative.
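Both features can be observed from Python, assuming its float is a 64-bit IEEE 754 double. That format stores 52 fraction bits, yet reports 53 bits of precision because of the hidden leading 1. The names below are illustrative.

```python
import math
import sys

# 52 stored fraction bits plus the hidden leading 1 give 53 bits of precision.
print(sys.float_info.mant_dig)  # 53

# Both signs of infinity exist.
pos_inf = math.inf
neg_inf = -math.inf
print(pos_inf > 1e308, neg_inf < -1e308)  # True True

# Multiplying past the largest finite value overflows to infinity.
overflow = 1e308 * 10
print(overflow == pos_inf)  # True
```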

4. What’s in a number?

The reason real quantities can’t all be represented exactly is that the hardware only has a finite number of bits to work with. To make the best of this, computer scientists have come up with a standard system for representing these quantities, IEEE 754, whose 32-bit format is called “binary32” or “float”. In this number scheme, you have:

A sign bit (bit 31, the most significant bit) that is either 0 or 1. The sign bit signifies whether the number is positive (0) or negative (1).

An exponent (bits 23-30) that indicates the power of 2 by which the rest of the number is multiplied. It is stored with a bias of 127, so the stored value is the true exponent plus 127.

A fraction (bits 0-22) that holds the binary digits that follow the hidden leading 1.

For example, 245.5 is 11110101.1 in binary, or 1.11101011 × 2^7. The sign bit is 0, the stored exponent is 7 + 127 = 134 (10000110), and the fraction is 11101011 followed by fifteen 0’s.
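The field layout can be explored with a minimal Python sketch. It assumes the IEEE 754 binary32 layout of 1 sign bit, 8 exponent bits with a bias of 127, and 23 fraction bits; decode_binary32 is a hypothetical helper written for this illustration, not a library function.

```python
import struct

def decode_binary32(value):
    """Round value to binary32 and return (sign, biased exponent, fraction)."""
    # Pack as a big-endian 32-bit float, then reread the same bytes as an integer.
    (bits,) = struct.unpack(">I", struct.pack(">f", value))
    sign = bits >> 31               # top bit
    exponent = (bits >> 23) & 0xFF  # next 8 bits, biased by 127
    fraction = bits & 0x7FFFFF      # low 23 bits
    return sign, exponent, fraction

sign, exponent, fraction = decode_binary32(245.5)
print(sign, exponent - 127, format(fraction, "023b"))
# 0 7 11101011000000000000000

# Rebuild the value, restoring the hidden leading 1 (normal numbers only).
rebuilt = (-1) ** sign * (1 + fraction / 2**23) * 2 ** (exponent - 127)
print(rebuilt)  # 245.5
```

The rebuild formula only covers normal numbers; zeros, subnormals, infinities and NaN use reserved exponent values and would need separate handling.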

