### Impossible to represent 1.001 as a double?

Hi all,

If I assign 1.001L to a double, then inspect it with the debugger, it has a value 1.00099999999

Similarly if I sscanf a string of "1.001" with format "%ld" or even "%.3ld", I still get 1.0009999999

I understand that representing decimal values with binary digits has its limitations, but also expected 1.001 could be accurately represented as a sign, mantissa and exponent with binary values?

Is there something I am missing here? I feel really dumb at the moment!

Cheers
JohnO

PS: using VisualStudio2015 on Win10 64bit
The exact representation may not be the issue, the implementation may vary.

I think it's a matter of the display format. If you want to see the number displayed to 3 decimal places, then do that.
 ... but also expected 1.001 could be accurately represented as a sign, mantissa and exponent with binary values?

https://en.wikipedia.org/wiki/Floating-point_arithmetic

Well FP values are represented in part by negative powers of 2, and not all values can be represented exactly, even such numbers as 0.2 (only 1 dp). So this leads to all kinds of subtleties. If you print out the values with 17 significant figures, you will see the difference is relatively small, typically in the 16th or 17th sf. At larger scales the difference is numerically larger: for example at 1e22, the difference between adjacent values is about 2 million.

The `double` type can exactly represent every whole number up to 2^53 (about 9e15), which covers all 32-bit integers.

In real life, the choice of units helps, for example astrophysicists often use light years as a unit of distance rather than metres. This is OK because I gather that they can only measure with an accuracy of 1 or 2 l.y. so that is roughly 1 or 2 parts per 10 billion. So this means a `double` type is quite convenient.

Here is a question to make one think :

Consider what values of a circle radius would still give a reasonable answer for the circle area: consider both at lower and upper limits of double values? It is easy to see how things become quite chunky at both ends of the scale. Also consider the practicalities of these situations.

C++ does have exact decimals, though not standardized yet:
https://stackoverflow.com/questions/14096026/c-decimal-data-types

There are also multi precision libraries:
https://stackoverflow.com/questions/2568446/the-best-cross-platform-portable-arbitrary-precision-math-library

So what does all this mean for writing code with the built in FP types (float, double, long double)?

* Prefer `double` over float. The precision of float (6 sf) is easily exceeded. Only use `float` if some library requires it - often graphic libraries. This is why double is the default for FP literals.
* Use relational operators < , <= etc, not equality.
* one can write "equality" operators. These are equality within some precision value. That is: Is the absolute value of the difference between 2 values less than the precision value?
* don't use FP values in loop conditions. Instead work out how many times to loop as an integer, and use that in the loop condition.
* There are many subtleties with FP, these are only some of them.

Good Luck !!

Don't forget one's searching foo - there's a lot to be found on the internet :+)
Thanks man, plenty to think about...

But are you surprised that something like this:

double d = 1.001L;

when inspecting d with the debugger shows 1.0009999999999999?

Yet

float f = 1.001;

when inspecting f with debugger shows 1.0010000?

Double is somehow worse than float?

I am starting to think it is Visual Studio messing with me!

In my application I am receiving a number "1.001" represented as a string and I need to convert it into a floating point number. The range of possible values is from 0.001 to 999.999 with 3 digit maximum precision.

Maybe I will have to use an arbitrary precision math library?
 Double is somehow worse than float?

With this:

 float f = 1.001;

the literal 1.001 is a `double`, and the compiler does an implicit conversion to float.

this is different:

 float f = 1.001f;

here the literal is explicitly a float value.

But the main thing is how the float and double differ in their representation. `float` only has 6 or 7 sf, and this might mean some rounding up, for particular values, and not for others. It might be instructive for you to write some code that prints values of floats and doubles at different scales and step values. For example print all the values from 1.000 to 2.000 as float with 7 sf, and the same for double with 17 sf. Try some other scales like 30.000 to 31.000 say, and different steps like 1e-6

 In my application I am receiving a number "1.001" represented as a string and I need to convert it into a floating point number. The range of possible values is from 0.001 to 999.999 with 3 digit maximum precision.

So I guess it's an assignment, so you are doing it manually? I would follow the procedure that IEEE uses, as best you can. The first thing to do is work out how many bits you need for the mantissa and the exponent. Hopefully you can use std::bitset to store them, otherwise one of the integer types with bit fields.

 Maybe I will have to use an arbitrary precision math library?

Apart from what I mentioned above, I guess that would be cheating for the assignment.
Not sure what level your assignment actually is, maybe it's something much simpler:

0.973 programmatically becomes 9.0 / 10.0 + 7.0 / 100.0 + 3.0 / 1000.0
Is this only about how your debugger returns a value off by something negligible when the code internally works fine?

I don't see why you have to worry about something being off by a zillion-th, when the result is just fine.

(If this isn't only in the debugger, show your code please, because in that case, your scenario doesn't make any sense unless there is a poorly made conversion going on).

If you still don't understand, this is pretty much how printf works: `%f` defaults to 6 digits after the decimal point, which is about all the significant digits a float holds. So the more digits there are to the left of the decimal point, the fewer reliable digits remain on the right, and there are many cases where the printed digits expose the rounding error even for a small value, for example:

```
#include <stdio.h>

// from https://docs.microsoft.com/en-us/cpp/c-runtime-library/reference/scanf-scanf-l-wscanf-wscanf-l?view=vs-2015#example
int main()
{
    float fp;
    sscanf("98.001", "%f", &fp);
    printf("The number of fields input is %f\n", fp); // prints 98.000999
}
```

And this is fixable by doing this:
```
// needs <float.h>; lower the precision further if the number is larger.
printf("The number of fields input is %.*f\n", FLT_DIG - 1, fp);
```

But this is a very limited solution: the larger the integer part, the more precision you have to subtract, and whether the printed value looks rounded becomes hit and miss.

Or you could use a double: the above code will work (if you change sscanf to use "%lf"), and it should always work much better than float, since the integer part can be much larger while 6 decimal places are still accurate.

Overall, I assume the debugger also does something similar, and decided to make the precision maximum, printf style, when it really shouldn't (but perhaps it's because of performance reasons, legacy code, debugging floating point number purposes, who knows).

You can also test this by setting the precision to `DBL_DIG + 1` (16 decimal places) instead of subtracting 1: now 1.001 will show repeating nines, just as float shows repeating nines for 98.001. But note that a precision of 16 is quite large: the double agrees with 1.001 for about 15 decimal places before the error shows, so to say double is worse than float is a long shot.

As the posters above have already said, float has only 6 or 7 significant digits. It is a coincidence that the float nearest to 1.001 happens to round back to 1.0010000 at the precision the debugger shows, and it is equally a coincidence that 1.001 is an unlucky number for double: displayed at the maximum realistic precision, as your debugger does, the stored value shows up as 1.0009999999999999. Even if you round it for display, any further operations still use the original, off-by-a-zillionth value, simply because that is the only way it can be represented in memory.

You should fix your typo of `%ld`, because that reads a long integer (`long int`); the specifier for a double is `%lf` (unless there is something I don't know about VS specifiers).

And like other posters have said, you cannot rely on exact comparisons; you must compare within a tolerance, so you should be familiar with something called an "epsilon". If you restrict yourself to values from 0 to 1.0, an absolute tolerance is enough and the printed values will look right (if you print them with a suitable precision).

And remember that long double has the same representation as double in Visual Studio. As an extra, the same goes for long and int (both 32 bits); long long is 64 bits.
 I understand that representing decimal values with binary digits has its limitations, but also expected 1.001 could be accurately represented as a sign, mantissa and exponent with binary values?

Nope. Very very very few exact values can be represented. 1.001 is not one of them.

Take a look at how 1.001 is represented as a 32 bit float: https://www.h-schmidt.net/FloatConverter/

Experiment with what numbers can be exactly represented. You'll quickly get a feel for it.

The value actually stored is 1.00100004673004150390625 for a 32 bit float.

Similarly for a double.
Since 64-bit machines became common, using a base-10 offset has been viable if you actually NEED the precision and you have a known, small number of digits.

Here, for example, just store 1001 as an integer and understand that the value is shifted 3 decimal places. This may or may not be useful to your needs, but it often is (it's great for money, for example, where 3 decimal places is about all you need).
kryptonjohn wrote:

double d = 1.001L;
when inspecting d with the debugger shows 1.0009999999999999?
Yet
float f = 1.001;
when inspecting f with debugger shows 1.0010000?
Double is somehow worse than float?

The debugger is not showing the actual numbers, it's rounding them.

As already pointed out, the decimal value 1.001 cannot be represented exactly in binary floating point.

it lies between these two valid double values:
1.000999999999999889865875957184471189975738525390625 (hex 0x1.004189374bc6ap+0)
1.0010000000000001119104808822157792747020721435546875 (hex 0x1.004189374bc6bp+0)

It is slightly closer to the first one, so when you write "1.001" in code, the compiler writes 1.000999999999999889865875957184471189975738525390625 instead.

when you wrote `float f = 1.001`, you asked the compiler to convert 1.000999999999999889865875957184471189975738525390625 to float: that value is not a valid float: it lies between these two float values:
1.00100004673004150390625 (hex 0x1.00418ap+0)
1.000999927520751953125 (hex 0x1.004188p+0)

It is slightly closer to the first one, so the compiler stored "1.00100004673004150390625f" in the variable f (incidentally, it's also what the compiler writes when you write "1.001f")
Thanks all. I like jonnin's idea... might just roll my own numeric class storing the value in a long with two values for decimal position and decimal precision. That will make adding all the operators really simple.

Had a good smile when someone suggested this was for an assignment. This is actually a change to some software I originally wrote in 1998 (!) and the customer is still using it. I've come back to do some upgrades to it after all this time.

You may be able to use a bitfield for that more efficiently. I don't really know what you want to do, but there are a bunch of efficient options. If it's 20-year-old software, efficiency is probably unimportant now, since your target machine is literally 100 times faster than it was.

Whatever you are doing has probably been done before. If it gets complicated, look online... your suggestion is something I did way back called a 'scaled integer'. Databases use these ideas a lot as well.

 In my application I am receiving a number "1.001" represented as a string and I need to convert it into a floating point number. The range of possible values is from 0.001 to 999.999 with 3 digit maximum precision.

 Had a good smile when someone suggested this was for an assignment. This is actually a change to some software I originally wrote in 1998 (!) and the customer is still using it. I've come back to do some upgrades to it after all this time.

Well, why can't you just use the existing facilities to convert string to double? If you only want to display to 3dp, then why not do that with `setprecision` ? Hence my guess it was an assignment.

What do you actually want to do in detail?