To Float or Not to Float?

Have you ever wondered how accurately a floating point data type stores your number? You may be alarmed to find that the stored value is not exactly the number you tried to put in it. How far off it can be isn’t necessarily an obvious calculation. So for your reading pleasure, I took some time to figure out two easy paths to the answer.

1. The ‘Simple’ Case Method: If your maximum number’s absolute value is exactly a power of 2 (e.g. 64), just divide it by 2^24 for float, or 2^53 for double. The answer is your floating point approximation’s resolution (the smallest increment for that approximation) for values up to that maximum, e.g. 64 / 2^24 = 2^-18, or about 0.0000038 for float. The same idea can be applied to any size of floating point number. See the mantissa description below for scaling information.

2. Lookup Table Method: If the number’s absolute value cannot be exactly represented in 2^n form, the below look-up table may help you quickly figure out what 2^n exponent you should be using for determining the number’s resolution. If you want to cut to the chase, skip to the table usage instructions!
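The simple case (method 1) is easy to sanity-check in a few lines of Python. This is only a sketch: the helper names `simple_case_resolution` and `float32_below` are my own, and the second one assumes a positive, normal 32-bit float so that subtracting 1 from its bit pattern gives the next float down. It divides 64 by 2^24 and 2^53, then measures the real gap just below 64 in a 32-bit float.

```python
import struct

def simple_case_resolution(max_abs_value, bits_of_resolution):
    """Resolution below a power-of-two maximum: max / 2^bits."""
    return max_abs_value / 2 ** bits_of_resolution

def float32_below(x):
    """Largest 32-bit float strictly below a positive 32-bit float x."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    (previous,) = struct.unpack("<f", struct.pack("<I", bits - 1))
    return previous

float_res = simple_case_resolution(64.0, 24)    # 64 / 2^24 = 2^-18
double_res = simple_case_resolution(64.0, 53)   # 64 / 2^53 = 2^-47
gap_below_64 = 64.0 - float32_below(64.0)       # measured, not calculated

print(float_res, double_res, gap_below_64)
```

The calculated float resolution and the measured gap should come out identical, which is a decent hint that the divide-by-2^24 shortcut matches what the hardware format actually does.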

Why can figuring out resolution of float or double be challenging?

First of all, the resolution of a floating point number can ‘float’ around. Different numbers give you different amounts of granularity.

An important thing to understand is that a floating point number is just a stored approximation of an original number. When there are enough significant digits to be very, very close to your number, the approximation becomes ‘exact enough’. However, for a large number with decimal places, a float may round away some of the decimal places (e.g. 123456.1234567 stored in a 32 bit float becomes 123456.125).
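If you want to see the approximation for yourself, here is a minimal Python sketch. Python’s own floats are doubles, so the hypothetical helper `to_float32` (my name, not a library function) pushes the value through the standard `struct` module to round it to a 32 bit float and back.

```python
import struct

def to_float32(x):
    """Round a Python float (a double) to the nearest 32-bit float."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

original = 123456.1234567
stored = to_float32(original)

print(stored)              # the value a 32-bit float actually holds
print(stored - original)   # the approximation error
```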

‘Resolution’ is the smallest increment you can represent (resolve exactly) in your floating point number.

The larger the number, the worse the float’s resolution gets. If you can define the worst case maximum value of your variable, you can define the worst case resolution of a floating point approximation.
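To make the ‘larger number, worse resolution’ point concrete, the sketch below measures the gap between a 32 bit float and the next one above it at a few magnitudes. The helper `float32_spacing` is my own illustration and assumes a positive, finite input; it works by adding 1 to the float’s bit pattern.

```python
import struct

def float32_spacing(x):
    """Gap between positive x (as a 32-bit float) and the next float above it."""
    packed = struct.pack("<f", x)
    (bits,) = struct.unpack("<I", packed)
    (current,) = struct.unpack("<f", packed)
    (following,) = struct.unpack("<f", struct.pack("<I", bits + 1))
    return following - current

for value in (1.0, 100.0, 10000.0, 1000000.0):
    print(value, float32_spacing(value))
```

Each jump in magnitude costs resolution: around 1.0 the steps are about 0.00000012, while around 1,000,000 they are 0.0625.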

The main hurdle to understanding a floating point approximation’s resolution involves our normal thinking in decimal (base 10). The resolution of a floating point number on the other hand is discovered in binary (base 2). In order to convert a decimal floating point number to its floating point bits, there are multiple steps which will require near complete understanding of the format. There are many other articles about doing that, so I’ll just cover my look-up table method.

Basics: The parts of an IEEE 754 Floating Point Number

There are three main parts to an IEEE floating point number. They are:

1. Sign (1 bit)

0 is positive, 1 is negative

2. Mantissa (Float: 23 bits, Double: 52 bits)

These are the most significant bits of your floating point number, excluding the actual MSB. There are three things that floating point approximation does to generate a mantissa and exponent.

1. Convert decimal floating point into “scientific notation style” binary floating point

e.g. 50.5 = 0b110010.1 -> 0b1.100101 * 2^5

2. The MSB 1 gets omitted, and the mantissa stores only “0b100101”
- Note that this is because the MSB is always a 1, so it can be omitted without loss of accuracy

3. The exponent of 5 (0b101) gets stored as well
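The three steps above can be sketched in Python using `math.frexp`. The function name `normalize` is my own, and this is a simplified illustration: it assumes a positive, normal number and ignores the corner case where rounding would carry into the next exponent.

```python
import math

def normalize(x, mantissa_bits=23):
    """Return (exponent, stored mantissa bit string) for a positive number x."""
    fraction, exp = math.frexp(x)   # x = fraction * 2**exp, with 0.5 <= fraction < 1
    significand = fraction * 2      # step 1: shift into 1.xxx * 2^n form
    exponent = exp - 1              # step 3: the exponent that goes with 1.xxx
    # step 2: drop the implied leading 1 and keep the bits after the point
    stored = round((significand - 1) * 2 ** mantissa_bits)
    return exponent, format(stored, "0{}b".format(mantissa_bits))

exponent, mantissa = normalize(50.5)
print(exponent, mantissa)
```

For 50.5 the bit string starts with 100101, followed by zeros that pad out the rest of the 23 bit field.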

To get an idea of how many different numbers can be represented, a 32 bit float allocates 23 bits for the mantissa, and with the implied MSB, it can represent up to 24 bits of resolution.

On the other hand, a 64 bit double allocates 52 bits for the mantissa, and with the implied MSB, it can represent up to 53 bits of resolution.

So how many bits of resolution do we have if we’ve got 234 bits of the mantissa? Well, add 1 to that, and we’ve got 235 bits of resolution! This is the primary number we will be using to determine what your resolution is.
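A quick way to convince yourself of the 24 and 53 bit figures is to try storing integers on either side of 2^24 and 2^53. This is just a sketch; `to_float32` is a hypothetical helper of mine that rounds through the standard `struct` module, and the double checks lean on Python’s float being a 64 bit double.

```python
import struct

def to_float32(x):
    """Round a Python float (a double) to the nearest 32-bit float."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

# 24 bits of resolution: 2^24 - 1 still fits, 2^24 + 1 does not
fits_float = to_float32(2.0 ** 24 - 1)
lost_float = to_float32(2.0 ** 24 + 1)

# 53 bits of resolution: the same experiment for a double
fits_double = float(2 ** 53 - 1) == 2 ** 53 - 1
lost_double = float(2 ** 53 + 1) == 2 ** 53 + 1

print(fits_float, lost_float, fits_double, lost_double)
```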

3. Exponent (of 2^n) (Float: 8 bits bias 127, Double: 11 bits bias 1023)

Derived from shifting the binary number into scientific notation style format, this exponent is stored in a biased format (bias + exponent) rather than sign-magnitude format. This allows both positive and negative exponents to be stored as unsigned values, e.g. the exponent of 5 from the mantissa example is stored in a float as 127 + 5 = 132.
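Here is a small Python sketch that pulls the stored exponent field out of a float and a double so you can see the bias at work. The function names are my own, and the shifts and masks assume the standard IEEE 754 layouts (8 exponent bits above 23 mantissa bits, 11 exponent bits above 52 mantissa bits).

```python
import struct

def stored_exponent_float(x):
    """The raw 8-bit exponent field of x as a 32-bit float."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return (bits >> 23) & 0xFF

def stored_exponent_double(x):
    """The raw 11-bit exponent field of x as a 64-bit double."""
    (bits,) = struct.unpack("<Q", struct.pack("<d", x))
    return (bits >> 52) & 0x7FF

# 50.5 = 0b1.100101 * 2^5, so the true exponent is 5 in both formats
print(stored_exponent_float(50.5), stored_exponent_float(50.5) - 127)
print(stored_exponent_double(50.5), stored_exponent_double(50.5) - 1023)

# a negative exponent: 0.25 = 2^-2
print(stored_exponent_float(0.25) - 127)
```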

Any other information beyond the storage capacities of mantissa and exponent will be lost in the approximation process. Some applications will require way beyond what float and double can provide, and that is important to know before you wrap up your software design.

Usage instructions for determining float/double resolution for a given max number

The provided table has two main columns:

Column 1: Decimal Value, e.g. 128
Column 2: Exponent for Equivalent Binary Place, e.g. 7 is for 2^7=128

Step 1. Find max absolute value of a particular variable in question
- Call this ‘max_abs_value’

Step 2. Look up the two adjacent bounding numbers of max_abs_value in column 1
- If your number is exactly a row’s decimal value, use that row for step 3

Step 3. Use the upper bounding row and look at column 2 for the exponent value
- Every value up to max_abs_value lies below this power of 2, which reduces the problem to the ‘Simple’ case
- Call this ‘max_binary_exponent’

Step 4. Subtract the bits of resolution in your data type from max_binary_exponent
- Call this ‘min_binary_exponent’
- Float: 24 bits of resolution
- Double: 53 bits of resolution
- If you’re looking at a non-float/double, use the mantissa (significand) size in bits + 1

Step 5. Either:

a. Run (2^min_binary_exponent) through your calculator, and you’ve got the resolution. Skip step 6.

or

b. Look up the row which has min_binary_exponent in column 2

Step 6. In the same row, look at the decimal value
- This is the ‘resolution’ of your floating point number
- i.e. your number is represented in increments of ‘resolution’
- If the decimal value is not exact, or you want to verify accuracy, you can do your own calculation with: 2^(min_binary_exponent)

Step 7. Decide if float or double provides enough resolution for your variable

Usage example

1. Looking at a current measurement in amps, which is about 3.5 amps max: max_abs_value = 3.5

2. Searching through the table yields these two adjacent bounding rows around 3.5

Col 1: Decimal Value    Col 2: 2^n Exponent Value    Comment
2                       1                            Lower bounding number
4                       2                            Upper bounding number, using this for step 3

3. Reading column 2 of the upper bounding row: max_binary_exponent = 2

4. Calculating for Float: min_binary_exponent = max_binary_exponent - 24 = -22
Calculating for Double: min_binary_exponent = max_binary_exponent - 53 = -51

5. Lookup Results:

Col 1: Decimal Value                 Col 2: 2^n Exponent Value    Comment
0.000000238418579101562              -22                          Worst Case Single Float Resolution for Max of 3.5
0.0000000000000004440892098500630    -51                          Worst Case Double Float Resolution for Max of 3.5

6. Formatted Table Decimal Values in Engineering Notation

Float Resolution for Max Value of 3.5: 238.418579101562e-009 [A]
Double Resolution for Max Value of 3.5: 444.089209850063e-018 [A]
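If you would rather let the computer do the lookup, here is a rough Python sketch of the procedure. It is my own illustration, not a library routine: `worst_case_resolution` takes the first power of two at or above the maximum, subtracts the bits of resolution (24 or 53) from its exponent, and returns 2 raised to the result. The helper `float32_gap_below` then measures the real gap just under 3.5 in a 32 bit float as a cross-check. Both assume a positive, normal input.

```python
import math
import struct

def worst_case_resolution(max_abs_value, bits_of_resolution):
    """Return (min_binary_exponent, resolution) for values up to max_abs_value."""
    # frexp writes the value as m * 2**e with 0.5 <= m < 1, so 2**e is the
    # first power of two above it. An exact power of two (m == 0.5) is its
    # own bounding row, which is e - 1.
    m, e = math.frexp(max_abs_value)
    max_binary_exponent = e - 1 if m == 0.5 else e
    min_binary_exponent = max_binary_exponent - bits_of_resolution
    return min_binary_exponent, 2.0 ** min_binary_exponent

def float32_gap_below(x):
    """Measured gap between positive 32-bit float x and the float just under it."""
    packed = struct.pack("<f", x)
    (bits,) = struct.unpack("<I", packed)
    (current,) = struct.unpack("<f", packed)
    (previous,) = struct.unpack("<f", struct.pack("<I", bits - 1))
    return current - previous

float_exp, float_res = worst_case_resolution(3.5, 24)
double_exp, double_res = worst_case_resolution(3.5, 53)

print(float_exp, float_res)      # exponent and resolution for float
print(double_exp, double_res)    # exponent and resolution for double
print(float32_gap_below(3.5))    # measured float gap, for comparison
```

For a 3.5 amp maximum the calculated float resolution and the measured gap should agree, at roughly 0.24 microamps.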

And that’s it! Enjoy.

Decimal to Binary Exponent Table

Col 1: Decimal Value    Col 2: n Value (of 2^n)    Col 3: Scientific Notation
0.0000000000000000000000008271806125530280 -80 8.27180612553028000000E-25
0.0000000000000000000000016543612251060600 -79 1.65436122510606000000E-24
0.0000000000000000000000033087224502121100 -78 3.30872245021211000000E-24
0.0000000000000000000000066174449004242200 -77 6.61744490042422000000E-24
0.0000000000000000000000132348898008484000 -76 1.32348898008484000000E-23
0.0000000000000000000000264697796016969000 -75 2.64697796016969000000E-23
0.0000000000000000000000529395592033938000 -74 5.29395592033938000000E-23
0.0000000000000000000001058791184067880000 -73 1.05879118406788000000E-22
0.0000000000000000000002117582368135750000 -72 2.11758236813575000000E-22
0.0000000000000000000004235164736271500000 -71 4.23516473627150000000E-22
0.0000000000000000000008470329472543000000 -70 8.47032947254300000000E-22
0.0000000000000000000016940658945086000000 -69 1.69406589450860000000E-21
0.0000000000000000000033881317890172000000 -68 3.38813178901720000000E-21
0.0000000000000000000067762635780344000000 -67 6.77626357803440000000E-21
0.0000000000000000000135525271560688000000 -66 1.35525271560688000000E-20
0.0000000000000000000271050543121376000000 -65 2.71050543121376000000E-20
0.0000000000000000000542101086242752000000 -64 5.42101086242752000000E-20
0.0000000000000000001084202172485500000000 -63 1.08420217248550000000E-19
0.0000000000000000002168404344971010000000 -62 2.16840434497101000000E-19
0.0000000000000000004336808689942020000000 -61 4.33680868994202000000E-19
0.0000000000000000008673617379884040000000 -60 8.67361737988404000000E-19
0.0000000000000000017347234759768100000000 -59 1.73472347597681000000E-18
0.0000000000000000034694469519536100000000 -58 3.46944695195361000000E-18
0.0000000000000000069388939039072300000000 -57 6.93889390390723000000E-18
0.0000000000000000138777878078145000000000 -56 1.38777878078145000000E-17
0.0000000000000000277555756156289000000000 -55 2.77555756156289000000E-17
0.0000000000000000555111512312578000000000 -54 5.55111512312578000000E-17
0.0000000000000001110223024625160000000000 -53 1.11022302462516000000E-16
0.0000000000000002220446049250310000000000 -52 2.22044604925031000000E-16
0.0000000000000004440892098500630000000000 -51 4.44089209850063000000E-16
0.0000000000000008881784197001250000000000 -50 8.88178419700125000000E-16
0.000000000000001776356839400250 -49 1.77635683940025000000E-15
0.000000000000003552713678800500 -48 3.55271367880050000000E-15
0.000000000000007105427357601000 -47 7.10542735760100000000E-15
0.000000000000014210854715202000 -46 1.42108547152020000000E-14
0.000000000000028421709430404000 -45 2.84217094304040000000E-14
0.000000000000056843418860808000 -44 5.68434188608080000000E-14
0.000000000000113686837721616000 -43 1.13686837721616000000E-13
0.000000000000227373675443232000 -42 2.27373675443232000000E-13
0.000000000000454747350886464000 -41 4.54747350886464000000E-13
0.000000000000909494701772928000 -40 9.09494701772928000000E-13
0.000000000001818989403545860000 -39 1.81898940354586000000E-12
0.000000000003637978807091710000 -38 3.63797880709171000000E-12
0.000000000007275957614183430000 -37 7.27595761418343000000E-12
0.000000000014551915228366900000 -36 1.45519152283669000000E-11
0.000000000029103830456733700000 -35 2.91038304567337000000E-11
0.000000000058207660913467400000 -34 5.82076609134674000000E-11
0.000000000116415321826935000000 -33 1.16415321826935000000E-10
0.000000000232830643653870000000 -32 2.32830643653870000000E-10
0.000000000465661287307739000000 -31 4.65661287307739000000E-10
0.000000000931322574615479000000 -30 9.31322574615479000000E-10
0.000000001862645149230960000000 -29 1.86264514923096000000E-09
0.000000003725290298461910000000 -28 3.72529029846191000000E-09
0.000000007450580596923830000000 -27 7.45058059692383000000E-09
0.000000014901161193847700000000 -26 1.49011611938477000000E-08
0.000000029802322387695300000000 -25 2.98023223876953000000E-08
0.000000059604644775390600000000 -24 5.96046447753906000000E-08
0.000000119209289550781000000000 -23 1.19209289550781000000E-07
0.000000238418579101562000000000 -22 2.38418579101562000000E-07
0.000000476837158203125000000000 -21 4.76837158203125000000E-07
0.000000953674316406250000000000 -20 9.53674316406250000000E-07
0.000001907348632812500000000000 -19 1.90734863281250000000E-06
0.000003814697265625000000000000 -18 3.81469726562500000000E-06
0.000007629394531250000000000000 -17 7.62939453125000000000E-06
0.000015258789062500000000000000 -16 1.52587890625000000000E-05
0.000030517578125000000000000000 -15 3.05175781250000000000E-05
0.000061035156250000000000000000 -14 6.10351562500000000000E-05
0.000122070312500000000000000000 -13 1.22070312500000000000E-04
0.000244140625000000000000000000 -12 2.44140625000000000000E-04
0.000488281250000000000000000000 -11 4.88281250000000000000E-04
0.000976562500000000000000000000 -10 9.76562500000000000000E-04
0.001953125000000000000000000000 -9 1.95312500000000000000E-03
0.003906250000000000000000000000 -8 3.90625000000000000000E-03
0.007812500000000000000000000000 -7 7.81250000000000000000E-03
0.015625000000000000000000000000 -6 1.56250000000000000000E-02
0.031250000000000000000000000000 -5 3.12500000000000000000E-02
0.062500000000000000000000000000 -4 6.25000000000000000000E-02
0.125000000000000000000000000000 -3 1.25000000000000000000E-01
0.250000000000000000000000000000 -2 2.50000000000000000000E-01
0.500000000000000000000000000000 -1 5.00000000000000000000E-01
1.0 0 1.00000000000000000000E+00
2.0 1 2.00000000000000000000E+00
4.0 2 4.00000000000000000000E+00
8.0 3 8.00000000000000000000E+00
16.0 4 1.60000000000000000000E+01
32.0 5 3.20000000000000000000E+01
64.0 6 6.40000000000000000000E+01
128.0 7 1.28000000000000000000E+02
256.0 8 2.56000000000000000000E+02
512.0 9 5.12000000000000000000E+02
1024.0 10 1.02400000000000000000E+03
2048.0 11 2.04800000000000000000E+03
4096.0 12 4.09600000000000000000E+03
8192.0 13 8.19200000000000000000E+03
16384.0 14 1.63840000000000000000E+04
32768.0 15 3.27680000000000000000E+04
65536.0 16 6.55360000000000000000E+04
131072.0 17 1.31072000000000000000E+05
262144.0 18 2.62144000000000000000E+05
524288.0 19 5.24288000000000000000E+05
1048576.0 20 1.04857600000000000000E+06
2097152.0 21 2.09715200000000000000E+06
4194304.0 22 4.19430400000000000000E+06
8388608.0 23 8.38860800000000000000E+06
16777216.0 24 1.67772160000000000000E+07
33554432.0 25 3.35544320000000000000E+07
67108864.0 26 6.71088640000000000000E+07
134217728.0 27 1.34217728000000000000E+08
268435456.0 28 2.68435456000000000000E+08
536870912.0 29 5.36870912000000000000E+08
1073741824.0 30 1.07374182400000000000E+09
2147483648.0 31 2.14748364800000000000E+09
4294967296.0 32 4.29496729600000000000E+09
8589934592.0 33 8.58993459200000000000E+09
17179869184.0 34 1.71798691840000000000E+10
34359738368.0 35 3.43597383680000000000E+10
68719476736.0 36 6.87194767360000000000E+10
137438953472.0 37 1.37438953472000000000E+11
274877906944.0 38 2.74877906944000000000E+11
549755813888.0 39 5.49755813888000000000E+11
1099511627776.0 40 1.09951162777600000000E+12
2199023255552.0 41 2.19902325555200000000E+12
4398046511104.0 42 4.39804651110400000000E+12
8796093022208.0 43 8.79609302220800000000E+12
17592186044416.0 44 1.75921860444160000000E+13
35184372088832.0 45 3.51843720888320000000E+13
70368744177664.0 46 7.03687441776640000000E+13
140737488355328.0 47 1.40737488355328000000E+14
281474976710656.0 48 2.81474976710656000000E+14
562949953421312.0 49 5.62949953421312000000E+14
1125899906842620.0 50 1.12589990684262000000E+15
2251799813685250.0 51 2.25179981368525000000E+15
4503599627370500.0 52 4.50359962737050000000E+15
9007199254740990.0 53 9.00719925474099000000E+15
18014398509482000.0 54 1.80143985094820000000E+16
36028797018964000.0 55 3.60287970189640000000E+16
72057594037927900.0 56 7.20575940379279000000E+16
144115188075856000.0 57 1.44115188075856000000E+17
288230376151712000.0 58 2.88230376151712000000E+17
576460752303423000.0 59 5.76460752303423000000E+17
1152921504606850000.0 60 1.15292150460685000000E+18
2305843009213690000.0 61 2.30584300921369000000E+18
4611686018427390000.0 62 4.61168601842739000000E+18
9223372036854780000.0 63 9.22337203685478000000E+18
18446744073709600000.0 64 1.84467440737096000000E+19
36893488147419100000.0 65 3.68934881474191000000E+19
73786976294838200000.0 66 7.37869762948382000000E+19
147573952589676000000.0 67 1.47573952589676000000E+20
295147905179353000000.0 68 2.95147905179353000000E+20
590295810358706000000.0 69 5.90295810358706000000E+20
1180591620717410000000.0 70 1.18059162071741000000E+21
2361183241434820000000.0 71 2.36118324143482000000E+21
4722366482869650000000.0 72 4.72236648286965000000E+21
9444732965739290000000.0 73 9.44473296573929000000E+21
18889465931478600000000.0 74 1.88894659314786000000E+22
37778931862957200000000.0 75 3.77789318629572000000E+22
75557863725914300000000.0 76 7.55578637259143000000E+22
151115727451829000000000.0 77 1.51115727451829000000E+23
302231454903657000000000.0 78 3.02231454903657000000E+23
604462909807315000000000.0 79 6.04462909807315000000E+23
1208925819614630000000000.0 80 1.20892581961463000000E+24
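The decimal values in the table are rounded to about 15 significant digits. If you ever need a row exactly, the short Python sketch below regenerates the table using only integer arithmetic. The function name `pow2_decimal` is my own; it relies on the identity 2^-k = 5^k / 10^k to write out negative powers digit for digit.

```python
def pow2_decimal(n):
    """Exact decimal string for 2**n, using only integer arithmetic."""
    if n >= 0:
        return str(2 ** n)
    # 2**-k == 5**k / 10**k, so the digits after the decimal point are
    # 5**k, left-padded with zeros to k places
    k = -n
    return "0." + str(5 ** k).zfill(k)

# same three columns as the table: decimal value, n, scientific notation
for n in range(-80, 81):
    print(pow2_decimal(n), n, "{:.14E}".format(2.0 ** n))
```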