
I have run recently into a surprising and annoying bug in which I converted an integer into a float16 and the value changed:

>>> import numpy as np
>>> np.array([2049]).astype(np.float16)
array([2048.], dtype=float16)
>>> np.array([2049]).astype(np.float16).astype(np.int32)
array([2048], dtype=int32)

This is likely not a bug, because it also happens in PyTorch. I guess it is related to the half-float representation, but I couldn't figure out why 2049 is the first integer that is cast incorrectly.

The question is not specifically related to Python (I guess).

guhur
    The IEEE 754 spec allows `float16` 11 bits for the "base", 5 for the exponent. I imagine that trying to represent 2049 you hit the limit of the bits for the base, `2 ** 11 == 2048`. However, I'm not sure this is exactly right, since we haven't accounted for the sign bit, which should take up yet one more bit from the base, leaving only 10 bits to represent a number. Source: https://en.wikipedia.org/wiki/IEEE_754 – axolotl Mar 11 '22 at 20:55
    @axolotl just convert the comment to answer – Attersson Mar 11 '22 at 23:12
  • not sure how one converts a comment, if that's a thing. I just copied it over. @Attersson – axolotl Mar 11 '22 at 23:52
    This happens with float32 and float64 too. Floating-point rounding error isn't limited to fractional values. – user2357112 Mar 12 '22 at 00:03
    It is "dangerous" (in the sense of possibly losing information) to do any lossy conversion between two formats, and converting from a 32-bit integer to a 16-bit float is necessarily going to lose at least 16 bits of information; in practice, you will lose more because there are many floating-point values which can't be the result of converting from an integer. The question is why *wouldn't* you expect an imprecise conversion? Or is your question specifically why 2049 is the point at which this conversion starts being too imprecise for whole numbers? – kaya3 Mar 12 '22 at 00:42

2 Answers


You are right, it's in general related to how floating-point numbers are defined (in IEEE 754, as others said). Let's look into it:

The float is represented by a sign bit s (here 1 bit), a mantissa m (here 10 bits) and an exponent e (here 5 bits, for −14 ≤ e ≤ 15). The float x is then calculated by

x = (-1)**s * 1.m * b**e,

where the base b is 2 and the leading 1 is an implicit (for-free) bit that is not stored.
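This layout can be inspected directly by viewing the 16 bits of a `float16` as an unsigned integer. The sketch below is illustrative and not part of the original answer; `float16_fields` is a hypothetical helper name:

```python
import numpy as np

def float16_fields(value):
    # Reinterpret the 16 bits of a float16 as a uint16, then split
    # the fields: 1 sign bit, 5 exponent bits (biased by 15), and
    # 10 stored mantissa bits (the leading 1 is implicit).
    bits = np.float16(value).view(np.uint16)
    sign = bits >> 15
    exponent = (bits >> 10) & 0x1F
    mantissa = bits & 0x3FF
    return int(sign), int(exponent) - 15, int(mantissa)

print(float16_fields(2048.0))  # (0, 11, 0): 2048 = 1.0 * 2**11
print(float16_fields(2050.0))  # (0, 11, 1): 2050 = (1 + 1/1024) * 2**11
```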

Up to 2**11, our integers can be represented exactly by the mantissa, where

  • 2**11 - 1 is represented by m = bin(2**10 - 1) and e = bin(10)
  • 2**11 is represented by m = bin(0) and e = bin(11)

then things get interesting:

  • 2**11 + 1 = 2049 cannot be represented exactly by our mantissa and is rounded (down to 2048, since ties go to the even mantissa).
  • 2**11 + 2 = 2050 can be represented (by m = bin(1) and e = bin(11))

and so on...
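The pattern above can be checked with a quick round-trip through `float16` (a small NumPy sketch; the chosen values are just illustrative):

```python
import numpy as np

# Round-trip small integers through float16: everything up to 2**11
# survives exactly, and 2**11 + 1 = 2049 is the first that gets rounded.
for n in [2047, 2048, 2049, 2050, 2051]:
    print(n, "->", int(np.float16(n)))
# 2049 -> 2048 and 2051 -> 2052: with only 10 stored mantissa bits,
# the spacing between representable values is 2 in this range.
```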

Watch this video for detailed examples: https://www.youtube.com/watch?v=L8OYx1I8qNg

JDornheim
  • https://en.wikipedia.org/wiki/Single-precision_floating-point_format#Precision_limitations_on_integer_values details the limit for float32, above which int->FP conversion can be inexact, rounding to even number, or multiple of 4, or higher powers of 2 as an ever larger exponent is required, meaning the least significant bit of the mantissa (1ulp) represents a change in value of an ever larger power of 2. https://www.h-schmidt.net/FloatConverter/IEEE754.html is a good way to play with that: try inputs of 16777217 and 16777219 to see the default FP rounding mode (nearest, even mantissa tiebreak) – Peter Cordes Mar 12 '22 at 11:09
  • In case anyone's wondering why a biased 5-bit exponent can only represent exponents of [-14, +15] for finite numbers, the lowest exponent (all-0 encoding) implies a leading 0 bit instead of 1 for the mantissa, i.e. subnormal numbers, but the same 2^-14 multiplier. (+-0.0 could be considered subnormal numbers where the mantissa happens to be zero, or just a special case if your HW doesn't support subnormals.) An all-ones exponent field means the value isn't finite, either +-Inf, or a NaN. https://en.wikipedia.org/wiki/Single-precision_floating-point_format#Exponent_encoding – Peter Cordes Mar 12 '22 at 11:16

The IEEE 754 spec allows float16 11 bits for the significand (fraction), and 5 for the exponent. I imagine that, in trying to represent 2049, you hit the limit of the bits for the significand, 2 ** 11 == 2048.

I am unsure why 2049 becomes 2048, however.

Source: wikipedia:IEEE_754
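For what it's worth, one can at least verify numerically that 2049 sits exactly halfway between two representable float16 values, and that the default round-to-nearest mode breaks the tie toward the even mantissa (a NumPy sketch, not a full explanation):

```python
import numpy as np

# Near 2049 the float16 step size is 2, so 2048.0 and 2050.0 are the
# two nearest representable values; both round-trip exactly.
assert float(np.float16(2048)) == 2048.0
assert float(np.float16(2050)) == 2050.0

# 2049 is exactly halfway between them; round-to-nearest, ties-to-even
# picks 2048, whose stored mantissa (0) is even.
print(np.float16(2049))  # 2048.0
```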

axolotl
    You're misreading the page. It's 11 significand digits, not 11 digits for the base. That 11 figure includes the implicit bit, which is always 1 and not explicitly stored. That's why you seem to have too many bits. (The base is not explicitly stored. The base is 2, because it's binary. There's no point explicitly storing that.) – user2357112 Mar 12 '22 at 00:02
    Ah, I was actually using "base" to refer to the significand. Editing with correct terminology. – axolotl Mar 12 '22 at 00:34
  • Just like always, int->FP conversion uses the current FP rounding mode to round the actual value to a representable FP value, if the target format can't exactly represent a large integer. The default FP rounding mode is "nearest", with even (mantissa) as a tie-break. So it's exactly the same effect as rounding (float)16777217 rounding down to 16777216, and 16777219 rounding up to 16777220. (https://www.h-schmidt.net/FloatConverter/IEEE754.html and https://en.wikipedia.org/wiki/Single-precision_floating-point_format#Precision_limitations_on_integer_values) – Peter Cordes Mar 12 '22 at 11:04