
I have to use an algorithm which expects a matrix of integers as input. The input I have is real-valued, so I want to convert it to integers before passing it to the algorithm.

I thought of scaling the input by a large constant and then rounding to integers. This looks like a good solution, but how does one decide on a good constant, especially since the range of the float input can vary from case to case? Any other ideas are also welcome.

stressed_geek
  • In the question body, there is nothing concerning precision. – Seçkin Savaşçı Sep 13 '12 at 12:36
  • @Seckin by `scaling the input by a large constant`, I expect (s)he means multiplying each float by that value before rounding; this is how precision is improved. If the maximum value can be determined such that the constant will not cause overflow, then the maximum precision is attained within the width of the integer. – mah Sep 13 '12 at 12:40
  • It will never overflow in my answer because all values before scaling are smaller than 1, so `range_max_exclusive >= any element`. Pick a valid float value that has a valid integer value when cast/rounded and you are good with it. – Seçkin Savaşçı Sep 13 '12 at 12:45
  • @SeçkinSavaşçı Like Mah commented above, the idea is to preserve maximum precision within the size of the integer. – stressed_geek Sep 13 '12 at 12:45
  • @stressed_geek I gave another solution, which does exactly the same thing but without downscaling. Actually it downscales, but not using a built-in normalize function. – Seçkin Savaşçı Sep 13 '12 at 12:50
  • Is your input a 32-bit IEEE 754 floating-point value, a 64-bit 754 value, or something else? What size are the integers? Under what situations will the algorithm overflow? (E.g., if we scale to the maximum range of the integers, it may be impossible for the algorithm to do further arithmetic without overflowing.) Is translating the values as well as scaling them acceptable? – Eric Postpischil Sep 13 '12 at 13:56
  • Pending answers to the above questions, scaling to something less than the maximum range of the integers may be desirable. E.g., selecting a power of two such that scaling by it produces integers in [-2**24, 2**24] is enough to capture all the bits of the significands of the largest values (if they are 32-bit IEEE 754) but leaves some room for further arithmetic without overflow. Also, scaling by a power of two avoids introducing additional rounding errors, although there will of course be rounding when converting to integer. – Eric Postpischil Sep 13 '12 at 13:59
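Eric's power-of-two suggestion can be sketched as follows (a hedged illustration in Python; the function name `pow2_scale` and the `target_bits` parameter are invented here, not from the thread):

```python
import math

def pow2_scale(values, target_bits=24):
    """Pick a power-of-two scale so the largest |value| lands just below
    2**target_bits, then round each scaled value to the nearest integer.
    Returns (scaled integers, scale factor)."""
    max_abs = max(abs(v) for v in values)
    if max_abs == 0:
        return list(values), 1.0
    # Choose exponent e so that max_abs * 2**e is in [2**(target_bits-1), 2**target_bits).
    e = target_bits - 1 - math.floor(math.log2(max_abs))
    scale = 2.0 ** e
    return [round(v * scale) for v in values], scale
```

Because the scale factor is an exact power of two, the multiplication itself introduces no extra rounding error; the only rounding happens in the final conversion to integer.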

2 Answers


Probably the best general answer to this question is to find out the maximum integer value that your algorithm can accept as a matrix element without causing overflow in the algorithm itself. Once you have this maximum value, find the maximum floating-point value in your input data, then scale your inputs by the ratio of these two maximum values and round to the nearest integer (avoid truncation).

In practice you probably cannot do this because you probably cannot determine what is the maximum integer value that the algorithm can accept without overflowing. Perhaps you don't know the details of the algorithm, or it depends in a complicated way on all of the input values. If this is the case, you'll just have to pick an arbitrary maximum input value that seems to work well enough.

Ian Goldby

First normalize your input to the [0,1) range, then use a common way to scale it:

f(x) = range_max_exclusive * x + range_min_inclusive

After that, cast f(x) (or round it, if you wish) to an integer. That way you can handle situations where the real values are in the range [0,1) or [0,n) with n > 1.
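A sketch of this two-step approach (assuming Python; the function name is invented here, and min/max normalization actually yields [0,1] rather than [0,1), which is close enough for illustration):

```python
def to_ints(values, range_max_exclusive, range_min_inclusive=0):
    """Normalize values to [0, 1], then apply the answer's formula
    f(x) = range_max_exclusive * x + range_min_inclusive and round."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        # All inputs equal: normalization is degenerate, map everything to the minimum.
        return [round(range_min_inclusive)] * len(values)
    return [round(range_max_exclusive * ((v - lo) / span) + range_min_inclusive)
            for v in values]
```

With `range_min_inclusive = 0` this reduces to pure scaling, which is what the question asked for; a nonzero minimum also translates the values (see the comment below the answer).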

In general, your favourite library provides matrix operations with which you can implement this technique easily and with better performance than a hand-rolled implementation.

EDIT: Scaling down and then scaling up is sure to lose some precision. I favour it because a normalization operation generally comes with the library. You can also do it without downscaling:

f(x) = range_max_exclusive / max_element * x + range_min_inclusive
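The one-step variant might look like this (again a sketch with invented names; it assumes non-negative inputs, as in the answer's [0,n) example):

```python
def to_ints_direct(values, range_max_exclusive, range_min_inclusive=0):
    """Apply f(x) = range_max_exclusive / max_element * x + range_min_inclusive
    in a single pass, without an intermediate normalization step."""
    max_element = max(values)
    if max_element == 0:
        return [round(range_min_inclusive)] * len(values)
    factor = range_max_exclusive / max_element
    return [round(factor * v + range_min_inclusive) for v in values]
```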
Seçkin Savaşçı
  • This answer translates as well as scales (it maps *x* to *ax* + *b*), but the question only requested scaling (*b* is zero). We do not know whether the subsequent algorithm produces a result that is invariant after translation (or can be corrected by some simple transformation). – Eric Postpischil Sep 13 '12 at 16:10