Converting SIGNED fractions to UNSIGNED fixed point for addition and multiplication

Question

How can we convert floating-point numbers to their "fixed-point representations" and use those representations in fixed-point operations such as addition and multiplication? The result of the fixed-point operation must yield the correct answer when converted back to floating point.

Say:

(double)(xa_double) + (double)(xb_double) = ?

Then we convert both addends to a fixed point representation (integer),

(int)(xa_fixed) + (int)(xb_fixed) = (int) (xsum_fixed)

To get (double)(xsum_double), we convert (int)(xsum_fixed) back to floating point, which should yield the same answer,

FixedToDouble(xsum_fixed) => xsum_double

Specifically, if the values of xa_double and xb_double lie between -1.65 and 1.65, I want to convert xa_double and xb_double to their respective 10-bit fixed-point representations (0x0000 to 0x03FF).

WHAT I HAVE TRIED

int fixed_MAX = 1023;
int fixed_MIN = 0;
double Value_MAX = 1.65;
double Value_MIN = -1.65;

double slope = ((fixed_MAX) - (fixed_MIN))/((Value_MAX) - (Value_MIN));

int DoubleToFixed(double x)
{
    return round(((x) - Value_MIN)*slope + fixed_MIN); //via interpolation method
}

double FixedToDouble(int x)
{
    return (double)((((x) + fixed_MIN)/slope) + Value_MIN);
}

int sum_fixed(int x, int y)
{
    return (x + y - (1.65*slope)); //analysis, just basic math
}

int subtract_fixed(int x, int y)
{
    return (x - y + (1.65*slope));
}

int product_fixed(int x, int y)
{
    return (((x * y) - (slope*slope*((1.65*FixedToDouble(x)) + (1.65*FixedToDouble(y)) + (1.65*1.65))) + (slope*slope*1.65)) / slope);
}

And if I want to add (double)(1.00) + (double)(2.00), which should yield (double)(3.00),

With my code,

xsum_fixed = sum_fixed(DoubleToFixed(1.00), DoubleToFixed(2.00));
xsum_double = FixedToDouble(xsum_fixed);

I get the answer:

xsum_double = 3.001613

Which is very close to the correct answer (double)(3.00)

Also, if I perform multiplication and subtraction I get 2.004839 and -1.001613, respectively.

HERE'S THE CATCH:

So I know my code is working, but how can I perform addition, multiplication and subtraction on these fixed-point representations without any INTERNAL FLOATING POINT OPERATIONS AND NUMBERS?

So in the code above, the functions sum_fixed, product_fixed, and subtract_fixed have internal floating point numbers (slope and 1.65, 1.65 being the MAX float input). I derived my code by basic math, really.

So I want to implement add, subtract, and product functions without any internal floating point operations or numbers.

UPDATE:

I also found a simpler code in converting fractional numbers to fixed-point:

const int scale = 16; // 16 fraction bits (resolution 1/2^16) in a 32-bit int
#define FractionMask ((1 << scale) - 1) // assumed definition: the low `scale` bits

#define DoubleToFixed(x) (int)((x) * (double)(1<<scale))
#define FixedToDouble(x) ((double)(x) / (double)(1<<scale))
#define FractionPart(x) ((x) & FractionMask)

#define MUL(x,y) (((long long)(x)*(long long)(y)) >> scale)
#define DIV(x, y) (((long long)(x)<<scale)/(y))

However, this converts only UNSIGNED fractions to UNSIGNED fixed-point. I want to convert SIGNED fractions (-1.65 to 1.65) to UNSIGNED fixed-point (0x0000 to 0x03FF). How can I do this with the code above? Do the range or the number of bits have something to do with the conversion process? Is this code only for positive fractions?

credits to @chux

@chux thanks for the interpolation method! Please feel free to answer this one. — chaine09, Dec 07 '15 at 02:56
Why is there a -1 on my question? What's wrong with you people! – chaine09, Dec 07 '15 at 03:01
Not sure but I think you want `return (double)((((x) + fixed_MIN)/slope) + Value_MIN);` --> `return (double)((((x) - fixed_MIN)/slope) + Value_MIN);` (OTOH, I see that term is 0, so adding or subtracting makes little difference.) – chux - Reinstate Monica, Dec 07 '15 at 03:04
@chux your conversion is correct, but you can't add these fixed-point values directly and get the correct answer. Also, my add function involves internal floating point operations and numbers. – chaine09, Dec 07 '15 at 03:09
You _can_ add them. Why do you say that you cannot? You added them and the result was about 3.00 - so what is the problem? You added 2 `int`s and the sum was as expected. — chux - Reinstate Monica, Dec 07 '15 at 03:10
@chux because if I add FixedToDouble(DoubleToFixed(1.00) + DoubleToFixed(2.00)), I will get 4.653226 which is not equal to 3.00. — chaine09, Dec 07 '15 at 03:14
@chux 1.00 and 2.00 are double. I convert them both to fixed, performed regular addition, and converted it back to double and get 4.653226 — chaine09, Dec 07 '15 at 03:15
Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/97151/discussion-between-chux-and-user2569770). — chux - Reinstate Monica, Dec 07 '15 at 03:17
@chux Likewise, FixedToDouble(DoubleToFixed(0.4) + DoubleToFixed(0.4)), the answer is 2.453226 — chaine09, Dec 07 '15 at 03:19
@chux the values I got solving a difference equation (iterative in nature, solving for output Y[i], where i = 0, 1, 2, ...) are not so accurate compared to the actual values, since errors accumulate. Is this normal? – chaine09, Dec 07 '15 at 04:05
@chux The actual values = 0.989100 0.969041 0.953589 0.944778 0.943678 0.950274 0.963490 0.981357 1.001309 1.020548 1.036433 1.046846 1.050468 1.046947 1.036924 1.021923 1.004121 0.986038 0.970173 0.958669 -0.036082 -0.015167 0.007387 0.028432 0.045096 0.055176 0.057423 0.051697 0.038960 0.021124 0.000759 -0.019268 -0.036202 -0.047772 -0.052499 -0.049878 -0.040432 -0.025610 -0.007576 0.011112 — chaine09, Dec 07 '15 at 04:07
@chux answers from my code = 0.988710 0.966129 0.946774 0.933871 0.930645 0.937097 0.953226 0.975806 1.001613 1.024194 1.040323 1.050000 1.050000 1.040323 1.024194 1.001613 0.975806 0.953226 0.937097 0.930645 -1.688710 -1.624194 -1.569355 -1.527419 -1.504839 -1.504839 -1.527419 -1.566129 -1.614516 -1.669355 -1.720968 -1.762903 -1.795161 -1.811290 -1.808065 -1.788710 -1.753226 -1.708065 -1.659677 -1.617742. These are quite accurate for the first values but not for the values at the bottom. – chaine09, Dec 07 '15 at 04:15
Even if you get the math right, the basic theory of using the IEEE-754 bits as an unsigned long to add, multiply and subtract but then get a floating point number back out that bears any relationship to the original is just flat wrong. The reason being is the encoded floating point notation is made up of a *sign-bit*, *-127 encoded exponent*, and *significand/mantissa*. Any (meaning *any*) operation on the bits that make up this encoded floating point representation will destroy any relationship it has to the original. So if that was the goal, you need to toss it out as a *wild idea*... — David C. Rankin, Dec 07 '15 at 05:56
@David C. Rankin, can you explain further? I am not familiar with the standard, and my main goal is just to represent my input, which can be any real number in the range of -1.65 to 1.65. So instead of initializing it as double, I want to convert it to a 10-bit integer representation (0x0000 to 0x03FF), and as mentioned by chux, this isn't really fixed-point math. I just want my inputs to be converted to unsigned 10-bit integers so that I can perform regular arithmetic operations, then just convert my final answer back to its original real-number representation. – chaine09, Dec 07 '15 at 06:33
Sure, when you say *convert floating point numbers to their "fixed-point representations"* if you are talking about using the 64-bit (or 32-bit) `unsigned long` number that is made up of the floating point bits, doing math, and getting a floating point back related to the first -- that will not work. Yes, you can read the bits of a floating point as an unsigned number, but changing any bit will destroy the floating point makeup. Floating point number are stored in IEEE-754 format (search this site). It is a special format encoding the floating point value. Changing the bits, destroys that. — David C. Rankin, Dec 07 '15 at 08:17
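To make the point about the encoding concrete, here is a small sketch in C. It assumes `double` is IEEE-754 binary64 (common, but not guaranteed by the C standard), and the helper names (`double_bits`, `bits_double`, `add_raw_bits` and the field extractors) are made up for this illustration. It splits a double into its sign, exponent and fraction fields and shows what happens when two raw encodings are added as integers.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helpers: copy the raw bit pattern of a double in and out. */
static uint64_t double_bits(double d)
{
    uint64_t u;
    assert(sizeof u == sizeof d);   /* assumption: double is 64 bits wide */
    memcpy(&u, &d, sizeof u);
    return u;
}

static double bits_double(uint64_t u)
{
    double d;
    memcpy(&d, &u, sizeof d);
    return d;
}

/* binary64 layout: 1 sign bit, 11 exponent bits (bias 1023), 52 fraction bits. */
static unsigned sign_field(double d)     { return (unsigned)(double_bits(d) >> 63); }
static unsigned exponent_field(double d) { return (unsigned)((double_bits(d) >> 52) & 0x7FFu); }
static uint64_t fraction_field(double d) { return double_bits(d) & ((UINT64_C(1) << 52) - 1); }

/* Integer addition of the encodings; this is NOT floating-point addition. */
static double add_raw_bits(double a, double b)
{
    return bits_double(double_bits(a) + double_bits(b));
}
```

Under that assumption, 1.0 is stored as 0x3FF0000000000000, so `add_raw_bits(1.0, 1.0)` produces the encoding 0x7FE0000000000000, which decodes to 2^1023 rather than 2.0. The scale-and-round conversion in the question never touches these bits, which is why it behaves differently from what this comment warns about.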
@David C. Rankin, what I wish to accomplish is this: instead of representing real number inputs as floats, I want to represent them as 10-bit unsigned integers for addition and multiplication, and if I convert the result back to float, it should yield the same answer. Is there a standard way to convert floats to n-bit unsigned integers? – chaine09, Dec 07 '15 at 11:38
Before going further, let's make sure we are on the same sheet of paper. Look at [**Converting to/from IEEE 754 single-precision floating point format**](http://teaching.idallen.com/cst8281/10w/notes/100_ieee754_conversions.txt) That will explain the logistics of the conversion process. If you want a *10-bit* representation, that would mean your accuracy would be limited to a *6-bit* mantissa. That is the only way I see you being able to do the type of conversion you are talking about. Given the normalization of the mantissa, I'm not sure you can do math on it without removing the norm first. – David C. Rankin, Dec 07 '15 at 15:30
@DavidC.Rankin BTW I am converting from double to int, I have to check out the link first — chaine09, Dec 07 '15 at 15:33
@DavidC.Rankin so you're saying that the above code is wrong? The interpolation part? I didn't consider the inputs as double, but regular real numbers. — chaine09, Dec 07 '15 at 15:35
Ummm... You mean `double` to `long int` right? *IEEE-754 single-precision* format (32-bits) -- *IEEE-754 double-precision* format (64-bits). What I'm saying is the interpolation can be 100% correct, but you are still not getting the correct float back -- right? Right. (if chux helped with the interpolation, then they are 100% correct) The problem is how the floating point numbers are stored. Read/understand the link I posted - do a couple of conversions by hand (it has examples), then I think you will understand what I, and others, are saying about the problem you are facing. — David C. Rankin, Dec 07 '15 at 15:35
@DavidC.Rankin https://www.youtube.com/results?search_query=intro+to+fixed+point this video tutorial does not bother with the standard of floating numbers, but a simple multiplication of scaling factors to convert to fixed-point. However, I still can't deduce how I can represent my signed real number input as unsigned 10-bit integer output from this. — chaine09, Dec 07 '15 at 15:40
I'm sorry, I don't Utube, (too old), but this one on the same page looked like one you should see [**Into to floating point**](https://www.youtube.com/watch?v=8M3zllpb1zA) All the rest of the videos listed were `fixed point` and none were conversion from `floating point` to `fixed point`. – David C. Rankin, Dec 07 '15 at 15:46
@DavidC.Rankin I read the first few examples in the link. "If you want a 10-bit representation, that would mean your accuracy would be limited too a 6-bit mantissa" how did you come up with a 6-bit mantissa? — chaine09, Dec 07 '15 at 15:47
@DavidC.Rankin my task is to convert real numbers to "10-bit unsigned integer", so with IEEE 754 it contains a sign bit, but the integer equivalent (say, 32 bits) does not take on negative values. So this is okay, right? — chaine09, Dec 07 '15 at 15:51
@DavidC.Rankin please enlighten me a bit more, since the standard is too complex to code; moreover, I have to use these 10-bit representations in arithmetic operations and convert them back to double. Is there a simpler way? – chaine09, Dec 07 '15 at 15:55
@DavidC.Rankin is the code above utterly incorrect? since I did not consider the standard? — chaine09, Dec 07 '15 at 16:01
@DavidC.Rankin yes, none really address the problem of conversion, only the alternative representation of real numbers by fixed point instead of floating point. I'm not sure what I have to do: convert float to fixed, or directly convert to fixed. I am given an analog input which is sampled at discrete instants of time, which means the values it takes are continuous and can be represented as float or as fixed. – chaine09, Dec 07 '15 at 16:32
@DavidC.Rankin do I have to get bogged down in the details of the IEEE 754 standard and the bits comprising the floating point representation, or can I just focus on the integer math and then convert values to (double) or (int), whichever I intend to have? – chaine09, Dec 07 '15 at 16:36
This is where it really depends on what you are doing. If you are starting with a *floating point value* and you want to do *anything* with those bits and expect to get *anything* meaningful back in a *floating point* value that has *any* relationship to the original *floating point* value -- then YES, you have to wade through all of that MUCK. (that may be why you don't find any handy videos on doing what it is you are attempting to do -- no?) — David C. Rankin, Dec 07 '15 at 16:41
@DavidC.Rankin yes, but what I was thinking is to not consider it as a floating point, but to consider the input as a fraction that I want to represent as an integer, so that I can perform regular operations. Can't I do it that way? – chaine09, Dec 07 '15 at 18:36
To all, please don't downvote especially if you can't answer. Thanks — chaine09, Dec 08 '15 at 00:33

Anton Knyazyev · Answer 1 · 2015-12-07T15:41:15.803

You can have the mantissa of the floating point representation of your number be equal to its fixed point representation. Since FP addition shifts the smaller operand's mantissa until both operands have the same exponent, you can add a certain 'magic number' to force it. For double, it's 1<<(52-precision) (52 is double's mantissa size, 'precision' is the required number of binary precision digits). So the conversion would look like this:

union { double f; long long i; } u = { xfloat + (1ll << (52 - precision)) }; // shift x's mantissa
long long xfixed = u.i & ((1ll << 52) - 1); // extract the mantissa

After that you can use xfixed in integer math (for multiplication, you'd have to shift the result right by 'precision'). To convert it back to double, simply multiply it by 1.0/(1 << precision).

Note that it doesn't handle negatives. If you need them, you'd have to convert them to the complementary representation manually (first fabs the double, then negate the int result if the input was negative).
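As a rough sketch of how the pieces above might fit together, the following C code wraps the magic-number conversion in a function and handles negatives by taking the absolute value first, as the note suggests. The function names are hypothetical, `memcpy` is used in place of the union, and the code assumes IEEE-754 binary64 doubles in the default round-to-nearest mode with |x| below 2^(52 - precision). The product is scaled back with a division instead of a right shift, because right-shifting a negative value is implementation-defined in C.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: convert x to a signed fixed-point integer with
 * `precision` fraction bits via the magic-number trick.
 * Assumes IEEE-754 binary64 and round-to-nearest. */
static int64_t double_to_fixed_magic(double x, int precision)
{
    assert(precision > 0 && precision < 52);
    const double magic = (double)(INT64_C(1) << (52 - precision));
    double magnitude = (x < 0.0) ? -x : x;   /* the trick only handles x >= 0 */
    double shifted = magnitude + magic;      /* pins the exponent at 52 - precision */

    uint64_t bits;
    memcpy(&bits, &shifted, sizeof bits);    /* read the raw encoding */
    int64_t fixed = (int64_t)(bits & ((UINT64_C(1) << 52) - 1)); /* 52 mantissa bits */

    return (x < 0.0) ? -fixed : fixed;       /* restore the sign manually */
}

/* Convert back by multiplying with 1 / 2^precision. */
static double fixed_to_double(int64_t fixed, int precision)
{
    return (double)fixed * (1.0 / (double)(INT64_C(1) << precision));
}

/* The raw product carries 2 * precision fraction bits, so scale it back down.
 * Division truncates toward zero for negative products. */
static int64_t fixed_mul(int64_t a, int64_t b, int precision)
{
    return (a * b) / (INT64_C(1) << precision);
}
```

With precision = 8, 1.25 converts to 320 and 1.65 converts to 422, because 1.65 * 256 = 422.4 is rounded to the nearest integer by the floating-point addition itself.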

edited Dec 07 '15 at 15:41

answered Dec 07 '15 at 15:03


can you please explain your code further? I don't quite understand it. What will I have to do to perform integer operations in a nutshell? My range is -1.65 to 1.65, and I have to convert it to 10-bit unsigned integer from 0x0000 to 0x03FF. I still don't get the floating point representation (mantissa, signed bit, exponent) – chaine09 Dec 07 '15 at 15:22
1ll<<52-precision what is 1ll? – chaine09 Dec 07 '15 at 15:23
Can you put comments as to what parts of your code does? – chaine09 Dec 07 '15 at 15:30
you'd have to decide how many fixed precision digits you want. since your range is 1.65*2=3.3, you need to reserve at least 2 bits for the integer part, so your fixed numbers will be 2.8 (i.e. precision = 8 in the formula). if your range were >0, you could use xfixed directly in integer math. with the negative range, you'd have to add 1.65 to xfloat. + and - would still work, but not * – Anton Knyazyev Dec 07 '15 at 15:36
do I have to take into account the IEEE 754 standard? – chaine09 Dec 07 '15 at 15:58
I don't quite understand. I know that the standard double-precision floating point format has 64 bits (1 sign bit, 11 exponent bits, 52 mantissa bits). Then how do I convert it to a 10-bit fixed-point representation which can take the values from 0x0000 => -1.65 to 0x03FF => 1.65, in a nutshell? I just have to get the gist of the code I will be writing. It's kind of hard to digest all of the information at this point. – chaine09 Dec 07 '15 at 16:07
my code converts double to 2.8 fixed (except replace xfloat with xfloat+1.65 since you need negatives). i and f are just to get access to double's binary representation – Anton Knyazyev Dec 07 '15 at 16:11
uhm, can this work also? https://www.youtube.com/results?search_query=intro+to+fixed+point in converting to 10-bit fixed point with 2,8 precision? – chaine09 Dec 07 '15 at 16:18
How can I use your code? union { double f; long long i; } u = { xfloat+(1ll<<52-precision) }; // shift x's mantissa long long xfixed = u.i & (1ll<<52)-1; // extract the mantissa – chaine09 Dec 07 '15 at 16:24
what is the precision? – chaine09 Dec 07 '15 at 16:28
precision is 8. truthfully, my code is not much better than the natural way (where you just scale and cast to int, as in the video), maybe a bit faster. note that the conversion is not lossless in either case, since you are cropping precision to 8 bits, and 1.65 needs more in the binary form – Anton Knyazyev Dec 07 '15 at 16:40
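For comparison, the "natural way" mentioned in this comment (scale, round, cast to int) could look roughly like the sketch below, which also measures the round-trip loss. The function names and the `frac_bits` parameter are illustrative assumptions, not taken from the video or from the answer.

```c
#include <assert.h>

/* Scale by 2^frac_bits and round to the nearest integer. */
static int to_fixed(double x, int frac_bits)
{
    assert(frac_bits >= 0 && frac_bits < 30);
    double scaled = x * (double)(1 << frac_bits);
    return (int)(scaled + (scaled >= 0.0 ? 0.5 : -0.5));
}

/* Undo the scaling. */
static double to_double(int f, int frac_bits)
{
    assert(frac_bits >= 0 && frac_bits < 30);
    return (double)f / (double)(1 << frac_bits);
}

/* Absolute error introduced by converting to fixed point and back. */
static double roundtrip_error(double x, int frac_bits)
{
    double err = to_double(to_fixed(x, frac_bits), frac_bits) - x;
    return err < 0.0 ? -err : err;
}
```

With 8 fraction bits, 1.65 becomes 422 and converts back to 1.6484375, an error of roughly 0.0016, while a value such as 1.25 that is a multiple of 1/256 survives the round trip exactly.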
How can I make use of the one in the video with 2.8 precision, and with the range -1.65 to 1.65 (which has negative numbers)? The video did not state how to choose the scaling factor. Is it dependent on the range? Where does the number of bits come in? It's kind of hard for me to understand your code. Can you also add a function for addition and multiplication? Thanks for your help! – chaine09 Dec 07 '15 at 18:02
well, that's a half an hour video, you'd have to be more specific:) – Anton Knyazyev Dec 07 '15 at 18:54
Oh, the conversion to fixed-point :) it's equivalent to your code, right? – chaine09 Dec 07 '15 at 19:02
And I am really confused. What does 2.8 mean? A 10-bit fixed point representation? Why 2.8? Where in your code says something about being 2.8? Does the range of the input matter? 10 bits? I'm confused now. – chaine09 Dec 07 '15 at 19:06
2.8 means two integer bits plus 8 fraction bits, for a total of ten bits. You want to represent [-1.65,1.65] as an unsigned fixed point operand, which means you will need to implicitly re-bias the representation to [0,3.30]: this requires 2 integer bits. – njuffa Dec 07 '15 at 19:33
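One way to picture the re-biasing described here is the sketch below. It uses a power-of-two scale (256) rather than the slope of 310 from the question, stores 1.65 as the integer bias 422 (rounded from 422.4), and saturates results to the 10-bit range. All names, the choice to saturate, and the use of division instead of a shift are assumptions made for this illustration. The conversions at the boundary still use double, but the add, subtract and multiply functions use integers only.

```c
#include <assert.h>

enum {
    FRAC_BITS = 8,               /* 2.8 format: 2 integer bits, 8 fraction bits */
    ONE       = 1 << FRAC_BITS,  /* 1.0 in fixed point (256) */
    BIAS      = 422,             /* 1.65 * 256 = 422.4, rounded */
    CODE_MAX  = 0x03FF           /* largest 10-bit code */
};

/* Real value in [-1.65, 1.65] -> unsigned code in [0, 844]. */
static int encode(double x)
{
    assert(x >= -1.65 && x <= 1.65);
    double scaled = x * ONE;
    int q = (int)(scaled + (scaled >= 0.0 ? 0.5 : -0.5));  /* round to nearest */
    return q + BIAS;
}

/* Unsigned code -> real value: remove the bias, then the scale. */
static double decode(int code)
{
    return (double)(code - BIAS) / ONE;
}

/* Saturate to the 10-bit range instead of wrapping around. */
static int clamp(int code)
{
    return code < 0 ? 0 : (code > CODE_MAX ? CODE_MAX : code);
}

/* Integer-only arithmetic on biased codes. A sum contains the bias twice,
 * so one copy is removed; a difference loses it, so it is added back. */
static int add_code(int a, int b) { return clamp(a + b - BIAS); }
static int sub_code(int a, int b) { return clamp(a - b + BIAS); }

static int mul_code(int a, int b)
{
    int p = (a - BIAS) * (b - BIAS);  /* signed product with 16 fraction bits */
    return clamp(p / ONE + BIAS);     /* back to 8 fraction bits, truncating toward zero */
}
```

For example, 0.5 and 0.75 encode to 550 and 614; `add_code` gives 742, which decodes to 1.25, and `mul_code` gives 518, which decodes to 0.375. A sum such as 1.5 + 1.5 does not fit in ten bits under this scheme and saturates at 0x03FF.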
i think they are roughly the same (except for the lowest bit rounding). 2.8 is integer/fraction bits. you plug in 8 as 'precision' in my code. 2.8 is because you'd want to have the maximum precision while still having enough bits to store the integer part – Anton Knyazyev Dec 07 '15 at 19:36
Precision in your code is the same as scale in the video, right? – chaine09 Dec 07 '15 at 23:54
The video uses 32 bits and scale = 16. So mine is 10 bits and scale = 8 bits right? What do you mean by the lowest bit rounding? Is the code I presented above not valid? – chaine09 Dec 07 '15 at 23:56
@njuffa how can you rebias it to [0, 3.30]? – chaine09 Dec 07 '15 at 23:57
@user2569770: By adding 1.65 to the true value. – njuffa Dec 08 '15 at 01:14
@njuffa then proceed to converting, and when I convert the final answer back to double I just have to subtract 1.65? – chaine09 Dec 08 '15 at 01:27
@njuffa can you please check my code above which uses interpolation method and involves adding 1.65. x_fixed = slope*(x_float + 1.65). Thanks! – chaine09 Dec 08 '15 at 02:09
@user2569770 To be honest, I have no clue what exactly you are trying to do, as your questions are confusing rather than clarifying to me. But since you insist (for reasons I do not understand) that the fixed-point processing must be done with unsigned arithmetic, it would seem to me that you need to re-bias your raw data such that it is greater than or equal to 0. – njuffa Dec 08 '15 at 05:39
@njuffa, what I'm trying to do is model a digital filter with a difference equation. The inputs will be an analog signal sampled at a certain frequency. The values of the inputs are real numbers since they were sampled from an analog signal. I have an option to represent these values as double. – chaine09 Dec 08 '15 at 05:55
@njuffa And perform double operations? And output a double. But what I want to do is convert the input to 10-bit unsigned integer representations instead of double. And because of this I am guessing that the internal operations in the difference equation such as addition and subtraction must changed. It should have no floating point numbers and operations. It should be modified for 10-bit unsigned integers it will operate on. So I have to make a function that adds and multiplies these 10-bit unsigned integers. – chaine09 Dec 08 '15 at 05:56
@njuffa And get a 10-bit unsigned integer output. When I convert back to its original value (real number) it must be correct. – chaine09 Dec 08 '15 at 06:00
