Float to fixed conversion

Question

This is a basic question but I am confused.

I have a register whose format is 1.4.12. Does that mean it holds a value in the range -15.9999 to 15.9999? Is that correct, and if so, how many nines are there? The range confuses me.

I need to convert a C++ float to fixed point and put it in the register. Are there any std:: libraries for that in C++? If not, is there any standard code that someone could point me to?

It would also be good to know how to convert fixed point back to float.

What “register” are you talking about? CPU registers usually don't work that way. — 5gon12eder, Feb 12 '15 at 17:15
What "register"? How do you get those numbers from that "format"? What have you tried so far? — Some programmer dude, Feb 12 '15 at 17:15
Binary numbers don't use nines for the limits. It's a binary number. So as .9999 is 9999/10000, in your case it would be 4095/4096. — Mark Ransom, Feb 12 '15 at 17:30
It's a 32-bit register on some custom HW, where I must represent the number in 1.4.12 format. — user1876942, Feb 12 '15 at 17:40
Do I really have to spell it out for you? The range is +/- (15 + 4095/4096). That's -15.999755859375 to 15.999755859375. — Mark Ransom, Feb 12 '15 at 17:50
Sorry, you don't have to come to a question forum and answer a question, you came here of your own free will, hopefully :) — user1876942, Feb 12 '15 at 17:54
This is still a bad question. None of the information given in the comments has been added to it, and nothing clarifies it. — , Feb 12 '15 at 17:56

Paul R · Accepted Answer · 2015-02-12T17:41:51.500

3

It's fairly simple to do this yourself:

#include <stdint.h>

typedef int32_t fixed;

/* Scale by 65536 / 16 = 4096 = 2^12; the cast truncates toward zero. */
fixed float_to_fixed(float x)
{
    return (fixed)(x * 65536.0f / 16.0f);
}

Note that this has no range checking so if x can possibly be outside the valid range for your fixed point type then you might want to add some checks and either saturate or throw an error as appropriate.

Similarly for conversion in the other direction:

float fixed_to_float(fixed x)
{
    return (float)x * 16.0f / 65536.0f;
}

(This one does not need any range checking of course.)
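The range checking mentioned above could be sketched as follows. This is a minimal example and not part of the original answer. It assumes the symmetric range of +/- (15 + 4095/4096) given in the comments on the question, rounds to nearest instead of truncating, and maps NaN to 0 as an arbitrary choice. The names float_to_fixed_sat, fixed12_to_float, FIXED_MAX and FIXED_MIN are made up for illustration.

```c
#include <stdint.h>

#define FRAC_BITS 12                 /* 12 fractional bits -> scale factor 4096 */
#define FIXED_ONE (1 << FRAC_BITS)
#define FIXED_MAX 65535              /* +(15 + 4095/4096) */
#define FIXED_MIN (-65535)           /* assumption: symmetric lower limit */

/* Convert a float to 1.4.12 fixed point, clamping out-of-range inputs. */
int32_t float_to_fixed_sat(float x)
{
    if (x != x)                      /* NaN compares unequal to itself */
        return 0;                    /* arbitrary choice for this sketch */

    float scaled = x * (float)FIXED_ONE;

    if (scaled >= (float)FIXED_MAX)  /* also catches +infinity */
        return FIXED_MAX;
    if (scaled <= (float)FIXED_MIN)  /* also catches -infinity */
        return FIXED_MIN;

    /* Round to nearest: add half an LSB away from zero, then truncate. */
    return (int32_t)(scaled + (scaled >= 0.0f ? 0.5f : -0.5f));
}

/* Convert back; every in-range fixed value is exactly representable. */
float fixed12_to_float(int32_t x)
{
    return (float)x / (float)FIXED_ONE;
}
```

If the hardware field is plain 17-bit two's complement, the lower limit could be -65536 (that is, -16.0) instead; that depends on the register specification.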

edited Feb 12 '15 at 17:41

answered Feb 12 '15 at 17:16

Thanks a lot. I guess this is for 16 bits max, but in my case I have 17. Do I just need to use 32.0f and 2147483647? Any links or tips for working out the range? – user1876942 Feb 12 '15 at 17:36
Are you *sure* you have 17 bits? I assumed that was a typo. It's pretty easy to adapt the above code to whatever fixed point format you need though. If it really is 17 bits then you need to specify what size word is used (24 bits? 32 bits?) and whereabouts in the word the 17 bits are located (e.g. MS or LS). – Paul R Feb 12 '15 at 17:38
I've updated the answer now to use a 32 bit word size and put the 17 fixed point bits in the least significant 17 bits of this 32 bit word. – Paul R Feb 12 '15 at 17:42
Yes, some are 17 bits, others 18, 22 and so on. Each HW register is 32 bits long. Where I put the value depends on the register: in some it is bits 0:16, in others 5:21. There are many variations. One word is 32 bits. – user1876942 Feb 12 '15 at 17:44
OK - well the above examples should be enough to get you started - shifting the bits to the required position within the 32 bit word is easy enough of course. – Paul R Feb 12 '15 at 17:45
@PaulR: Can you expand on why you don't simply multiply by 4096? I don't understand why you're taking multiple steps... – Mooing Duck May 15 '15 at 02:13
@MooingDuck: the compiler will combine the constants resulting in a single multiply, so it amounts to the same thing, but expressing it this way in the code makes the intent more self-evident, as it shows the two constants separately. – Paul R May 15 '15 at 05:14
@PaulR: You say "having both constants makes the intent more self-evident", but I have no idea where you got those constants or what they mean. If I were writing the code, I would have written `static const int fraction_bits = 12; static const int fraction_adjustment = (1< – Mooing Duck May 15 '15 at 16:21
@MooingDuck: sure - I guess it just depends on what you are used to - to my mind fixed point has two parameters: word size and position of implied decimal point. So if the word size is 16 bits and there are 4 integer bits to the left of the DP then that translates to 2^16 = 65536 and 2^4 = 16. One could express it even more verbosely I suppose, e.g. 1<<16 and 1<<4 instead of 65536 and 16, – Paul R May 15 '15 at 16:35
@PaulR: Interesting. I view fixed point as simply storing the numerator of a fraction with a fixed denominator (in this case, 4096), and not worrying overmuch about the word size or implied decimal point. Different points of view is all. – Mooing Duck May 15 '15 at 17:09
Yes, like so many things it all depends on how you look at it. Note that fixed point formats are often specified as e.g. 1.31 or 4.12 where the two numbers indicate the number of bits before and after the decimal point. – Paul R May 15 '15 at 17:46
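The bit positioning discussed in these comments (for example a 17-bit value in bits 0:16 of one register and bits 5:21 of another) might look something like the sketch below. The helpers pack_field and unpack_field are hypothetical names, and the code assumes the hardware field holds a two's-complement value; a sign-magnitude field would need different handling.

```c
#include <assert.h>
#include <stdint.h>

/* Insert the low `width` bits of a signed fixed-point value into `reg`,
   starting at bit `lsb`, leaving all other register bits unchanged. */
uint32_t pack_field(uint32_t reg, int32_t value, unsigned lsb, unsigned width)
{
    assert(width >= 1 && width < 32 && lsb + width <= 32);

    uint32_t mask = ((UINT32_C(1) << width) - 1u) << lsb;

    /* Converting a negative int32_t to uint32_t is well defined and yields
       the two's-complement bit pattern, which is then truncated by the mask. */
    return (reg & ~mask) | (((uint32_t)value << lsb) & mask);
}

/* Extract a `width`-bit field starting at bit `lsb` and sign-extend it. */
int32_t unpack_field(uint32_t reg, unsigned lsb, unsigned width)
{
    assert(width >= 1 && width < 32 && lsb + width <= 32);

    uint32_t raw  = (reg >> lsb) & ((UINT32_C(1) << width) - 1u);
    uint32_t sign = UINT32_C(1) << (width - 1);

    /* Flipping the sign bit and subtracting its weight sign-extends raw. */
    return (int32_t)(raw ^ sign) - (int32_t)sign;
}
```

With 12 fractional bits, -1.0 is the integer -4096; pack_field(0, -4096, 5, 17) places its 17-bit pattern 0x1F000 at bits 5:21, and unpack_field recovers -4096 from the result.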

score 1 · Answer 2 · answered Jun 20 '20 at 19:27

If you need to use fixed point, then you must also implement the addition and multiplication operations yourself. In that case, you need to decide how many bits are allocated to the fractional part and how many to the integer part. The shift operations then follow from that choice.

In the following code snippet, I've implemented fixed point by allocating 22 bits to the fractional part and 9 bits to the integer part (the remaining bit is the sign).

For multiplication, I first widen each value to 64 bits to avoid overflow. After the multiplication, the product is shifted right by the number of fractional bits so that the output has the same fractional scale as the inputs.

For addition, I've added saturation to the output to avoid overflow (if overflow occurs, the output is clamped to the largest magnitude it can hold for the sign of the result).

#include <stdio.h>
#include <math.h>
#include <stdint.h>

#define fractional_bits 22
#define fixed_type_bits 32

typedef int32_t fixed_type;
typedef int64_t expand_type;

fixed_type float_to_fixed(float inp)
{
    return (fixed_type)(inp * (1 << fractional_bits));
}

float fixed_to_float(fixed_type inp)
{
    return ((float)inp) / (1 << fractional_bits);
}

fixed_type fixed_mult(fixed_type inp_1, fixed_type inp_2)
{
    return (fixed_type)(((expand_type)inp_1 * (expand_type)inp_2) >> fractional_bits);
}

fixed_type fixed_add(fixed_type inp_1, fixed_type inp_2)
{
    /* Add in the wider type so the sum itself cannot overflow
       (signed overflow in int32_t would be undefined behaviour). */
    expand_type add = (expand_type)inp_1 + (expand_type)inp_2;

    if (add > INT32_MAX)
    {
        return INT32_MAX;   /* positive overflow: clamp to the largest value */
    }
    else if (add < INT32_MIN)
    {
        return INT32_MIN;   /* negative overflow: clamp to the smallest value */
    }
    return (fixed_type)add;
}
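To make the scaling in the multiplication concrete, here is a small self-contained sketch using the same 22-bit fractional split. It is an illustration under assumptions, not the answer's code: the names q22_from_float, q22_to_float and q22_mul_sat are hypothetical, the multiply saturates (the version above does not), and it divides by 2^22 instead of shifting right, because right-shifting a negative value is implementation-defined in C. Division truncates toward zero, so negative results can differ by one LSB from a shift-based version.

```c
#include <stdint.h>

#define FRAC_BITS 22   /* 1 sign bit, 9 integer bits, 22 fractional bits */

int32_t q22_from_float(float x)
{
    return (int32_t)(x * (float)(1 << FRAC_BITS));
}

float q22_to_float(int32_t x)
{
    return (float)x / (float)(1 << FRAC_BITS);
}

/* Multiply two fixed-point values and clamp the result to 32 bits. */
int32_t q22_mul_sat(int32_t a, int32_t b)
{
    /* Each operand carries a factor of 2^22, so the raw 64-bit product
       carries 2^44; dividing by 2^22 restores the original scale. */
    int64_t product = ((int64_t)a * (int64_t)b) / ((int64_t)1 << FRAC_BITS);

    if (product > INT32_MAX)
        return INT32_MAX;
    if (product < INT32_MIN)
        return INT32_MIN;
    return (int32_t)product;
}
```

For example, 1.5 and 2.25 are stored as 6291456 and 9437184, and their product comes out as 14155776, which is 3.375 in this format. A product such as 300 * 300 does not fit in 9 integer bits, so it is clamped to INT32_MAX.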
