How to fix the position of the binary point in an unsigned N-bit integer?

Question

I am working on developing a fixed-point algorithm in C++. I know that an N-bit fixed-point binary number can be described as U(a,b). For example, if an 8-bit integer (i.e. 256 possible values) is represented in the form U(6,2), the binary point sits to the left of the 2nd bit counting from the right, giving the form:

                   b5 b4 b3 b2 b1 b0 . b(-1) b(-2)

Thus, it has 6 integer bits and 2 fractional bits. In C++, I know there are some bit shift operators I can use, but they are basically used for shifting the bits of the input stream. My question is how to define a binary fixed-point integer of the form fix<6,2> or U(6,2). All the major processing operations will be carried out on the fractional part, and I am just trying to find a way to do this fix in C++. Any help with this would be appreciated. Thanks!

Example: Suppose I have an input discrete signal with 1024 sample points on the x-axis (for now, just assume this input signal is coming from some sensor). Each of these sample points has a particular amplitude. Say the sample at time 2 (x-axis) has an amplitude of 3.67 (y-axis). Now I have a variable "int *input;" that takes the sample 2, which in binary is 0000 0100. So basically I want to turn this into 00000.100 by applying U(5,3) to the sample 2 in C++, so that I can perform the interpolation operations on fractions of the input sampling period or time.

PS - I don't want to create a separate class or use external libraries for this. I just want to take each 8 bits from my input signal, apply the U(a,b) fix to it, and then do the rest of the operations on the fractional part.

*I know there are some bit shift operators I can use, but they are basically used for shifting the bits of the input stream* hmm no.... It all depends on what you overloaded the operator to do. `x << 1` where `x` is an int will shift the bits of the int, no stream involved. – Borgleader, Feb 26 '15 at 12:14
What I meant is the case where I pass in an input signal, samples, or a stream of bits. If the input at a particular time instant is a sample value (each such value is 1 byte in size, i.e. 8 bits), I want to apply the U(a,b) fix to each incoming sample. I am a beginner in programming (C and C++) and generally more used to doing this kind of thing in Simulink. Please correct me if I'm wrong. – PsychedGuy, Feb 26 '15 at 12:30
@DigitalGeeK - use of fixed point or floating point always imposes constraints. Mentioning your use case may help select the ones that will not be detrimental. As it stands, to convert an unsigned 8-bit integer into a 6:2 fixed-point representation, all you need to do is clear the lowest-order 2 bits, i.e. `uint8_t myVal = 255; myVal &= 0xFC;`. You then need to shift the value two bits to the right to get the integer part back, being aware of course that you've now divided your range by 4. The largest number you can now represent is 63.75 instead of 255. – enhzflep, Feb 26 '15 at 12:42
Suppose I have an input discrete signal with 1024 sample points on the x-axis (for now, just assume this input signal is coming from some sensor). Each of these sample points has a particular amplitude. Say the sample at time 2 (x-axis) has an amplitude of 3.67 (y-axis). Now I have a variable "int *input;" that takes the sample 2, which in binary is 0000 0100. So basically I want to turn this into 00000.100 by applying U(5,3) to the sample 2 in C++, so that I can perform the interpolation operations on fractions of the input sampling period or time. Hope this clears it up; if not, please ask. – PsychedGuy, Feb 26 '15 at 12:58

user3528438 · Accepted Answer · 2015-02-26T13:12:11.433

0

Short answer: left shift.

Long answer:

Fixed point numbers are stored as integers, usually int, which is the fastest integer type for a particular platform.
Normal integers without fractional bits are usually called Q0, Q.0 or QX.0, where X is the total number of bits of the underlying storage type (usually int).
To convert between different Q.X formats, left or right shift. For example, to convert 5 in Q0 to 5 in Q4, left shift it 4 bits, or multiply it by 16.
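The shifts described above can be sketched as follows. This is only an illustration for unsigned values: the helper names `to_q`, `convert_q`, `q_int_part` and `q_frac_part` are made up here and do not come from any library, and the sketch assumes the number of fractional bits is less than 32.

```cpp
#include <cstdint>

// Hypothetical helpers for illustration only (unsigned values, frac_bits < 32).

// Convert a plain integer (Q0) to a format with `frac_bits` fractional bits.
constexpr std::uint32_t to_q(std::uint32_t value, unsigned frac_bits) {
    return value << frac_bits;  // same as value * 2^frac_bits
}

// Re-express a raw value that has `from_bits` fractional bits in a format
// that has `to_bits` fractional bits.
constexpr std::uint32_t convert_q(std::uint32_t raw, unsigned from_bits, unsigned to_bits) {
    return (to_bits >= from_bits) ? (raw << (to_bits - from_bits))
                                  : (raw >> (from_bits - to_bits));  // drops low bits
}

// Integer part: everything to the left of the binary point.
constexpr std::uint32_t q_int_part(std::uint32_t raw, unsigned frac_bits) {
    return raw >> frac_bits;
}

// Fractional part: the low `frac_bits` bits, in units of 2^-frac_bits.
constexpr std::uint32_t q_frac_part(std::uint32_t raw, unsigned frac_bits) {
    return raw & ((1u << frac_bits) - 1u);
}
```

For example, `to_q(5, 4)` gives 80 (5 times 16). Reading the 8-bit pattern 0000 0100 as U(5,3) gives `q_int_part(4, 3) == 0` and `q_frac_part(4, 3) == 4`, i.e. 4/8 = 0.5. The stored bits never change; only the interpretation of where the binary point sits does.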
Usually it's useful to find or write a small fixed point library that does basic calculations, like a*b>>q and (a<<q)/b. Because you will do Q.X=Q.Y*Q.Z and Q.X=Q.Y/Q.Z a lot and you need to convert formats when doing calculations. As you may have observed, using normal * operator will give you Q.(X+Y)=Q.X*Q.Y, so in order to fit the result into Q.Z format, you need to right shift the result by (X+Y-Z) bits.
Division is similar: the standard / operator gives you Q.(X-Y)=Q.X/Q.Y, and to get the result in Q.Z format you left shift the dividend by (Z-X+Y) bits before the division. What's different is that division is an expensive operation, and it's not trivial to write a fast one from scratch.
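A minimal sketch of the multiply and divide rules, assuming a 64-bit intermediate type is available. The names `q_mul` and `q_div` are hypothetical, the examples use non-negative values only, and the sketch assumes X+Y >= Z for multiplication and Z+Y >= X for division.

```cpp
#include <cstdint>

// Hypothetical helpers for illustration only.

// Multiply a (Q.x) by b (Q.y) and return the product in Q.z.
// The raw product has x + y fractional bits, so shift right by (x + y - z).
constexpr std::int32_t q_mul(std::int32_t a, unsigned x,
                             std::int32_t b, unsigned y, unsigned z) {
    return static_cast<std::int32_t>(
        (static_cast<std::int64_t>(a) * b) >> (x + y - z));
}

// Divide a (Q.x) by b (Q.y) and return the quotient in Q.z.
// The raw quotient would have x - y fractional bits, so scale the
// dividend up by 2^(z - x + y) before dividing.
constexpr std::int32_t q_div(std::int32_t a, unsigned x,
                             std::int32_t b, unsigned y, unsigned z) {
    return static_cast<std::int32_t>(
        static_cast<std::int64_t>(a) * (std::int64_t{1} << (z + y - x)) / b);
}
```

With everything in Q16, `q_mul(6553, 16, 262144, 16, 16)` returns 26212, which is 0.1 times 4 after truncation. With mixed formats, 1.5 in Q4 (raw 24) times 2.0 in Q2 (raw 8) returned in Q3 is raw 24, i.e. 3.0.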
Be aware of the double-word support of your platform; it will make your life a lot easier. With double-word arithmetic, the result of a*b can be twice the size of a or b, so you don't lose range by doing a*b>>c. Without double-word support, you have to limit the input range of a and b so that a*b doesn't overflow. This is not obvious when you first start, but soon you will find you need more fractional bits or range to get the job done, and you will finally need to dig into the reference manual of your processor's ISA.
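To illustrate why the wider intermediate matters, here is a small sketch comparing a Q16 multiply done purely in 32 bits against one that widens to 64 bits first. The function names are hypothetical, unsigned types are used so that the 32-bit overflow is a well-defined wraparound, and the sketch assumes `unsigned int` is 32 bits wide.

```cpp
#include <cstdint>

// Hypothetical helpers for illustration only; assumes 32-bit unsigned int.

// Q16 multiply using only 32-bit arithmetic: the product wraps modulo 2^32
// before the shift, so large operands give a wrong result.
constexpr std::uint32_t mul_q16_narrow(std::uint32_t a, std::uint32_t b) {
    return (a * b) >> 16;
}

// Q16 multiply with a 64-bit ("double word") intermediate: the full product
// is kept, and only the final Q16 result is narrowed back to 32 bits.
constexpr std::uint32_t mul_q16_wide(std::uint32_t a, std::uint32_t b) {
    return static_cast<std::uint32_t>(
        (static_cast<std::uint64_t>(a) * b) >> 16);
}
```

For 3.0 times 200.0 in Q16 (raw values `3u << 16` and `200u << 16`), the narrow version returns 0 because the product is an exact multiple of 2^32, while the wide version returns `600u << 16`, i.e. 600.0.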

example:

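    float a = 0.1f;                                   // 0.1
    int aQ16 = (int)(a * 65536);                      // 0.1 in Q16 format = 6553 (6553.6 truncated)
    int bQ16 = 4 << 16;                               // 4 in Q16 format = 262144
    int cQ16 = (int)(((long long)aQ16 * bQ16) >> 16); // product in Q32, shifted back to Q16
    // result = 0.399963378906250 in Q16 = 26212,
    // not 0.4 in Q16 = 26214, because of truncation error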

edited Feb 26 '15 at 13:12

answered Feb 26 '15 at 13:06

user3528438


I get that "a" is a floating-point value and that by multiplying by 2^16 you're converting the float to Q16 format. Finally you multiply a*b and shift right by 16 to get the result. But can you please elaborate on the "int bQ16 = 4<<16" in your example? – PsychedGuy Mar 03 '15 at 03:28
@DigitalGeeK It's just converting 4Q0 to 4Q16, by left shifting by 16, or multiplying by 2^16. The code basically does 4 things: 1) converting a float to Q16; 2) converting an int from Q0 to Q16; 3) calculating the product in Q32; 4) converting Q32 to Q16. – user3528438 Mar 03 '15 at 12:46
Okay, in the 2nd) 0.1*65536 = 6553.6 (the 0.6 will be dropped, right, since it's an integer?), 3rd) 4*65536 = 262144 and 4th) 6553.6*262144/65536 = 26214.4. Why is the result you mentioned 0.3999699789...., and after truncating, isn't the value supposed to be 26214 without the fractional part, i.e. 0.4? Another thing: how can I represent a time such as 0.001 seconds in fixed-point representation? If it is assigned to an integer data type, it will only hold 0, and if I convert 0.001 to Q8.8, i.e. 0.001*256 = 0.256, the int data type will again take this as 0, right? So basically the fractional part is lost. – PsychedGuy Mar 06 '15 at 06:54
Sorry if my doubts are lame, I'm just trying to figure out how this stuff works so that I can use it in my algorithm. – PsychedGuy Mar 06 '15 at 06:57
@DigitalGeeK 0.1*65536 = 6553.6 and yes, 0.6 is lost. Depending on your rounding mode, it can be 6553 or 6554. In this example, it's "rounding to smaller integer", so it's 6553. Then later, you are not doing 6553.6*262144/65536 = 26214.4, instead, it's 6553.0*262144/65536 = 26212.0 . You see that, every time you do it, you lose some accuracy. Then comes your other question on "small numbers not representable in Q8.8": try some large Q like Q16.16 or Q8.24. The guideline is, after you have figured out the range, you can design your Q format for that variable to maximize accuracy. – user3528438 Mar 06 '15 at 12:09
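The point about small numbers and larger Q formats can be checked with a short sketch. The helper name `real_to_q` is hypothetical; it truncates toward zero, matching the "rounding to smaller integer" behaviour described above, and it assumes a non-negative input whose scaled value fits in 32 bits.

```cpp
#include <cstdint>

// Hypothetical helper for illustration only: convert a non-negative real
// value to a raw fixed-point value with `frac_bits` fractional bits,
// truncating whatever does not fit.
constexpr std::uint32_t real_to_q(double value, unsigned frac_bits) {
    return static_cast<std::uint32_t>(
        value * static_cast<double>(std::uint64_t{1} << frac_bits));
}
```

With this, `real_to_q(0.001, 8)` is 0 (the value is lost entirely in Q8.8), `real_to_q(0.001, 16)` is 65 (about 0.00099182 when read back as 65/65536), and `real_to_q(0.001, 24)` is 16777 (about 0.00099999 when read back as 16777/2^24). More fractional bits cost integer range, which is why the range has to be worked out first.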

score 0 · Answer 2 · edited Jun 20 '20 at 09:12

If this is your question:

Q. Should I define my fixed-binary-point integer as a template, `U<int a, int b>(int number)`, or not, `U(int a, int b)`

I think your answer to that is: "Do you want to define operators that take two fixed-binary-point integers? If so make them a template."

The template is just a little extra complexity if you're not defining operators. So I'd leave it out.

But if you are defining operators, you don't want to be able to add U<4, 4> and U<6, 2>. What would you define your result as? The templates will give you a compile time error should you try to do that.
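A minimal sketch of that idea, under the assumption that the value is stored in a single 8-bit byte. The struct layout and the choice to wrap on overflow are illustrative decisions made here, not something taken from the question.

```cpp
#include <cstdint>

// Hypothetical type for illustration only: an unsigned fixed-point value with
// I integer bits and F fractional bits, stored in one 8-bit byte.
template <int I, int F>
struct U {
    static_assert(I + F == 8, "this sketch stores exactly 8 bits");
    std::uint8_t raw;  // the bits as stored; the real value is raw / 2^F
};

// Addition is only defined when both operands share the same format.
// Mixing formats, e.g. U<4, 4>{1} + U<6, 2>{1}, fails template argument
// deduction and therefore does not compile.
template <int I, int F>
constexpr U<I, F> operator+(U<I, F> a, U<I, F> b) {
    return U<I, F>{static_cast<std::uint8_t>(a.raw + b.raw)};  // wraps modulo 256
}
```

For example, `U<6, 2>{4} + U<6, 2>{6}` yields a raw value of 10, which is 1.0 + 1.5 = 2.5 in U(6,2).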

edited Jun 20 '20 at 09:12

Community


answered Feb 26 '15 at 13:09

Jonathan Mee


I will be using a template to make it suitable for any incoming data types. Anyways, thank you for the direction. – PsychedGuy Mar 03 '15 at 03:31
