Conversion between 64-bit and 32-bit fixed-point numbers

Question

How do I convert data from Q33.31 format to Q2.30 format? I know that shift operators are enough when the input and output have the same bit size, but how is the conversion done when the sizes differ?

Can your data be converted without losing most significant bits? (Assuming that one lowest bit loss is not significant.) — Jongware, Mar 11 '20 at 10:25
No. Some part of the data will be lost. That's not a problem — rkc, Mar 11 '20 at 10:28
Here, I am simply adding two Q1.31 numbers, and I want the output in Q2.30 format. To do this, I store the result in a 64-bit variable (Q33.31) and then try to convert it to Q2.30. But how do I convert it? If I left-shift the output by 31 bits, the result is in Q2.62 format. Does right-shifting that by 32 bits then give Q34.30 / Q2.30? Is this the correct process? — rkc, Mar 11 '20 at 13:14
You don't need a 64-bit addition in this case. Everything can be done in 32-bit math. See my answer. — phuclv, Mar 11 '20 at 16:00

phuclv · Accepted Answer · 2020-03-11T15:59:59.327

The key here is just to shift the radix point to the correct place. Take a simple example, converting from Q9.7 format to Q2.6:

in  9 8 7 6 5 4 3 2 1.1 2 3 4 5 6 7
out                 2 1.1 2 3 4 5 6

As you can see, the output's radix point sits one bit closer to the least significant end than the input's (6 fractional bits instead of 7), so we need to shift right by 1 to put it in the right place. You can also think of it like this: the output's fractional part has 1 bit fewer, so we right-shift by 1 bit to truncate it from 7 bits to 6 bits. The 7 high bits of the integer part are truncated automatically in C when you assign to the narrower unsigned type. That's equivalent to

uint8_t out = in >> 1;
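
A minimal sketch of this Q9.7 to Q2.6 truncation, assuming unsigned values held in `uint16_t` and `uint8_t`. The function and example names are illustrative and not part of the original answer.

```c
#include <assert.h>
#include <stdint.h>

/* Convert an unsigned Q9.7 value (16 bits) to unsigned Q2.6 (8 bits).
   The shift drops one fractional bit; the cast keeps the low 8 bits,
   which discards the upper seven integer bits. */
static inline uint8_t q9_7_to_q2_6(uint16_t in)
{
    return (uint8_t)(in >> 1);
}

/* Worked examples (hypothetical helper, for illustration only). */
void q9_7_to_q2_6_examples(void)
{
    /* 1.5 in Q9.7 is 1.5 * 128 = 192; in Q2.6 it is 1.5 * 64 = 96. */
    assert(q9_7_to_q2_6(192) == 96);
    /* 5.0 in Q9.7 is 640; only the low two integer bits survive,
       so the result is 1.0 in Q2.6, which is 64. */
    assert(q9_7_to_q2_6(640) == 64);
}
```

The second example shows that the narrowing cast silently wraps values that do not fit in two integer bits, so a range check is needed if that matters.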

Similarly, to convert from Q33.31 to Q2.30 you do the same: q2_30 = q33_31 >> 1

However, to get a more accurate result you need a rounding step. There are many rounding methods, but the simplest is to round to nearest by checking whether the discarded part is at least 0.5 ULP of the output. Just as in decimal we check whether the first truncated digit is >= 5, in binary we check the last bit that was shifted out and add it back to the result, like this

uint32_t q2_30 = (uint32_t)((q33_31 >> 1) + (q33_31 & 1));
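
To make the difference between truncation and rounding concrete, here is a small sketch assuming an unsigned Q33.31 input held in a `uint64_t`. The two function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Truncating conversion: drop the lowest fractional bit. */
uint32_t q33_31_to_q2_30_trunc(uint64_t x)
{
    return (uint32_t)(x >> 1);
}

/* Round-half-up conversion: add back the bit that was shifted out.
   The cast still discards every integer bit above the two that Q2.30
   keeps, so a result that rounds up past the largest Q2.30 value
   wraps around to 0. */
uint32_t q33_31_to_q2_30_round(uint64_t x)
{
    return (uint32_t)((x >> 1) + (x & 1));
}

/* Worked examples (hypothetical helper, for illustration only). */
void q33_31_to_q2_30_examples(void)
{
    /* 3 * 2^-31 is 1.5 ULP of Q2.30: truncates to 1, rounds to 2. */
    assert(q33_31_to_q2_30_trunc(3) == 1);
    assert(q33_31_to_q2_30_round(3) == 2);
    /* 1.0 is 2^31 in Q33.31 and 2^30 in Q2.30; both methods agree. */
    assert(q33_31_to_q2_30_round((uint64_t)1 << 31) == 0x40000000u);
}
```

Neither function saturates. If out-of-range inputs are possible, the caller would have to clamp before converting.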

Edit

There's absolutely no need to widen to 64 bits and truncate when you just want the sum of two Q1.31 numbers. Just convert them to Q2.30 using the above method, add them, and round afterwards

uint32_t A2_30 = A1_31 >> 1; // types must be unsigned so that the shifts are logical
uint32_t B2_30 = B1_31 >> 1; // instead of arithmetic

// if only one of the values is 1 then their sum is 0.5 ULP which will be rounded to 1
uint32_t carry = (A1_31 & 1) | (B1_31 & 1); // if both of them are 1 then sum = 1 ULP

uint32_t sum = A2_30 + B2_30 + carry; // result is in Q2.30
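
As a sanity check, the 32-bit-only addition can be compared with a 64-bit reference that widens, adds, and then rounds half up while shifting. This is a sketch assuming unsigned Q1.31 inputs; both function names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Sum of two unsigned Q1.31 values as Q2.30, using 32-bit math only. */
uint32_t add_q1_31_to_q2_30(uint32_t a, uint32_t b)
{
    uint32_t a2_30 = a >> 1;              /* Q1.31 -> Q2.30, truncated */
    uint32_t b2_30 = b >> 1;
    uint32_t carry = (a & 1u) | (b & 1u); /* dropped bits, rounded half up */
    /* Cannot overflow: at most 0x7FFFFFFF + 0x7FFFFFFF + 1. */
    return a2_30 + b2_30 + carry;
}

/* Reference: widen to 64 bits, add, then round half up while shifting. */
uint32_t add_q1_31_to_q2_30_ref(uint32_t a, uint32_t b)
{
    uint64_t sum = (uint64_t)a + (uint64_t)b; /* Q2.31, 33 bits */
    return (uint32_t)((sum >> 1) + (sum & 1u));
}

/* Worked examples (hypothetical helper, for illustration only). */
void add_q1_31_examples(void)
{
    /* One dropped bit: 0.5 ULP rounds up to 1. */
    assert(add_q1_31_to_q2_30(1, 0) == add_q1_31_to_q2_30_ref(1, 0));
    /* Two dropped bits: exactly 1 ULP. */
    assert(add_q1_31_to_q2_30(1, 1) == 1);
    /* Largest inputs still fit in 32 bits. */
    assert(add_q1_31_to_q2_30(0xFFFFFFFFu, 0xFFFFFFFFu) == 0xFFFFFFFFu);
}
```

The two versions agree for every input pair: the dropped bits contribute `a & b` to the shifted sum and `a ^ b` to the rounding bit, and those two terms add up to `a | b`.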

score 2 · Answer 2 · answered Mar 11 '20 at 11:47

In a comment on @goodvibration's answer you state that you're adding two Q1.31 numbers. Given that, you know that your result is representable as Q2.31, so to convert your Q2.31 number to Q2.30 you just need to shift the result right by one bit:

uint32_t convert_q231_q230(uint64_t x)
  {
  return (uint32_t) (x >> 1);
  }
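
For context, the whole path from two Q1.31 inputs to a Q2.30 result might look like the sketch below. It assumes unsigned inputs and a truncating shift with no rounding; the function names are illustrative.

```c
#include <assert.h>
#include <stdint.h>

/* Add two unsigned Q1.31 values through a 64-bit intermediate and
   return the sum as Q2.30 (truncating the lowest bit). */
uint32_t add_q1_31_via_64(uint32_t a, uint32_t b)
{
    /* Widen before adding: a + b evaluated in 32 bits would wrap
       and lose the carry into bit 32. */
    uint64_t sum_q2_31 = (uint64_t)a + b;
    /* The sum needs at most 33 bits, so after dropping one fractional
       bit it fits in 32 bits as Q2.30. */
    return (uint32_t)(sum_q2_31 >> 1);
}

/* Worked examples (hypothetical helper, for illustration only). */
void add_q1_31_via_64_examples(void)
{
    /* 1.0 + 1.0 = 2.0: 2^31 + 2^31 in Q1.31 gives 2^31 in Q2.30. */
    assert(add_q1_31_via_64(0x80000000u, 0x80000000u) == 0x80000000u);
    /* A single low bit is lost by the truncating shift. */
    assert(add_q1_31_via_64(1, 0) == 0);
}
```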

Bob Jarvis - Слава Україні

If we store the result in a 32-bit variable after adding two Q1.31 values (assuming overflow happens), the result will be corrupted, right? How can you tell that the result will be in Q2.31 format? – rkc Mar 11 '20 at 13:51
You said you're storing the result of adding the two Q1.31 values in a 64-bit variable as a Q33.31 value. You should cast the Q1.31 vals to 64-bit *prior* to performing the addition. As far as "how do I know it will be Q2.31 format" - adding two unsigned 32 bit integers can only overflow by one bit - thus, you'd get a Q2.31 value (33 bits). – Bob Jarvis - Слава Україні Mar 11 '20 at 14:24
Alternatively - you could shift your Q1.31 values to the right by one bit prior to adding them, resulting in Q2.30 values. Then add these two Q2.30 values, which you know have at most a 1 in the high-order two bits, and your result will be Q2.30 without ever having to go through a 64-bit conversion. This comes at the cost of a possible loss of precision if you've shifted a one out of the low-order bits of the original values. – Bob Jarvis - Слава Україні Mar 11 '20 at 14:35

goodvibration · Answer 3 · score 0 · 2020-03-11T11:08:28.297

How about this:

uint32_t convert(uint64_t x)
{
    uint32_t hi = (uint32_t)(x >> 32);
    uint32_t lo = (uint32_t)(x);
    if (hi >= (1 << 2) || lo >= (1 << 30)) {
        // handle input-too-large-or-too-accurate error and exit
    }
    return (hi << 30) | lo;
}

As an alternative to handling erroneous input in the if statement (if you don't care about possible information loss), you can simply return (hi << 30) | ((lo << 2) >> 2);.

edited Mar 11 '20 at 11:08

answered Mar 11 '20 at 10:32

Hi, can you explain what you are doing in the if condition? – rkc Mar 11 '20 at 11:01
@rkc: I'm not doing anything. I left it for you to determine, because you haven't quite specified the requirement for this scenario in your question (i.e., obviously you cannot fit every 64-bit combination into a 32-bit storage unit, so that `if` catches all those cases where you'd lose information during this conversion). – goodvibration Mar 11 '20 at 11:04
Hi, I'm just adding two Q1.31 numbers and storing the result in a 64-bit number. Later I want to convert it to Q2.30 format again. – rkc Mar 11 '20 at 11:07
@rkc: I added an alternative for you to just ignore possible information-loss; see updated answer. – goodvibration Mar 11 '20 at 11:09
Converting between a “.31” format and a “.30” format is primarily shifting 1 bit right (along with rounding and handling overflow). This code shifts `hi` right 2 bits (net, first shifting right 32 then left 30) and does not shift `lo` at all. How does that make any sense? – Eric Postpischil Mar 11 '20 at 12:11
@EricPostpischil: Shifting it one bit as is may lead the highest bit (when set to 1) to contaminate the lowest bit of the integer part when ORing the two parts. But you're right in the fact that I've eliminated the most significant bit instead of the least significant one, which I've tried to avoid for cases where no bit needs to be eliminated. In any case, I added the second part of the answer later on. Do you see any issues with the first part of it (apart from not saying how data-loss should be handled, since there's no requirement specification for that in the question)? – goodvibration Mar 11 '20 at 12:13
If the input (parameter `x`) is a single Q33.31 number, then shifting all of it right 1 bit does not contaminate the low part; it moves a desired bit into the low bit. If the input is a 64-bit number that contains two 32-bit Q2.30 numbers (that the OP wants added), then the net shift of the high part by two bits is wrong; there should be no shift (relative to the 32-bit word) (and the parts would be added, not ORed). Either way, this answer is wrong. – Eric Postpischil Mar 11 '20 at 12:16
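
The objection can be illustrated by running both conversions on a few single Q33.31 values. This is a sketch assuming the first interpretation (one unsigned Q33.31 input); `convert_shift` and `convert_split` are hypothetical names for the plain shift and for the "ignore information loss" variant from the answer.

```c
#include <assert.h>
#include <stdint.h>

/* Plain one-bit right shift: Q33.31 -> Q2.30 (truncating). */
uint32_t convert_shift(uint64_t x)
{
    return (uint32_t)(x >> 1);
}

/* The split-and-OR variant from the answer under discussion. */
uint32_t convert_split(uint64_t x)
{
    uint32_t hi = (uint32_t)(x >> 32);
    uint32_t lo = (uint32_t)x;
    return (hi << 30) | ((lo << 2) >> 2);
}

/* Worked examples (hypothetical helper, for illustration only). */
void convert_comparison_examples(void)
{
    uint64_t one     = (uint64_t)1 << 31; /* 1.0  in Q33.31 */
    uint64_t quarter = (uint64_t)1 << 29; /* 0.25 in Q33.31 */

    /* The shift maps 1.0 to 2^30, which is 1.0 in Q2.30. */
    assert(convert_shift(one) == 0x40000000u);
    /* The split variant clears bit 31 of lo, so 1.0 becomes 0. */
    assert(convert_split(one) == 0);
    /* 0.25 keeps its raw bit pattern, which reads as 0.5 in Q2.30. */
    assert(convert_split(quarter) == 0x20000000u);
    assert(convert_shift(quarter) == 0x10000000u);
}
```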
