Mapping [-1,+1] floats to Q31 fixed-point

Question

I need to convert a float to Q31 fixed-point, where Q31 means 1 sign bit, 0 bits for the integer part, and 31 bits for the fractional part. This means that Q31 can only represent numbers in the range [-1, 1 - 2^-31], roughly [-1, 0.9999999995].

By definition, when converting from float to fixed-point, a multiplication by 2^N is done, where N is the fractional part size, in this case 31.

However, this code confuses me. It doesn't look right:

#define q31_float_to_int(x) ( (int) ( (float)(x)*(float)0x7FFFFFFF ) )

And it seems to work OK. For example:

int a = q31_float_to_int(0.5f);

gives Hex: 0x40000000, which is OK.

Why is the multiplication here done with 2^31 - 1, and not just 2^31?

`(float)0x7FFFFFFF` is `2147483648.00000`: http://ideone.com/mawlXx . Even after casting to `unsigned` the value holds: http://ideone.com/7WMeRE — mch, Jan 18 '17 at 09:44
Hmm?? How come 0x7FFFFFFF ends up as 2147483648 and not 2147483647?? — Danijel, Jan 18 '17 at 09:54
Because 2147483647 is not representable as a `float` and so the nearest representable number will be taken, which is 2147483648. — mch, Jan 18 '17 at 10:00
Any idea why the above code didn't just use `(float)0x80000000` instead of `(float)0x7FFFFFFF`? — Danijel, Jan 18 '17 at 10:15
Perhaps the author was trying to avoid overflow and/or wished to express the value `1.0` on architectures where `INT_MAX` is `0x7FFFFFFF`. Unfortunately, this solution is unlikely to be successful *or* correct. — John McFarlane, Jan 18 '17 at 19:11
Would it help to make `float` a `double`, i.e. `(double)(x)*(double)0x7FFFFFFFULL`? This would require casting input `x` from `float` to `double` every time. — Danijel, Jun 20 '17 at 14:36
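
The rounding described in the comments above can be checked with a minimal sketch like the following. It assumes an IEEE-754 binary32 `float` and the default round-to-nearest mode; the function names are hypothetical and only for illustration.

```c
#include <math.h>
#include <stdint.h>

/* Assumes IEEE-754 binary32 float and round-to-nearest. */

/* Returns 1 if (float)0x7FFFFFFF was rounded up to 2^31. */
static int int32_max_rounds_to_2_pow_31(void) {
    float f = (float)INT32_MAX;   /* 2147483647 needs 31 significant bits */
    return f == 2147483648.0f;    /* float keeps only 24, so it rounds to 2^31 */
}

/* Largest float strictly below 2^31; adjacent floats here are 128 apart. */
static float largest_float_below_2_pow_31(void) {
    return nextafterf(2147483648.0f, 0.0f);
}
```

Under these assumptions the macro in the question effectively multiplies by 2^31, which is consistent with `0.5f` producing `0x40000000`. It also means an input of `1.0f` yields 2^31, which is out of range for a 32-bit `int`, so the cast is undefined behavior in C.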

Ayan Shafqat · Accepted Answer · 2017-08-12T19:33:50.763

3

The code above is not a good solution for converting from float to fixed point. I am guessing whoever wrote the code used the scale factor of 0x7FFFFFFF to avoid an overflow when the input is 1.0. The correct scaling factor is 2^31 and not 2^31 - 1. Note that there are also precision issues when converting a float (with 24 bits of precision) to a Q1.31 (with 31 bits of precision). Consider saturating the input data before multiplication:

const float Q31_MAX_F =  0x0.FFFFFFp0F;
const float Q31_MIN_F = -1.0F;
float clamped = fmaxf(fminf(input, Q31_MAX_F), Q31_MIN_F);

The code above will clamp input to the range [-1.0, 1.0). The constant Q31_MAX_F is 1 - 2^-24, the largest float below 1.0 given 24 bits of precision, and Q31_MIN_F is -1. Then you can multiply clamped by 2^31, or even better, use scalbnf or ldexpf:

int result = (int) scalbnf(clamped, 31);

And if you want rounding:

int result = (int) roundf(scalbnf(clamped, 31));

edited Aug 12 '17 at 19:33

answered Feb 21 '17 at 16:38


Why use `0x7FFFFF00.p-31F;` rather than the largest `float` under 1? `(0x7FFFFF80.p-31F;)` Better yet, be portable: `Q31_MAX_F = nextafterf(1.0,0.0); Q31_MIN_F = -1.0f;` – chux - Reinstate Monica Feb 21 '17 at 19:40
To _round_, suggest `int32_t result = (int32_t) lround(scalbnf(clamped, 31));` – chux - Reinstate Monica Feb 21 '17 at 19:42
Thanks. Updated according to your comments. – Ayan Shafqat Feb 21 '17 at 19:53
Detail: "constant `Q31_MAX_F` is 1 - (2 ^ -24)". Typical `float` has 24-bits of precision. – chux - Reinstate Monica Feb 21 '17 at 20:05
Thanks for the correction. Typical [IEEE754 single precision floating point](https://en.wikipedia.org/wiki/Single-precision_floating-point_format#IEEE_754_single-precision_binary_floating-point_format:_binary32) has 23 mantissa bits, with an implied MSB, which gives 24 bits of precision. – Ayan Shafqat Feb 22 '17 at 12:05
Also 2^31 is `clamped<<31`. But what is the `p` in the hex representation of a float, and where can I learn more about it? – mohammadsdtmnd Mar 06 '22 at 05:38

AgentRev · Answer 2 · 2022-06-21T20:42:34.213

1

I recently had to use STM32's CORDIC for hardware-accelerated trigonometry, and left unsatisfied with the accepted answer (and everything else I found on the web), I came up with a simpler (but slightly less precise) algorithm for Q31/F32 conversion:

#define Q31_SCALAR (float)M_PI
#define F32_TO_Q31(F) (int32_t)((fmodf((F)+Q31_SCALAR,2.f*Q31_SCALAR) + ((F)<-Q31_SCALAR?Q31_SCALAR:-Q31_SCALAR)) * ((float)(INT32_MAX+1u)/Q31_SCALAR))
#define Q31_TO_F32(Q) ((int32_t)(Q) / (float)(INT32_MAX+1u))
#define CORDIC_COS_SIN(RAD,COS_VAR,SIN_VAR) { hcordic.Instance->WDATA = F32_TO_Q31(RAD); \
  (COS_VAR) = Q31_TO_F32(hcordic.Instance->RDATA); (SIN_VAR) = Q31_TO_F32(hcordic.Instance->RDATA); }

This will map floats from [-π, +π] to approximately [INT32_MIN, INT32_MAX[. If the input value is out of range, it will be "wrapped" back into that range (e.g. -5.9π will be treated as 0.1π).

If instead you want to map [-1, +1] as per the original question, simply use the following:

#define Q31_SCALAR 1.f

edited Jun 21 '22 at 20:42

answered Jun 21 '22 at 01:08


`M_PI` is usually a `double` constant, so `(F)/Q31_SCALAR` is a `double` quotient followed by a `double` sum and then conversion to a `float` for the `fmodf()` call. If you want `float` math, suggest `#define 0x1.921fb54442d1846ap+1f` – chux - Reinstate Monica Jun 21 '22 at 02:19
"This will map floats from [-π, +π] to [INT32_MIN, INT32_MAX]" --> I did not see that when running code. Values near `M_PI` became `INT_MIN`. What was the result of your edge case tests? – chux - Reinstate Monica Jun 21 '22 at 02:21
Once done with `float` operations, this code unnecessarily loses precision with `(F)/Q31_SCALAR+1` when `F` is a small magnitude value. Also note the loss of precision with `(float)INT32_MAX` lopping off the 8 least significant bits. Overall, the code has conversion quality issues. – chux - Reinstate Monica Jun 21 '22 at 02:26
I think the goal should shift from "[-π, +π] to [INT32_MIN, INT32_MAX]" to "[-π, +π) to [INT32_MIN, INT32_MAX + 1LL)" – chux - Reinstate Monica Jun 21 '22 at 02:30
The precision losses and weird edge cases are all intended. My objective was to trim off cycles while still reaching a reasonably correct result. – AgentRev Jun 21 '22 at 02:35
OK. Yet I suspect good precision that is also fast enough is possible along the lines of `((int) (fmodf((F), (float)M_PI)*((INT_MAX+1LL)/ (float) M_PI)))` or `lround()` vs. cast. Good luck. – chux - Reinstate Monica Jun 21 '22 at 04:15
@chux-ReinstateMonica I settled on the last edit from my answer. I need the `2.f` and ternary to ensure the input is properly wrapped from -π to +π for every 2π interval. – AgentRev Jun 21 '22 at 20:43
