5

I need to convert a float to Q31 fixed-point, where Q31 means 1 sign bit, 0 bits for the integer part, and 31 bits for the fractional part. This means that Q31 can only represent numbers in the range [-1, 1 - 2^-31], roughly [-1, 0.9999999995].

By definition, converting from float to fixed-point multiplies the value by 2^N, where N is the number of fractional bits, in this case 31.

However, this code confuses me. It doesn't look right, but it works:

#define q31_float_to_int(x) ( (int) ( (float)(x)*(float)0x7FFFFFFF ) )

And it seems to work OK. For example:

int a = q31_float_to_int(0.5f); 

gives 0x40000000 in hex, which is correct.

Why is the multiplication here done with 2^31 - 1, and not just 2^31?

Danijel
  • 2
    `(float)0x7FFFFFFF` is `2147483648.00000`: http://ideone.com/mawlXx . Even after casting to `unsigned` the value holds: http://ideone.com/7WMeRE – mch Jan 18 '17 at 09:44
  • Hmm?? How come 0x7FFFFFFF ends up as 2147483648 and not 2147483647?? – Danijel Jan 18 '17 at 09:54
  • 4
    Because 2147483647 is not representable as a `float` and so the nearest representable number will be taken, which is 2147483648. – mch Jan 18 '17 at 10:00
  • Any idea why the above code didn't just use `(float)0x80000000` instead of `(float)0x7FFFFFFF`? – Danijel Jan 18 '17 at 10:15
  • Perhaps the author was trying to avoid overflow and/or wished to express the value `1.0` on architectures where `INT_MAX` is `0x7FFFFFFF`. Unfortunately, this solution is unlikely to be successful *or* correct. – John McFarlane Jan 18 '17 at 19:11
  • Would it help to make `float` a `double`, as in `(double)(x)*(double)0x7FFFFFFFULL`? This would require casting the input `x` from `float` to `double` every time. – Danijel Jun 20 '17 at 14:36
  • This can help https://stackoverflow.com/q/71361635/7224685 – mohammadsdtmnd Mar 06 '22 at 04:37
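
The rounding that mch describes in the comments above can be checked with a few lines of C. This is a minimal sketch, assuming IEEE 754 single precision and the default round-to-nearest mode; the function names `q31_scale_from_question` and `q31_from_question` are hypothetical and only wrap the arithmetic of the macro from the question.

```c
#include <stdint.h>

/* The constant from the question's macro, converted the same way.
   2147483647 needs 31 significant bits, but a float has only 24,
   so the conversion rounds to the nearest float, which is 2^31. */
static float q31_scale_from_question(void)
{
    return (float)0x7FFFFFFF;
}

/* Same arithmetic as q31_float_to_int, written as a function.
   Only safe for |x| < 1: an input of 1.0f produces 2^31, which does
   not fit in int32_t, so the cast would be undefined behaviour. */
static int32_t q31_from_question(float x)
{
    return (int32_t)(x * q31_scale_from_question());
}
```

Under those assumptions the macro effectively multiplies by 2^31, which is why an input of 0.5f yields exactly 0x40000000.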

2 Answers

3

The code above is not a good way to convert from float to fixed point. I am guessing whoever wrote it used the scale factor 0x7FFFFFFF to avoid an overflow when the input is 1.0. The correct scaling factor is 2^31, not 2^31 - 1. Note that there are also precision issues when converting a float (with 24 bits of precision) to a Q1.31 (with 31 bits of precision). Consider saturating the input data before the multiplication:

const float Q31_MAX_F =  0x0.FFFFFFp0F;
const float Q31_MIN_F = -1.0F;
float clamped = fmaxf(fminf(input, Q31_MAX_F), Q31_MIN_F);

The code above clamps the input to the range [-1.0, 1.0). The constant Q31_MAX_F is 1 - 2^-24, the largest float below 1.0 given 24 bits of precision, and Q31_MIN_F is -1. Then you can multiply clamped by 2^31 or, even better, use scalbnf or ldexpf:

int result = (int) scalbnf(clamped, 31);

And if you want rounding:

int result = (int) roundf(scalbnf(clamped, 31));
Ayan Shafqat
1

I recently had to use STM32's CORDIC for hardware-accelerated trigonometry. Unsatisfied with the accepted answer (and everything else I found on the web), I came up with a simpler (but slightly less precise) algorithm for Q31/F32 conversion:

#define Q31_SCALAR (float)M_PI
#define F32_TO_Q31(F) (int32_t)((fmodf((F)+Q31_SCALAR,2.f*Q31_SCALAR) + ((F)<-Q31_SCALAR?Q31_SCALAR:-Q31_SCALAR)) * ((float)(INT32_MAX+1u)/Q31_SCALAR))
#define Q31_TO_F32(Q) ((int32_t)(Q) / (float)(INT32_MAX+1u))
#define CORDIC_COS_SIN(RAD,COS_VAR,SIN_VAR) { hcordic.Instance->WDATA = F32_TO_Q31(RAD); \
  (COS_VAR) = Q31_TO_F32(hcordic.Instance->RDATA); (SIN_VAR) = Q31_TO_F32(hcordic.Instance->RDATA); }

This will map floats from [-π, +π] to approximately [INT32_MIN, INT32_MAX[. If the input value is out of range, it will be "wrapped" back into that range (e.g. -5.9π will be treated as 0.1π).

If instead you want to map [-1, +1] as per the original question, simply use the following:

#define Q31_SCALAR 1.f
AgentRev
  • 1
    `M_PI` is usually a `double` constant, so `(F)/Q31_SCALAR` is a `double` quotient followed by a `double` sum and then conversion to a `float` for the `fmodf()` call. If you want `float` math, suggest `#define 0x1.921fb54442d1846ap+1f` – chux - Reinstate Monica Jun 21 '22 at 02:19
  • "This will map floats from [-π, +π] to [INT32_MIN, INT32_MAX]" --> I did not see that when running code. Values near `M_PI` became `INT_MIN`. What was the result of your edge case tests? – chux - Reinstate Monica Jun 21 '22 at 02:21
  • Once done with `float` operations, this code unnecessarily loses precision with `(F)/Q31_SCALAR+1` when `F` is a small-magnitude value. Also note the loss of precision with `(float)INT32_MAX` lopping off the 8 least significant bits. Overall, the code has conversion quality issues. – chux - Reinstate Monica Jun 21 '22 at 02:26
  • I think the goal should shift from "[-π, +π] to [INT32_MIN, INT32_MAX]" to "[-π, +π) to [INT32_MIN, INT32_MAX + 1LL)" – chux - Reinstate Monica Jun 21 '22 at 02:30
  • The precision losses and weird edge cases are all intended. My objective was to trim off cycles while still reaching a reasonably correct result. – AgentRev Jun 21 '22 at 02:35
  • OK. Yet I suspect good precision at sufficient speed is possible along the lines of `((int) (fmodf((F), (float)M_PI)*((INT_MAX+1LL)/ (float) M_PI)))`, or with `lround()` instead of the cast. Good luck. – chux - Reinstate Monica Jun 21 '22 at 04:15
  • @chux-ReinstateMonica I settled on the last edit from my answer. I need the `2.f` and ternary to ensure the input is properly wrapped from -π to +π for every 2π interval. – AgentRev Jun 21 '22 at 20:43