I have implemented 32-bit fixed point division on the TI C5515 DSP using an iterative method detailed in TI's DSPLIB. It's a 16-bit DSP, and this function is a bit of a bottleneck with some repeated 32-bit calculations, so every instruction counts.
The first part of the function works out an initial estimate for the reciprocal of the denominator. It's a linear estimate that computes ±3 - 2x (but in fixed point). The sign on the 3 is taken from the sign of the denominator. Note that the denominator is never zero.
I currently have (den is an int32_t):
int32_t offset = den > 0 ? 0x60000000 : -0x60000000;
This compiles to (AC0 is the offset, AC3 is the denominator):
MOV #-24576 << #16, AC0
XCCPART AC3 > #0 ||
MOV #24576 << #16, AC0
The result is used like this, in case it helps (_l[s]shl is a [saturating] left shift, _lssub is a saturating subtract):
int32_t est = _lsshl(_lssub(offset, _lshl(den, -1)), 1);
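For anyone without the TI toolchain, here is a rough portable model of that line. It is only a sketch under assumptions: sat32 and initial_estimate are hypothetical names, the saturating intrinsics are assumed to clamp to the int32_t range, and _lshl(den, -1) is assumed to behave as an arithmetic right shift by one.

```c
#include <stdint.h>

/* Hypothetical helper: clamp a 64-bit intermediate to the int32_t range,
   standing in for the saturation done by _lssub and _lsshl. */
static int32_t sat32(int64_t v)
{
    if (v > INT32_MAX) return INT32_MAX;
    if (v < INT32_MIN) return INT32_MIN;
    return (int32_t)v;
}

/* Hypothetical model of the initial estimate; den must be non-zero. */
static int32_t initial_estimate(int32_t den)
{
    int32_t offset = den > 0 ? 0x60000000 : -0x60000000;

    /* Assumption: right-shifting a negative value is arithmetic
       (implementation-defined in C). */
    int32_t half = den >> 1;

    /* Saturating subtract, then saturating left shift by one. */
    int32_t diff = sat32((int64_t)offset - half);
    return sat32((int64_t)diff * 2);
}
```

With these assumptions, a denominator of 0x40000000 gives a result that saturates to INT32_MAX, while -0x40000000 lands exactly on INT32_MIN without saturating.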
Can I remove the branch (XCCPART) and reduce the number of instructions even further? I would be happy to use bitwise operations to do so, but I cannot figure out how (the C5515 uses two's complement, so simply copying the sign bit onto the constant won't work). It doesn't have to be portable (I use intrinsics elsewhere in the function); implementation-defined behaviour is fine, but undefined behaviour is not.
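To make the kind of bitwise form I mean concrete, here is a sketch of one candidate. The name offset_branchless is hypothetical, and it assumes that right-shifting a negative int32_t propagates the sign bit (implementation-defined, not undefined). I have not checked whether the C5515 compiler turns this into fewer instructions than the conditional version.

```c
#include <stdint.h>

/* Hypothetical helper: same result as
   den > 0 ? 0x60000000 : -0x60000000, for non-zero den. */
static int32_t offset_branchless(int32_t den)
{
    /* Assumption: arithmetic right shift, so mask is 0 for positive
       den and -1 (all ones) for negative den. */
    int32_t mask = den >> 31;

    /* Two's complement conditional negate: (x ^ -1) - (-1) == -x,
       and (x ^ 0) - 0 == x. Neither step can overflow here. */
    return (0x60000000 ^ mask) - mask;
}
```

The XOR with an all-ones mask is a bitwise NOT, and subtracting -1 adds the one needed to complete the negation, which is why plain sign-bit copying alone is not enough.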