Can I transfer sign between integers faster (on the C5515)?

Question

I have implemented 32-bit fixed point division on the TI C5515 DSP using an iterative method detailed in TI's DSPLIB. It's a 16-bit DSP, and this function is a bit of a bottleneck with some repeated 32-bit calculations, so every instruction counts.

The first part of the function works out an initial estimate for the reciprocal of the denominator. It's a linear estimate that does ±3 - 2x (but in fixed point). The sign on the 3 is taken from the sign of the denominator. Note that the denominator is never zero.

I currently have (den is an int32_t):

int32_t offset = den > 0 ? 0x60000000 : -0x60000000;

This compiles to (AC0 is the offset, AC3 is the denominator):

        MOV #-24576 << #16, AC0
        XCCPART AC3 > #0 ||
           MOV #24576 << #16, AC0

The result is used like this, in case it helps (_l[s]shl is a [saturating] left shift, _lssub is a saturating subtract):

int32_t est = _lsshl(_lssub(offset, _lshl(den, -1)), 1);

Can I remove the branch (XCCPART), and reduce the number of instructions even further? I would be happy to use bitwise operations to do so, but I cannot figure out how (C5515 uses two's complement, so sign-bit copying won't work). It doesn't have to be portable (I use intrinsics elsewhere in the function), implementation defined behaviour is fine, but not undefined behaviour.

As a denominator of 0 is not likely, see if `int32_t offset = den >= 0 ? 0x60000000 : -0x60000000;` is better, `>=` vs `>` — chux - Reinstate Monica, Jun 18 '19 at 02:23
@chux The denominator will never be zero, I should have mentioned that. — detly, Jun 18 '19 at 03:36
Why not `den & 0x80000000`? It would test the sign bit only and might be faster than `> 0` (depending on how clever the compiler is). — Matthieu, Jun 18 '19 at 04:31
I'll try both of these suggestions, but I suspect (can't prove) that the `XCCPART` ie. a conditional execute is inhibiting reordering that the compiler might otherwise be able to do, and *that's* the part that's most interesting to me. — detly, Jun 18 '19 at 04:47
Indeed, removing branching is a *big time* improvement. Bit twiddling would be nice but need to put it down on paper to check what kind of simplifications you could do... — Matthieu, Jun 18 '19 at 04:56
@Matthieu Unfortunately I read through [The Aggregate Magic Algorithms](http://aggregate.org/MAGIC) and couldn't find an appropriate method `:|` (Note that the C5515 doesn't have branch prediction, but the compiler can reorder and parallelise certain arithmetic operations within functions, which would help here). — detly, Jun 18 '19 at 04:59
I don't know the C5515 but some DSPs have different penalties if the test branches or not. You could compute the worst-case result and *then* test the sign with the best-case correcting the result. If I'm not clear I can expand that idea in an answer with more details — Matthieu, Jun 18 '19 at 05:06
As a side note, this is vulnerable and dangerous code: `den > 0 ? 0x60000000 : -0x60000000;`. As it happens, the hex literal `0x60000000` is (signed) `long` on your system, but any hex literal with the MSB set, such as `0x80000000`, would result in an operand that is of type `unsigned long`. Subtle as sin, the following expression will always result in a positive value: `den > 0 ? 0x80000000 : -0x60000000;`. This is because the `?:` operator applies implicit type promotion on the 2nd and 3rd operands, so `-0x60000000` gets promoted to an unsigned type, because of `0x80000000`. — Lundin, Jun 18 '19 at 07:37
Best way to avoid bugs like these is to always type your constants explicitly. Avoid "magic numbers" but use `const` variables when possible. — Lundin, Jun 18 '19 at 07:39
@Lundin good catch, thanks. I had them as hex constants so I could try tricks with bit manipulation and "see" obvious errors, but I should set them back to decimal literals for the final code. — detly, Jun 18 '19 at 10:45
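
Lundin's point about literal types can be checked on a host compiler. The following is a minimal sketch, not code from the thread: the function names are hypothetical, and it assumes `int` is at most 32 bits wide, so that `0x80000000` does not fit a signed type of its rank and becomes unsigned. The compiler may warn about the mixed signedness, which is the point being made.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: 1 if the mixed-literal conditional yields a value > 0.
 * 0x80000000 has an unsigned type here, so the usual arithmetic conversions
 * make the whole ?: expression unsigned and -0x60000000 wraps to a large
 * positive value. */
static int mixed_literals_positive(int32_t den)
{
    return (den > 0 ? 0x80000000 : -0x60000000) > 0;
}

/* Hypothetical helper: same test with explicitly typed constants,
 * so both operands of ?: are int32_t and the sign survives. */
static int typed_constants_positive(int32_t den)
{
    const int32_t pos = INT32_C(0x60000000);
    const int32_t neg = -INT32_C(0x60000000);
    return (den > 0 ? pos : neg) > 0;
}

/* Self-check of the behaviour described above. */
static void check_literal_types(void)
{
    assert(mixed_literals_positive(-5));   /* negative den, still "positive" */
    assert(!typed_constants_positive(-5)); /* sign preserved */
    assert(typed_constants_positive(5));
}
```

Calling `check_literal_types()` from a test harness exercises both cases; the mixed-literal version reports a positive result even for a negative `den`.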

score 1 · Answer 1 · answered Jun 18 '19 at 05:14

If your DSP has different penalties depending on whether a conditional branch is taken, you could compute the result for the more expensive ("worst case") path first, then test the sign so that the cheaper ("best case") path is the one usually taken, and correct the result by adding or subtracting 6 from the earlier pre-computed result. That would minimize the impact of branching (you mentioned there is no branch prediction on your DSP, but the jumps still take some cycles).

For example, if it costs less to not enter the if body:

res = 3 - 2*x; // Consider den > 0
if (den & 0x80000000)
    res -= 6; // Back to -3-2*x

Or if it costs less to enter the if:

res = -3 - 2*x; // Consider den < 0
if ((den & 0x80000000) == 0) // parentheses needed: == binds tighter than &
    res += 6; // Back to 3-2*x
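
Both variants can be checked against the direct form on a host compiler. This is a minimal sketch under stated assumptions: plain integer arithmetic stands in for the fixed-point `3 - 2*x`, the function names are hypothetical, and `den` is cast to `uint32_t` before masking so the sign-bit test does not depend on the literal's type. It says nothing about cycle counts on the C5515.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reference: pick the sign of the 3 directly from the sign of den. */
static int32_t estimate_direct(int32_t den, int32_t x)
{
    return (den > 0 ? 3 : -3) - 2 * x;
}

/* Variant 1: assume den > 0, then correct if the sign bit is set. */
static int32_t estimate_fix_negative(int32_t den, int32_t x)
{
    int32_t res = 3 - 2 * x;
    if ((uint32_t)den & UINT32_C(0x80000000))
        res -= 6; /* back to -3 - 2*x */
    return res;
}

/* Variant 2: assume den < 0, then correct if the sign bit is clear. */
static int32_t estimate_fix_positive(int32_t den, int32_t x)
{
    int32_t res = -3 - 2 * x;
    if (((uint32_t)den & UINT32_C(0x80000000)) == 0)
        res += 6; /* back to 3 - 2*x */
    return res;
}

/* Compare both variants with the reference for nonzero denominators. */
static void check_variants(void)
{
    static const int32_t dens[] = {
        1, 2, 1000, INT32_MAX, -1, -2, -1000, INT32_MIN
    };
    for (size_t i = 0; i < sizeof dens / sizeof dens[0]; i++) {
        for (int32_t x = -5; x <= 5; x++) {
            int32_t want = estimate_direct(dens[i], x);
            assert(estimate_fix_negative(dens[i], x) == want);
            assert(estimate_fix_positive(dens[i], x) == want);
        }
    }
}
```

A denominator of zero is left out on purpose: the question rules it out, and the sign-bit test treats zero as positive while `den > 0` does not.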

FYI I put the subsequent code in the question, it's not literally `3 - 2*x` but that doesn't make a substantial difference to your answer. I don't expect you to know or write the C5515 intrinsics for me! — detly, Jun 18 '19 at 06:15

Craig · Answer 2 · 2019-06-20T16:43:20.790

Consider the following function:

inline int32_t SignOf(int32_t val)
{
    return (+1 | (val >> 31)); // if val < 0 then -1, else +1
}

This should compile to something along the lines of an arithmetic right shift followed by a bitwise OR with 1. E.g. on an ARM Cortex-M0:

ASRS     R2,R1,#+31
MOVS     R0,#+1
ORRS     R0,R0,R2

You could then do

int32_t offset = SignOf(den) * 0x60000000;

Hopefully with some compiler re-ordering and parallelisation it might be faster than the branch?

EDIT:

For the specific case of ±0x60000000, this might be faster:

int32_t offset = ((den >> 1) & 0xC0000000) ^ 0x60000000;
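
Both branch-free forms can be compared with the original ternary on a host compiler before trying them on the target. This is a minimal sketch with hypothetical function names. It assumes `>>` on a negative `int32_t` is an arithmetic shift and that converting an out-of-range `uint32_t` to `int32_t` wraps; both are implementation-defined, which the question allows. Whether either form is actually faster on the C5515 would have to be measured there.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define OFFSET_MAG (INT32_C(0x60000000))

/* Original form from the question. */
static int32_t offset_branch(int32_t den)
{
    return den > 0 ? OFFSET_MAG : -OFFSET_MAG;
}

/* +1 or -1; relies on an arithmetic right shift (implementation-defined). */
static int32_t sign_of(int32_t val)
{
    return 1 | (val >> 31);
}

/* Sign times magnitude. */
static int32_t offset_multiply(int32_t den)
{
    return sign_of(den) * OFFSET_MAG;
}

/* Shift the sign into the top two bits, mask, then flip into +/-0x60000000.
 * den >= 0: 0x00000000 ^ 0x60000000 = 0x60000000
 * den <  0: 0xC0000000 ^ 0x60000000 = 0xA0000000 = -0x60000000 */
static int32_t offset_xor(int32_t den)
{
    uint32_t bits = ((uint32_t)(den >> 1) & UINT32_C(0xC0000000))
                    ^ UINT32_C(0x60000000);
    return (int32_t)bits; /* implementation-defined for values > INT32_MAX */
}

/* Compare all three forms for nonzero denominators. */
static void check_offsets(void)
{
    static const int32_t dens[] = {
        1, 2, 12345, INT32_MAX, -1, -2, -12345, INT32_MIN
    };
    for (size_t i = 0; i < sizeof dens / sizeof dens[0]; i++) {
        int32_t want = offset_branch(dens[i]);
        assert(offset_multiply(dens[i]) == want);
        assert(offset_xor(dens[i]) == want);
    }
}
```

The XOR form avoids the multiply entirely, at the cost of being tied to this particular constant.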
