Neon intrinsics with complex numbers

Question

I have a lot of calculations with complex numbers (usually an array containing a struct consisting of two floats to represent im and re; see below) and want to speed them up with the NEON C intrinsics. It would be awesome if you could give me an example of how to speed up things like this:

for (n = 0; n < 1024; n++, p++, ptemp++) {  // get cir_abs, also find the biggest point (value and location).
    abs_squared = (Uns32)(((Int32)(p->re)) * ((Int32)(p->re))
                  + ((Int32)(p->im)) * ((Int32)(p->im)));
    // ...
}

p is an array of this kind:

typedef struct {
    Int16 re;
    Int16 im;
} Complex;

I have already read chapter 12 of "ARM C Language Extensions", but I still don't understand how to load and store this kind of struct so that I can do the calculations on it.

I think it's more suitable to post it on another StackExchange site, like `Code Review` for example. — Eel Lee, Feb 18 '14 at 23:37
Did so: https://codereview.stackexchange.com/questions/42051/neon-intrinsics-with-complex-numbers — marcel, Feb 18 '14 at 23:56

Marat Dukhan · Accepted Answer · 2014-02-19T00:11:00.777


Use vld2* intrinsics to split re and im into different registers upon load, and then process them separately, e.g.

Complex array[16];

const int16x8x2_t vec_complex = vld2q_s16((const int16_t*)array);
const int16x8_t vec_re = vec_complex.val[0];
const int16x8_t vec_im = vec_complex.val[1];
const int16x8_t vec_abssq = vmlaq_s16(vmulq_s16(vec_re, vec_re), vec_im, vec_im);

For the above code, clang 3.3 generates:

vld2.16 {d18, d19, d20, d21}, [r0]
vmul.i16 q8, q10, q10
vmla.i16 q8, q9, q9

edited Feb 19 '14 at 00:11

answered Feb 18 '14 at 23:58

Marat Dukhan


Thanks, seems like the thing I was searching for. However I'd produce overflows with it, wouldn't I? So I'll probably do everything with the int32x4 types. – marcel Feb 19 '14 at 00:41
You may similarly access the `int16x4_t` halves of an `int16x8_t` and use `vaddl_s16`/`vmull_s16`/`vmlal_s16` to produce four 32-bit results as an `int32x4_t` (note that these operations take 64-bit SIMD registers as input and produce a 128-bit SIMD register). – Marat Dukhan Feb 19 '14 at 01:26
E.g. `const int32x4_t vec_abssq_lo = vmlal_s16(vmull_s16(vget_low_s16(vec_re), vget_low_s16(vec_re)), vget_low_s16(vec_im), vget_low_s16(vec_im));` – Marat Dukhan Feb 19 '14 at 01:29
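Putting the two comments above together, a full loop might look like the sketch below. The function name `abs_squared_s16`, the `HAVE_NEON` macro, and the scalar fallback are assumptions added for illustration and are not part of the original answer. The `Complex` struct mirrors the one in the question, using `<stdint.h>` types. Sums are stored as unsigned 32-bit values because re = im = -32768 gives 2^31, which does not fit in a signed 32-bit integer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1   /* hypothetical macro: NEON path is compiled in */
#else
#define HAVE_NEON 0   /* non-ARM build: only the scalar loop is used */
#endif

typedef struct {
    int16_t re;
    int16_t im;
} Complex;  /* same layout as the struct in the question */

/* Hypothetical helper: out[i] = re[i]^2 + im[i]^2 with 32-bit results. */
void abs_squared_s16(const Complex *in, uint32_t *out, size_t n)
{
    size_t i = 0;
    assert(in != NULL && out != NULL);
#if HAVE_NEON
    for (; i + 8 <= n; i += 8) {
        /* De-interleave 8 complex values: val[0] holds re, val[1] holds im. */
        const int16x8x2_t v = vld2q_s16((const int16_t *)(in + i));
        const int16x4_t re_lo = vget_low_s16(v.val[0]);
        const int16x4_t re_hi = vget_high_s16(v.val[0]);
        const int16x4_t im_lo = vget_low_s16(v.val[1]);
        const int16x4_t im_hi = vget_high_s16(v.val[1]);
        /* Widening multiply (16x16 -> 32), then widening multiply-accumulate. */
        const int32x4_t lo = vmlal_s16(vmull_s16(re_lo, re_lo), im_lo, im_lo);
        const int32x4_t hi = vmlal_s16(vmull_s16(re_hi, re_hi), im_hi, im_hi);
        vst1q_u32(out + i,     vreinterpretq_u32_s32(lo));
        vst1q_u32(out + i + 4, vreinterpretq_u32_s32(hi));
    }
#endif
    /* Scalar tail (and the whole loop when NEON is unavailable). */
    for (; i < n; ++i) {
        const int32_t re = in[i].re;
        const int32_t im = in[i].im;
        out[i] = (uint32_t)(re * re) + (uint32_t)(im * im);
    }
}
```

For the 1024-element array in the question, this would be called as `abs_squared_s16(p, result, 1024)`, where `result` is a caller-provided `uint32_t[1024]` buffer.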
Unfortunately, the loop with the NEON intrinsics takes even longer than the "unNEONified" loop. The store operation is the most time-consuming part: the loop without the store takes 0.39 µs, with the store 12.4 µs. Is there any way to improve this? – marcel Feb 19 '14 at 03:13

Unroll to hide latency. – Marat Dukhan Feb 19 '14 at 05:10
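One possible reading of "unroll" is to process 16 complex values per iteration, so that two loads and four multiply chains are independent of each other. The sketch below is an illustration only: the name `abs_squared_s16_unrolled` and the `HAVE_NEON` macro are hypothetical, the scalar fallback is an addition, and whether this actually hides latency depends on the core and the compiler (no timing is claimed here).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1   /* hypothetical macro: NEON path is compiled in */
#else
#define HAVE_NEON 0   /* non-ARM build: only the scalar loop is used */
#endif

typedef struct {
    int16_t re;
    int16_t im;
} Complex;  /* same layout as the struct in the question */

/* Hypothetical helper: out[i] = re[i]^2 + im[i]^2, unrolled by two vectors. */
void abs_squared_s16_unrolled(const Complex *in, uint32_t *out, size_t n)
{
    size_t i = 0;
    assert(in != NULL && out != NULL);
#if HAVE_NEON
    for (; i + 16 <= n; i += 16) {
        /* Issue both de-interleaving loads before any arithmetic. */
        const int16x8x2_t a = vld2q_s16((const int16_t *)(in + i));
        const int16x8x2_t b = vld2q_s16((const int16_t *)(in + i + 8));

        /* Four independent re*re products (16x16 -> 32 bit). */
        int32x4_t a_lo = vmull_s16(vget_low_s16(a.val[0]),  vget_low_s16(a.val[0]));
        int32x4_t a_hi = vmull_s16(vget_high_s16(a.val[0]), vget_high_s16(a.val[0]));
        int32x4_t b_lo = vmull_s16(vget_low_s16(b.val[0]),  vget_low_s16(b.val[0]));
        int32x4_t b_hi = vmull_s16(vget_high_s16(b.val[0]), vget_high_s16(b.val[0]));

        /* Accumulate im*im into each chain. */
        a_lo = vmlal_s16(a_lo, vget_low_s16(a.val[1]),  vget_low_s16(a.val[1]));
        a_hi = vmlal_s16(a_hi, vget_high_s16(a.val[1]), vget_high_s16(a.val[1]));
        b_lo = vmlal_s16(b_lo, vget_low_s16(b.val[1]),  vget_low_s16(b.val[1]));
        b_hi = vmlal_s16(b_hi, vget_high_s16(b.val[1]), vget_high_s16(b.val[1]));

        /* Store 16 results; unsigned because the sum can reach 2^31. */
        vst1q_u32(out + i,      vreinterpretq_u32_s32(a_lo));
        vst1q_u32(out + i + 4,  vreinterpretq_u32_s32(a_hi));
        vst1q_u32(out + i + 8,  vreinterpretq_u32_s32(b_lo));
        vst1q_u32(out + i + 12, vreinterpretq_u32_s32(b_hi));
    }
#endif
    /* Scalar tail (and the whole loop when NEON is unavailable). */
    for (; i < n; ++i) {
        const int32_t re = in[i].re;
        const int32_t im = in[i].im;
        out[i] = (uint32_t)(re * re) + (uint32_t)(im * im);
    }
}
```

With n = 1024 the vector loop covers the whole array, since 1024 is a multiple of 16; the scalar tail only runs for other lengths.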
Didn't help. I think I'll stick to the version without NEON then. Thanks again for the help. – marcel Feb 20 '14 at 03:41
