I'm rather new to assembly, and although the ARM Information Center is often helpful, the instruction descriptions can be a little confusing to a newbie. Basically, what I need to do is sum the 4 float values in a quadword register and store the result in a single-precision register. I think the VPADD instruction can do what I need, but I'm not quite sure.
3 Answers
You might try this (it's not in ASM, but you should be able to convert it easily):
float32x2_t r = vadd_f32(vget_high_f32(m_type), vget_low_f32(m_type));
return vget_lane_f32(vpadd_f32(r, r), 0);
In ASM it would probably be only a VADD and a VPADD.
I'm not sure whether this is the only way to do it (or the most optimal one), but I haven't figured out or found a better one...
PS. I'm new to NEON too
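For anyone who wants to try the two lines above in isolation, here is a minimal self-contained sketch. The function name `hsum4` and the pointer argument are my own assumptions (the original snippet reads from a `m_type` member instead), and the scalar branch is only there so the file also builds on non-ARM machines; it mirrors the same order of additions.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Sum the four floats at v. hsum4 is a hypothetical helper name. */
float hsum4(const float *v)
{
    assert(v != NULL);
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    float32x4_t q = vld1q_f32(v);  /* load 4 lanes into a Q register */
    /* Add the high half to the low half: {v2+v0, v3+v1} */
    float32x2_t r = vadd_f32(vget_high_f32(q), vget_low_f32(q));
    /* Pairwise add folds the two remaining lanes; lane 0 holds the total */
    return vget_lane_f32(vpadd_f32(r, r), 0);
#else
    /* Scalar fallback (assumption: no NEON), same order of additions */
    float r0 = v[2] + v[0];
    float r1 = v[3] + v[1];
    return r0 + r1;
#endif
}
```

Calling `hsum4` on an array holding 1, 2, 3, 4 should give 10. On ARM the NEON branch is expected to compile down to roughly one VADD and one VPADD, but that depends on the compiler.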

Thanks, I managed to get this to work using one VPADD and two VADDs. I was hoping to need only 1 or 2 instructions, but I think 3 will just have to do. – A Person Aug 05 '11 at 00:08
Could you show your ASM? I think it will require only one VADD and one VPADD (at least that's how it looks from the C code). – Krystian Bigaj Aug 05 '11 at 08:30
I was wondering whether we can use `vaddvq_f32` directly. It performs the addition across the whole vector. – Shailesh Oct 17 '21 at 11:54
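A rough sketch of what the `vaddvq_f32` suggestion could look like. As far as I know that intrinsic is only available when targeting AArch64, so it is guarded here; the name `hsum4_across` and the scalar fallback for other targets are assumptions of mine, not part of the comment.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__aarch64__)
#include <arm_neon.h>
#endif

/* Sum the four floats at v. hsum4_across is a hypothetical helper name. */
float hsum4_across(const float *v)
{
    assert(v != NULL);
#if defined(__aarch64__)
    /* Across-vector add of all four lanes (AArch64 only) */
    return vaddvq_f32(vld1q_f32(v));
#else
    /* Fallback for other targets (assumption): plain pairwise sum */
    return (v[0] + v[1]) + (v[2] + v[3]);
#endif
}
```

On a 32-bit ARM target the guard fails and the fallback is used, so the VADD/VPADD approach from the answer is still the one to use there.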
Here is the code in ASM:
vpadd.f32 d1,d6,d7 @ q3 (d6:d7) is the register whose contents need to be summed
vadd.f32 s1,s2,s3 @ d1 is s2:s3, so this adds its two lanes together (the sum)
vadd.f32 s0,s0,s1 @ sum += s1;
I may have forgotten to mention that in C the code would look like this:
float sum = 1.0f;
sum += number1 * number2;
I have omitted the multiplication from this little piece of ASM code.
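To make the register aliasing easier to follow, here is a plain C model of what the three instructions compute, one line per lane. The name `accumulate_q3` and passing the four lanes of q3 as an array are assumptions for illustration only; this is a scalar stand-in, not NEON code.

```c
#include <assert.h>
#include <stddef.h>

/* Scalar model of the three instructions. q3 points at the four lanes
 * of register q3 (d6 = lanes 0-1, d7 = lanes 2-3). Hypothetical helper. */
float accumulate_q3(float sum, const float *q3)
{
    assert(q3 != NULL);
    float s2 = q3[0] + q3[1];  /* vpadd.f32 d1,d6,d7 -> low lane of d1  */
    float s3 = q3[2] + q3[3];  /* vpadd.f32 d1,d6,d7 -> high lane of d1 */
    float s1 = s2 + s3;        /* vadd.f32 s1,s2,s3                     */
    return sum + s1;           /* vadd.f32 s0,s0,s1  (sum += s1)        */
}
```

With `sum` starting at 1.0f and lanes 1, 2, 3, 4, the result is 11.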

It seems that you want to sum an array of some length, not just four float values.
In that case, your code will work, but it is far from optimized:
- many pipeline interlocks
- an unnecessary 32-bit addition per iteration
Assuming the length of the array is a multiple of 8 and at least 16:
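vldmia pSrc!, {q0-q1} @ preload the first 8 floats as two accumulators
sub count, count, #8
loop:
pld [pSrc, #32] @ prefetch ahead of the next load
vldmia pSrc!, {q2-q3} @ next 8 floats (q2-q3 are caller-saved, unlike q4)
subs count, count, #8
vadd.f32 q0, q0, q2
vadd.f32 q1, q1, q3
bgt loop
vadd.f32 q0, q0, q1 @ merge the two accumulators
vpadd.f32 d0, d0, d1 @ pairwise add: 4 lanes -> 2
vadd.f32 s0, s0, s1 @ final sum ends up in s0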
- pld, although it is an ARM instruction rather than a NEON one, is crucial for performance: it prefetches data into the cache ahead of the loads, which greatly increases the cache hit rate.
I hope the rest of the code above is self explanatory.
You will notice that this version is many times faster than your initial one.
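If you would rather stay in C, the same idea (two independent accumulators, reduced only once at the end) can be sketched with intrinsics. The name `sum_array`, its signature, and the scalar fallback for non-ARM builds are assumptions of mine; the prefetch hint is left out to keep it portable. Because it uses a `for` loop, this sketch also accepts a length of exactly 8, whereas the assembly above needs at least 16.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Sum n floats; n must be a non-zero multiple of 8. Hypothetical helper. */
float sum_array(const float *src, size_t n)
{
    assert(src != NULL && n >= 8 && n % 8 == 0);
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    /* Two accumulators so consecutive adds do not depend on each other */
    float32x4_t acc0 = vld1q_f32(src);
    float32x4_t acc1 = vld1q_f32(src + 4);
    for (size_t i = 8; i < n; i += 8) {
        acc0 = vaddq_f32(acc0, vld1q_f32(src + i));
        acc1 = vaddq_f32(acc1, vld1q_f32(src + i + 4));
    }
    acc0 = vaddq_f32(acc0, acc1);  /* merge the accumulators */
    /* Pairwise add: 4 lanes -> 2, then add the last two lanes */
    float32x2_t r = vpadd_f32(vget_low_f32(acc0), vget_high_f32(acc0));
    return vget_lane_f32(r, 0) + vget_lane_f32(r, 1);
#else
    /* Scalar fallback (assumption: no NEON) with the same lane layout */
    float acc[8];
    for (size_t k = 0; k < 8; ++k)
        acc[k] = src[k];
    for (size_t i = 8; i < n; i += 8)
        for (size_t k = 0; k < 8; ++k)
            acc[k] += src[i + k];
    for (size_t k = 0; k < 4; ++k)
        acc[k] += acc[k + 4];
    return (acc[0] + acc[1]) + (acc[2] + acc[3]);
#endif
}
```

Note that the lanes are summed in a different order than a simple left-to-right loop, so for general float data the result can differ slightly in the last bits.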
