Add all elements in a lane

Question

Is there an intrinsic that adds up all of the lanes of a vector? I am using NEON to multiply 8 pairs of numbers, and I need to sum the results. Here is some paraphrased code to show what I'm currently doing (this could probably be optimised):

int16_t p[8], q[8], r[8];
int32_t sum;
int16x8_t pneon, qneon, result;

p[0] = some_number;
p[1] = some_other_number; 
//etc etc
pneon = vld1q_s16(p);

q[0] = some_other_other_number;
q[1] = some_other_other_other_number;
//etc etc
qneon = vld1q_s16(q);
result = vmulq_s16(pneon, qneon);
vst1q_s16(r,result);
sum = ((int32_t) r[0] + (int32_t) r[1] + ... //etc );

Is there a "better" way to do this?

score 5 · Answer 1 · answered Jul 10 '15 at 05:05

If you're targeting the newer 64-bit ARM architecture (AArch64), then ADDV is just the right instruction for you.

Here's how your code will look with it.

qneon = vld1q_s16(q);
result = vmulq_s16(pneon, qneon);
sum = vaddvq_s16(result);

That's it. Just one instruction to sum up all of the lanes in the vector register.

Sadly, this instruction isn't available on the older 32-bit ARM architecture.
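One caveat: `vaddvq_s16` returns an `int16_t`, so the total can wrap even though `sum` is 32 bits wide. A minimal sketch, assuming AArch64 and assuming each individual product still fits in 16 bits (as `vmulq_s16` in the question already requires), is to use the widening reduction `vaddlvq_s16` instead. The helper name `sum_products_s16` and the scalar fallback for other targets are illustrative, not part of the original answer.

```c
#include <stdint.h>

#if defined(__aarch64__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: multiplies 8 pairs of 16-bit values and returns
 * the sum of the 16-bit products as a 32-bit integer.
 * Assumption: every product p[i] * q[i] fits in int16_t. */
int32_t sum_products_s16(const int16_t p[8], const int16_t q[8])
{
#if defined(__aarch64__)
    const int16x8_t pv = vld1q_s16(p);
    const int16x8_t qv = vld1q_s16(q);
    const int16x8_t prod = vmulq_s16(pv, qv); /* 8 products, 16 bits each */
    return vaddlvq_s16(prod);                 /* widen to 32 bits while adding across lanes */
#else
    /* Scalar fallback so the sketch also builds on non-AArch64 targets. */
    int32_t sum = 0;
    for (int i = 0; i < 8; ++i)
        sum += (int16_t)(p[i] * q[i]);
    return sum;
#endif
}
```

With eight products of 10000 each, this returns 80000, which would not fit in the `int16_t` that `vaddvq_s16` returns.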

Marat Dukhan · Answer 2 · 2012-08-29T08:12:31.880

Something like this should be close to optimal (caution: not tested):

const int16x4_t result_low = vget_low_s16(result); // Extract low 4 elements
const int16x4_t result_high = vget_high_s16(result); // Extract high 4 elements
const int32x4_t twopartsum = vaddl_s16(result_low, result_high); // Extend to 32 bits and add (4 partial 32-bit sums are formed)
const int32x2_t twopartsum_low = vget_low_s32(twopartsum); // Extract 2 low 32-bit partial sums
const int32x2_t twopartsum_high = vget_high_s32(twopartsum); // Extract 2 high 32-bit partial sums
const int32x2_t fourpartsum = vadd_s32(twopartsum_low, twopartsum_high); // Add partial sums (2 partial 32-bit sums are formed)
const int32x2_t eightpartsum = vpadd_s32(fourpartsum, fourpartsum); // Final reduction
const int32_t sum = vget_lane_s32(eightpartsum, 0); // Move to general-purpose registers
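To make the reduction tree easier to follow, here is the same sequence wrapped in a function, with a scalar model of each step for targets without NEON. This is only a sketch: the helper name `hsum8_s16`, the array-based signature, and the scalar branch are assumptions added for illustration.

```c
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: 32-bit sum of eight 16-bit values. */
int32_t hsum8_s16(const int16_t r[8])
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    const int16x8_t v = vld1q_s16(r);
    /* s4[i] = r[i] + r[i + 4], widened to 32 bits */
    const int32x4_t s4 = vaddl_s16(vget_low_s16(v), vget_high_s16(v));
    /* s2[i] = s4[i] + s4[i + 2] */
    const int32x2_t s2 = vadd_s32(vget_low_s32(s4), vget_high_s32(s4));
    /* both lanes of s1 hold s2[0] + s2[1] */
    const int32x2_t s1 = vpadd_s32(s2, s2);
    return vget_lane_s32(s1, 0);
#else
    /* Scalar model of the same three steps. */
    int32_t s4[4], s2[2];
    for (int i = 0; i < 4; ++i)
        s4[i] = (int32_t)r[i] + (int32_t)r[i + 4];
    for (int i = 0; i < 2; ++i)
        s2[i] = s4[i] + s4[i + 2];
    return s2[0] + s2[1];
#endif
}
```

Because the first step widens to 32 bits, eight inputs of 32767 sum to 262136 without wrapping.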

edited Aug 29 '12 at 08:12

answered Aug 29 '12 at 05:27

Marat Dukhan


`const int32x2_t eightpartsum = vpadd_s32(fourpartsum)` doesn't work. I think it should be `const int64x1_t eightpartsum = vpaddl_s32(fourpartsum)`. Nevertheless, I changed it and it compiles, but it's actually much slower than my previous method... – NOP Aug 29 '12 at 05:57
I don't think you'll see much improvement unless this is in a tight loop. I've tested this, and in a loop of 100000 iterations I get around a 40% improvement in general. With NEON intrinsics it is also quite important to use a recent compiler; I've used gcc 4.7.1. – auselen Aug 29 '12 at 07:09
vpadd_s32(fourpartsum) should be vpadd_s32(fourpartsum, fourpartsum). I edited the post to fix it. – Marat Dukhan Aug 29 '12 at 08:13

score 0 · Answer 3 · edited Jun 27 '18 at 05:53

variance_temp = vadd_f32(vget_high_f32(variance_n), vget_low_f32(variance_n));
sum = vget_lane_f32(vpadd_f32(variance_temp, variance_temp), 0);
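For readers who want the steps spelled out, here is a hedged sketch of the same idea for a four-lane float vector. The high and low halves are added first, then a pairwise add folds the two remaining lanes into one. The helper name `hsum4_f32`, the array-based signature, and the scalar fallback are assumptions added for illustration.

```c
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: sum of four floats. */
float hsum4_f32(const float v[4])
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    const float32x4_t x = vld1q_f32(v);
    /* half = { v[2] + v[0], v[3] + v[1] } */
    const float32x2_t half = vadd_f32(vget_high_f32(x), vget_low_f32(x));
    /* pairwise add: both lanes become half[0] + half[1] */
    return vget_lane_f32(vpadd_f32(half, half), 0);
#else
    /* Scalar fallback using the same order of additions. */
    const float half0 = v[2] + v[0];
    const float half1 = v[3] + v[1];
    return half0 + half1;
#endif
}
```

Note that the additions are grouped differently from a left-to-right scalar loop, so results can differ in the last bits for inputs that do not add exactly.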

edited Jun 27 '18 at 05:53

4b0


answered Jun 27 '18 at 05:51

Shailendra Yadav

Can you add some explanations? – aloisdg Jun 27 '18 at 12:07
