Here is my code for adding all the elements (lanes) of an int16x4_t vector:
#include <arm_neon.h>
...
int16x4_t acc = vdup_n_s16(1);
int32x2_t acc1;
int64x1_t acc2;
int32_t sum;
acc1 = vpaddl_s16(acc);
acc2 = vpaddl_s32(acc1);
sum = (int)vget_lane_s64(acc2, 0);
printf("%d\n", sum);// 4
I also tried to add all the elements of an int32x4_t vector,
but my code looks inefficient:
#include <arm_neon.h>
...
int32x4_t accl = vdupq_n_s32(1);
int64x2_t accl_1;
int64_t temp;
int64_t temp2;
int32_t sum1;
accl_1=vpaddlq_s32(accl);
temp = (int)vgetq_lane_s64(accl_1,0);
temp2 = (int)vgetq_lane_s64(accl_1,1);
sum1=temp+temp2;
printf("%d\n", sum1);// 4
Is there a simpler and clearer way to do this? I would like the generated LLVM assembly to be simple and clear as well, and I also want the final type of the sum
to be 32 bits.
I used the ellcc cross-compiler, which is based on the LLVM compiler infrastructure, to compile it.
I saw a similar question (Add all elements in a lane) on Stack Overflow, but the addv intrinsic
doesn't work on my host.
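For reference, here is a minimal sketch of the kind of reduction I am after, one that stays in 32 bits the whole way: add the low and high halves of the int32x4_t with vadd_s32, then fold the remaining two lanes with vpadd_s32. The helper name hsum4 and the array-based interface are hypothetical, chosen only for illustration. The scalar branch is a stand-in that mirrors the same lane arithmetic so the snippet still builds when NEON is not available. I am assuming the vaddvq_s32 (addv) intrinsic fails for me because it is only provided for AArch64 targets, and I have not checked what assembly ellcc emits for this version.

```c
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>

/* Hypothetical helper: sum four 32-bit lanes without widening to 64 bits. */
static int32_t hsum4(const int32_t v[4])
{
    int32x4_t q = vld1q_s32(v);                  /* load the four lanes        */
    int32x2_t half = vadd_s32(vget_low_s32(q),   /* {v0 + v2, v1 + v3}         */
                              vget_high_s32(q));
    int32x2_t total = vpadd_s32(half, half);     /* both lanes hold the total  */
    return vget_lane_s32(total, 0);              /* extract a 32-bit result    */
}
#else
/* Scalar stand-in (assumption: used only when NEON is unavailable);
   it follows the same two-step lane arithmetic as the NEON path. */
static int32_t hsum4(const int32_t v[4])
{
    int32_t half0 = v[0] + v[2];
    int32_t half1 = v[1] + v[3];
    return half0 + half1;
}
#endif
```

Calling hsum4 on an array of four ones should give 4, the same as my code above. Unlike the vpaddlq_s32 version, the intermediate sums are not widened, so the result wraps if the total does not fit in 32 bits.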