Here is my code for summing all the elements of an int16x4_t vector:

#include <arm_neon.h>
...
int16x4_t acc = vdup_n_s16(1);
int32x2_t acc1;
int64x1_t acc2;
int32_t sum;
acc1 = vpaddl_s16(acc);
acc2 = vpaddl_s32(acc1);
sum = (int)vget_lane_s64(acc2, 0);
printf("%d\n", sum);// 4

I then tried to sum all the elements of an int32x4_t vector, but my code looks inefficient:

#include <arm_neon.h>
...
int32x4_t accl = vdupq_n_s32(1);
int64x2_t accl_1;
int64_t temp;
int64_t temp2;
int32_t sum1;
accl_1=vpaddlq_s32(accl);
temp = (int)vgetq_lane_s64(accl_1,0);
temp2 = (int)vgetq_lane_s64(accl_1,1);
sum1=temp+temp2;
printf("%d\n", sum1); // 4

Is there a simpler, clearer way to do this? I'd also like the LLVM assembly code to be simple and clear after compilation, and the final type of the sum to be 32 bits.

I used the ellcc cross-compiler, which is based on the LLVM compiler infrastructure, to compile it.

I saw a similar question (Add all elements in a lane) on Stack Overflow, but the addv intrinsic doesn't work on my host.

Shun
  • Why do you feel your code looks inefficient? I see no complicated looping or branching. Just sequential. And is there a problem other than efficiency that you are trying to solve? i.e. does the code do what it is supposed to do? – ryyker Nov 30 '16 at 14:14
  • Because the LLVM assembly code is complicated after compiling it, I'd like to know whether there is an easier way to do this, such as a NEON intrinsic. – Shun Nov 30 '16 at 14:21
  • My goal is to generate LLVM IR automatically through a pass, and that is difficult for me if the LLVM IR is complicated. – Shun Nov 30 '16 at 14:27

1 Answer

If you only want a 32-bit result, presumably either intermediate overflow is unlikely or you simply don't care about it, in which case you could just stay 32-bit all the way:

int32x2_t temp = vadd_s32(vget_high_s32(accl), vget_low_s32(accl));
int32x2_t temp2 = vpadd_s32(temp, temp);
int32_t sum1 = vget_lane_s32(temp2, 0);
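
To see what those three intrinsics compute lane by lane, here is a plain-C model of the same arithmetic. It is only an illustrative sketch: the name `hsum_model_s32x4` is made up, arrays stand in for the NEON registers, and unsigned arithmetic is used to model the wraparound of NEON integer adds.

```c
#include <stdint.h>

/* Hypothetical scalar model of the 32-bit reduction above.
 * Arrays stand in for the D/Q registers; this is not NEON code,
 * just the same lane arithmetic written out by hand. */
int32_t hsum_model_s32x4(const int32_t q[4])
{
    uint32_t temp[2];
    uint32_t temp2[2];

    /* vadd_s32(vget_high_s32(q), vget_low_s32(q)):
     * lane-wise add of the upper half {q[2], q[3]}
     * and the lower half {q[0], q[1]}. */
    temp[0] = (uint32_t)q[2] + (uint32_t)q[0];
    temp[1] = (uint32_t)q[3] + (uint32_t)q[1];

    /* vpadd_s32(temp, temp): add the adjacent pair of each operand,
     * so both result lanes hold the full sum. */
    temp2[0] = temp[0] + temp[1];
    temp2[1] = temp[0] + temp[1];

    /* vget_lane_s32(temp2, 0): read lane 0. */
    return (int32_t)temp2[0];
}
```

For the all-ones vector from the question, the model goes {1, 1, 1, 1} -> {2, 2} -> {4, 4} and returns 4.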

However, using 64-bit accumulation isn't actually any more hassle, and can also be done without dropping out of NEON - it's just a different order of operations:

int64x2_t temp = vpaddlq_s32(accl);
int64x1_t temp2 = vadd_s64(vget_high_s64(temp), vget_low_s64(temp));
int32_t sum1 = (int32_t)vget_lane_s64(temp2, 0);

Either of those boils down to just 3 NEON instructions and no scalar arithmetic. The crucial trick on 32-bit ARM is that a pairwise add of two halves of a Q register is simply a normal add of two D registers - that doesn't apply to AArch64 where the SIMD register layout is different, but then AArch64 has the aforementioned horizontal addv anyway.
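
As a rough sketch of how this could be packaged, the helper below picks `vaddvq_s32` on AArch64 and the add-halves-then-pairwise-add sequence on 32-bit ARM. The names `hsum_s32x4` and `sum4_s32` are made up for illustration, and the scalar fallback is only there so the snippet still builds on a host without NEON - it is not part of the technique itself.

```c
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>

/* Hypothetical helper: horizontal sum of four int32 lanes, 32-bit result. */
static inline int32_t hsum_s32x4(int32x4_t v)
{
#if defined(__aarch64__)
    /* AArch64 has a dedicated across-lanes add. */
    return vaddvq_s32(v);
#else
    /* 32-bit ARM: add the two D halves of the Q register,
     * then pairwise-add the remaining two lanes. */
    int32x2_t half = vadd_s32(vget_high_s32(v), vget_low_s32(v));
    int32x2_t pair = vpadd_s32(half, half);
    return vget_lane_s32(pair, 0);
#endif
}

/* Hypothetical wrapper: load four values from memory and sum them. */
int32_t sum4_s32(const int32_t p[4])
{
    return hsum_s32x4(vld1q_s32(p));
}

#else

/* Assumed fallback for hosts without NEON: same result, scalar code.
 * Unsigned arithmetic mirrors the wraparound of the NEON adds. */
int32_t sum4_s32(const int32_t p[4])
{
    uint32_t s = (uint32_t)p[0] + (uint32_t)p[1]
               + (uint32_t)p[2] + (uint32_t)p[3];
    return (int32_t)s;
}

#endif
```

Calling `sum4_s32` on an array holding {1, 1, 1, 1} gives 4, matching the example in the question.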

Now, how horrible any of that looks in LLVM IR I don't know - I suppose it depends on how it treats vector types and operations internally - but in terms of the final ARM machine code both could be considered optimal.

Notlikethat
  • Your answer is helpful! There are `shufflevector` instructions instead of type conversion instruction such as `trunc` or `sext` in my LLVM IR. Thanks a lot! – Shun Dec 01 '16 at 03:30