I need to run parallel comparisons on uint16x8_t vectors and increment a local counter according to the result, for example by 8 if all elements of the vector compare as true. I implemented this algorithm:

...
register int objects = 0;
uint16x8_t vcmp0,vobj;
uint32x2_t dobj;
register uint32_t temp0;
...
vobj = vreinterpretq_u16_u8(vcntq_u8(vreinterpretq_u8_u16(vcmp0))); /* set bits per byte: 0 or 8 */
vobj = vpaddlq_u8(vreinterpretq_u8_u16(vobj)); /* 16 bytes -> 8 x u16 sums */
vobj = vreinterpretq_u16_u32(vpaddlq_u16(vobj)); /* -> 4 x u32 sums */
vobj = vreinterpretq_u16_u64(vpaddlq_u32(vreinterpretq_u32_u16(vobj))); /* -> 2 x u64 sums */
dobj = vmovn_u64(vreinterpretq_u64_u16(vobj)); /* narrow to 2 x u32 */
dobj = vreinterpret_u32_u64(vpaddl_u32(dobj)); /* total set bits in lane 0 */
    __asm__ __volatile__
            (
             "vmov.u32  %[temp0] , %[dobj][0]               \n\t"
             "add  %[objects] ,%[objects], %[temp0], asr #4               \n\t"
             : [dobj]"+w"(dobj), [temp0]"=r"(temp0), [objects]"+r"(objects)
             :
             : "memory"
            );

...

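The vector vcmp0 holds the comparison results, vobj and dobj are used for intermediate computation, and objects is the counter. I count the set bits in the comparison mask and then sum them with pairwise adds; each true element contributes 16 set bits, hence the shift by 4. Is there a faster way to do this?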
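For reference, the same per-vector reduction can probably be written with intrinsics only, without the inline assembly. The following is a minimal sketch, not the code from the question: the helper name count_true_lanes and the choice to pass the mask as a plain array are assumptions made so the example is self-contained, and the scalar branch exists only so it also builds on targets without NEON.

```c
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: number of true (0xFFFF) lanes in an 8-lane
 * compare mask. Assumes every lane is either 0x0000 or 0xFFFF. */
static int count_true_lanes(const uint16_t mask[8])
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    uint16x8_t vcmp0 = vld1q_u16(mask);
    /* Set bits per byte: 0 or 8 for a compare mask. */
    uint8x16_t bits = vcntq_u8(vreinterpretq_u8_u16(vcmp0));
    /* Widening pairwise adds: 16 x u8 -> 8 x u16 -> 4 x u32 -> 2 x u64. */
    uint16x8_t s16 = vpaddlq_u8(bits);
    uint32x4_t s32 = vpaddlq_u16(s16);
    uint64x2_t s64 = vpaddlq_u32(s32);
    /* Narrow to 2 x u32 and add the two halves. */
    uint64x1_t sum = vpaddl_u32(vmovn_u64(s64));
    /* 16 set bits per true lane, so divide by 16. */
    return (int)(vget_lane_u64(sum, 0) >> 4);
#else
    /* Scalar fallback: count set bits, then divide by 16. */
    int bits = 0;
    for (int i = 0; i < 8; ++i)
        for (uint16_t m = mask[i]; m; m &= (uint16_t)(m - 1))
            ++bits;
    return bits >> 4;
#endif
}
```

With this, the counter update would be objects += count_true_lanes(mask); and the compiler is free to schedule the lane extraction itself.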

exbluesbreaker
  • It would be quicker to sum the bit counts "vertically" across multiple vectors and then pairwise add as the last step (if you just want a total count of matching elements). – BitBank Apr 05 '13 at 18:53
  • @BitBank Thank you for your advice. I had missed this optimization and have now applied it successfully. – exbluesbreaker Apr 07 '13 at 12:28
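The "vertical" accumulation suggested in the comments might look like the sketch below. It is an illustration under stated assumptions, not the code the asker ended up with: the function name count_equal_u16, the input arrays a and b, and the use of vceqq_u16 as the comparison are all hypothetical, since the question elides how vcmp0 is produced. It also swaps in a slightly different technique from the one named in the comment: instead of summing bit counts, it subtracts each mask from a per-lane accumulator, which adds 1 per true lane because a true lane is 0xFFFF (that is, -1). The horizontal pairwise adds then run once per block rather than once per vector.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: counts positions i where a[i] == b[i]. */
static uint32_t count_equal_u16(const uint16_t *a, const uint16_t *b, size_t n)
{
    uint32_t total = 0;
    size_t i = 0;

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    while (n - i >= 8) {
        uint16x8_t acc = vdupq_n_u16(0);

        /* At most 65535 vectors per block so no 16-bit lane can overflow. */
        size_t block = (n - i) / 8;
        if (block > 65535)
            block = 65535;

        for (size_t k = 0; k < block; ++k, i += 8) {
            /* Assumed comparison; the original does not show it. */
            uint16x8_t vcmp0 = vceqq_u16(vld1q_u16(a + i), vld1q_u16(b + i));
            /* Vertical step: true lanes are 0xFFFF (-1), so this adds 1. */
            acc = vsubq_u16(acc, vcmp0);
        }

        /* Horizontal reduction, once per block: 8 x u16 -> 4 x u32 -> 2 x u64. */
        uint64x2_t s64 = vpaddlq_u32(vpaddlq_u16(acc));
        total += (uint32_t)(vgetq_lane_u64(s64, 0) + vgetq_lane_u64(s64, 1));
    }
#endif

    /* Scalar tail, and the whole loop on targets without NEON. */
    for (; i < n; ++i)
        total += (a[i] == b[i]);

    return total;
}
```

Whether this is actually faster on a given core would need measuring; the intent is only to show where the pairwise adds move when the accumulation is done per lane first.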

0 Answers