ARM NEON count compare result

Question

I need to do some parallel comparisons on uint16x8_t vectors and increment a local counter according to the result, for example by 8 if all elements of the vector compare as true. I implemented this algorithm:

...
register int objects = 0;
uint16x8_t vcmp0,vobj;
uint32x2_t dobj;
register uint32_t temp0;
...
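// Count set bits per byte: a true 16-bit lane (0xFFFF) contributes 8 + 8 = 16 bits.
vobj = vreinterpretq_u16_u8(vcntq_u8(vreinterpretq_u8_u16(vcmp0)));
// Widening pairwise adds: 16 x u8 -> 8 x u16 -> 4 x u32 -> 2 x u64.
vobj = vpaddlq_u8(vreinterpretq_u8_u16(vobj));
vobj = vreinterpretq_u16_u32(vpaddlq_u16(vobj));
vobj = vreinterpretq_u16_u64(vpaddlq_u32(vreinterpretq_u32_u16(vobj)));
// Narrow to 2 x u32 and add the halves; lane 0 of dobj holds the total bit count.
dobj = vmovn_u64(vreinterpretq_u64_u16(vobj));
dobj = vreinterpret_u32_u64(vpaddl_u32(dobj));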
    __asm__ __volatile__
            (
             "vmov.u32  %[temp0] , %[dobj][0]               \n\t"
             "add  %[objects] ,%[objects], %[temp0], asr #4               \n\t"
             : [dobj]"+w"(dobj), [temp0]"=r"(temp0), [objects]"+r"(objects)
             :
             : "memory"
            );

...

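The vector vcmp0 contains the comparison results, vobj and dobj are used for the computation, and objects is the counter. I count the set bits and reduce them with pairwise adds; since each true element contributes 16 set bits, the final shift by 4 turns the bit count into an element count. Is there a faster way to do this?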
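For reference, the per-vector reduction described above can probably be written with intrinsics alone, without the inline assembly. The following is only a sketch under assumptions: the helper name count_true_lanes and its array parameter are hypothetical, every lane of the mask is assumed to be either 0xFFFF or 0x0000 (as NEON compare instructions produce), and the scalar branch exists only so the code also builds on targets without NEON.

```c
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: mask holds 8 compare results, each 0xFFFF or 0x0000.
 * Returns how many of the 8 lanes are true. */
uint32_t count_true_lanes(const uint16_t mask[8])
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    /* Per-byte popcount: every true 16-bit lane yields two bytes of 8. */
    uint8x16_t bits = vcntq_u8(vreinterpretq_u8_u16(vld1q_u16(mask)));
    /* Widening pairwise adds: 16 x u8 -> 8 x u16 -> 4 x u32 -> 2 x u64. */
    uint16x8_t s16 = vpaddlq_u8(bits);
    uint32x4_t s32 = vpaddlq_u16(s16);
    uint64x2_t s64 = vpaddlq_u32(s32);
    uint32_t set_bits =
        (uint32_t)(vgetq_lane_u64(s64, 0) + vgetq_lane_u64(s64, 1));
#else
    /* Scalar fallback (assumption: only for non-NEON builds). */
    uint32_t set_bits = 0;
    for (int i = 0; i < 8; ++i)
        for (uint16_t v = mask[i]; v != 0; v &= (uint16_t)(v - 1))
            ++set_bits;
#endif
    /* 16 set bits per true lane, so divide by 16. */
    return set_bits >> 4;
}
```

Whether this is faster than the inline-assembly version depends on the compiler and core; it mainly lets the compiler schedule the lane extraction itself.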

It would be quicker to sum the bit counts "vertically" across multiple vectors and then pairwise add as the last step (if you just want a total count of matching elements). — BitBank, Apr 05 '13 at 18:53
@BitBank Thank you for your advice. I had missed this optimization and have now applied it successfully. — exbluesbreaker, Apr 07 '13 at 12:28
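The "vertical" accumulation suggested in the comment might look roughly like the sketch below. It is not the code the asker ended up with, and it makes several assumptions: the function name count_greater_u16, the greater-than comparison, and the threshold parameter are all hypothetical stand-ins for whatever comparison produces vcmp0. As a simplification it accumulates the compare masks directly instead of their bit counts: a true lane is 0xFFFF, so subtracting it from a 16-bit counter adds 1. The horizontal pairwise adds then run once after the loop rather than once per vector.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: counts the elements of data[0 .. 8*nvec) that are
 * greater than threshold. */
uint32_t count_greater_u16(const uint16_t *data, size_t nvec, uint16_t threshold)
{
    /* Each 16-bit lane counter grows by at most 1 per vector, so it
     * cannot overflow while nvec stays within 65535. */
    assert(nvec <= 65535);
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    uint16x8_t vthr = vdupq_n_u16(threshold);
    uint16x8_t vacc = vdupq_n_u16(0);
    for (size_t i = 0; i < nvec; ++i) {
        uint16x8_t vcmp0 = vcgtq_u16(vld1q_u16(data + 8 * i), vthr);
        /* True lanes are 0xFFFF (-1 modulo 2^16): subtracting adds 1. */
        vacc = vsubq_u16(vacc, vcmp0);
    }
    /* Horizontal reduction, done once: 8 x u16 -> 4 x u32 -> 2 x u64. */
    uint64x2_t s64 = vpaddlq_u32(vpaddlq_u16(vacc));
    return (uint32_t)(vgetq_lane_u64(s64, 0) + vgetq_lane_u64(s64, 1));
#else
    /* Scalar fallback (assumption: only for non-NEON builds). */
    uint32_t total = 0;
    for (size_t i = 0; i < 8 * nvec; ++i)
        total += data[i] > threshold;
    return total;
#endif
}
```

For longer inputs, one option is to process the data in blocks of at most 65535 vectors and add the per-block totals into a wider counter.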
