I need to compare uint16x8_t vectors in parallel and increment a local counter according to the result, for example by 8 if all elements of the vector compare as true. I implemented this algorithm:
...
register int objects = 0;
uint16x8_t vcmp0,vobj;
uint32x2_t dobj;
register uint32_t temp0;
...
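/* count set bits per byte: each byte becomes 8 (true) or 0 (false) */
vobj = vreinterpretq_u16_u8(vcntq_u8(vreinterpretq_u8_u16(vcmp0)));
/* pairwise add bytes: each 16-bit lane becomes 16 or 0 */
vobj = vpaddlq_u8(vreinterpretq_u8_u16(vobj));
/* pairwise add 16-bit lanes into four 32-bit sums */
vobj = vreinterpretq_u16_u32(vpaddlq_u16(vobj));
/* pairwise add 32-bit lanes into two 64-bit sums */
vobj = vreinterpretq_u16_u64(vpaddlq_u32(vreinterpretq_u32_u16(vobj)));
/* narrow the two 64-bit sums to 32 bits each */
dobj = vmovn_u64(vreinterpretq_u64_u16(vobj));
/* final pairwise add: lane 0 holds the total number of set bits (0..128) */
dobj = vreinterpret_u32_u64(vpaddl_u32(dobj));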
__asm__ __volatile__
(
"vmov.u32 %[temp0] , %[dobj][0] \n\t"
"add %[objects] ,%[objects], %[temp0], asr #4 \n\t"
: [dobj]"+w"(dobj), [temp0]"=r"(temp0), [objects]"+r"(objects)
:
: "memory"
);
...
The vector vcmp0 contains the results of the compare, vobj and dobj are used for the computation, and objects is the counter. I count the set bits and then sum them with pairwise adds. Is there a faster way to do this?
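To make the intended result unambiguous, here is a plain C model of what the reduction above is meant to compute. It is only a scalar sketch for checking a faster variant against, not NEON code; the names popcount8 and count_true_lanes are hypothetical, and it assumes each lane of the compare result is either 0xFFFF or 0x0000.

```c
#include <stdint.h>

/* Count the set bits in one byte (scalar stand-in for one byte lane of vcnt). */
static unsigned popcount8(uint8_t b)
{
    unsigned n = 0;
    while (b) {
        b &= (uint8_t)(b - 1); /* clear the lowest set bit */
        n++;
    }
    return n;
}

/* Hypothetical helper: mask[] models the eight lanes of vcmp0.
 * Assumes every lane is 0xFFFF (true) or 0x0000 (false).
 * Sums the set bits of all 16 bytes, then divides by 16 (the "asr #4"),
 * which gives the number of lanes that compared true. */
static int count_true_lanes(const uint16_t mask[8])
{
    unsigned bits = 0;
    for (int i = 0; i < 8; i++) {
        bits += popcount8((uint8_t)(mask[i] & 0xFF)); /* low byte  */
        bits += popcount8((uint8_t)(mask[i] >> 8));   /* high byte */
    }
    return (int)(bits >> 4);
}
```

In the real code the counter update would then correspond to objects += count_true_lanes(...), so a vector where all lanes compare true adds 8.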