NEON pack vector compare result into bitmap

Question

I have the result of comparing two floating-point vectors, as shown below. Based on that result, I need to set the corresponding bits in a byte, i.e.:

neon_gt_res = vcgtq_f32(temp1, temp2);
if(neon_gt_res[0]) array[0] |= (unsigned char)0x01;
if(neon_gt_res[1]) array[0] |= (unsigned char)0x02;
if(neon_gt_res[2]) array[0] |= (unsigned char)0x04;
if(neon_gt_res[3]) array[0] |= (unsigned char)0x08;

But writing it like this amounts to doing multiple comparisons again. How do I write this optimally with NEON C intrinsics?

On x86, this would be array[0] |= _mm_movemask_ps(cmp_gt_res);
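For reference, the intended result (the same bit layout that `_mm_movemask_ps` produces) can be pinned down with a plain C sketch. The function name `gt_bitmap_scalar` and the example values are illustrative and not part of the original question; this is only a baseline for checking vectorised versions against, not an optimised solution.

```c
#include <assert.h>
#include <stdint.h>

/* Bit i of the result is set when a[i] > b[i], for i = 0..3. */
uint8_t gt_bitmap_scalar(const float a[4], const float b[4])
{
    uint8_t mask = 0;
    for (int i = 0; i < 4; ++i) {
        /* (a[i] > b[i]) evaluates to 0 or 1, so no explicit branch is needed. */
        mask |= (uint8_t)((a[i] > b[i]) << i);
    }
    return mask;
}

/* Usage check (hypothetical values): lanes 0 and 2 compare greater. */
void gt_bitmap_scalar_example(void)
{
    unsigned char array[1] = {0x80};   /* bits already set are preserved by |= */
    const float a[4] = {2.0f, 0.0f, 2.0f, 0.0f};
    const float b[4] = {1.0f, 1.0f, 1.0f, 1.0f};
    array[0] |= gt_bitmap_scalar(a, b);
    assert(array[0] == 0x85);
}
```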

What type is `array[]`? I'm assuming it's an array of bytes, which your C sort of implies. You'd want a vector of `1 2 4 8` which you mask with the compare result. But then you'd need a shuffle to pack that result into the low 4 bytes of a NEON register, I think. I don't know NEON very well, but probably if you can do that, you'd then want to do a 32-bit load of the array, do a packed OR, and store. — Peter Cordes, Oct 04 '17 at 16:31
array[] is of type unsigned char. The compare result is of type int32x4_t, hence I cannot use that mask. Is there any other way to do this? — Lakshmi, Oct 04 '17 at 16:57
But doesn't NEON have any byte-shuffle instructions you could use to pack 1 byte from each element of the compare result? — Peter Cordes, Oct 04 '17 at 17:03
I don't know how I can do that. Each lane in that result vector is 4 bytes. Too much unpacking would cancel out the benefit of the NEON optimisation. — Lakshmi, Oct 04 '17 at 17:13
`neon_gt_result &= (appropriate_vectype){1,2,4,8}; neon_gt_result2 = vpadd_appropriate_size(neon_gt_result); neon_gt_result3 = vpadd_twice_appropriate_size(neon_gt_result2); appropriate_read_modify_write_sequence(neon_gt_result3, array);` That's two extra `vpadd()` instructions and a bitwise AND in addition to your comparison. I'd say that's pretty good, I'd be more worried about the memory interaction. — EOF, Oct 04 '17 at 19:28
Oh, I missed that these are all going into the same byte, like x86 SSE `movmskps`. I was thinking @Lakshmi wanted to update `array[0]`, `array[1]`, ... `array[3]`. I updated the title to make it more specific. — Peter Cordes, Oct 04 '17 at 21:36
Related: [SSE _mm_movemask_epi8 equivalent method for ARM NEON](https://stackoverflow.com/questions/11870910/sse-mm-movemask-epi8-equivalent-method-for-arm-neon): the 8-bit element equivalent of this, producing a bitmap of 16 single-byte elements. — Peter Cordes, Mar 26 '18 at 19:16
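The AND-then-pairwise-add idea from EOF's comment above could be sketched with NEON intrinsics roughly as follows. The function name `gt_bitmap_padd`, the float-array interface, and the scalar fallback (which only exists so the sketch also builds on non-ARM hosts) are assumptions made for illustration and do not come from the thread.

```c
#include <assert.h>
#include <stdint.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Bit i of the result is set when a[i] > b[i], for i = 0..3. */
uint8_t gt_bitmap_padd(const float a[4], const float b[4])
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    static const uint32_t weights[4] = {1, 2, 4, 8};
    /* Each lane of cmp is all-ones (true) or zero (false). */
    uint32x4_t cmp  = vcgtq_f32(vld1q_f32(a), vld1q_f32(b));
    /* Keep only the bit that belongs to each lane. */
    uint32x4_t bits = vandq_u32(cmp, vld1q_u32(weights));
    /* First pairwise add: {b0+b1, b2+b3}. */
    uint32x2_t sum2 = vpadd_u32(vget_low_u32(bits), vget_high_u32(bits));
    /* Second pairwise add: both lanes hold b0+b1+b2+b3. */
    uint32x2_t sum1 = vpadd_u32(sum2, sum2);
    return (uint8_t)vget_lane_u32(sum1, 0);
#else
    /* Scalar model of the same steps for hosts without NEON. */
    uint32_t sum = 0;
    for (int i = 0; i < 4; ++i) {
        uint32_t lane = (a[i] > b[i]) ? 0xFFFFFFFFu : 0u;
        sum += lane & (1u << i);
    }
    return (uint8_t)sum;
#endif
}

/* Usage check (hypothetical values): lanes 0 and 2 compare greater. */
void gt_bitmap_padd_example(void)
{
    unsigned char array[1] = {0};
    const float a[4] = {2.0f, 0.0f, 2.0f, 0.0f};
    const float b[4] = {1.0f, 1.0f, 1.0f, 1.0f};
    array[0] |= gt_bitmap_padd(a, b);
    assert(array[0] == 0x05);
}
```

Because the four masked values occupy distinct bits, adding them is equivalent to ORing them, which is why a horizontal add is enough here.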

Jake 'Alquimista' LEE · Answer 1 · 2017-10-17T10:56:03.187

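vmov.i32 qmask, #1
vand qres, qmask, qres              // each 32-bit lane becomes 0 or 1
vsra.u64 qres, qres, #31            // per 64-bit half: bit 0 = even lane, bit 1 = odd lane
vsli.64 dres_bottom, dres_top, #2   // insert the top half's two bits at bits 2..3

The bits you need are now in the four least significant bits of dres_bottom (the low half of qres).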

Edit: an improved version of the above:

vshr.u64 qres, qres, #31
vsli.64 dres_bot, dres_top, #2
// the four LSBs already contain the bitmap, the rest is optional:
vbic.i16 dres_bot, #0xf0
// you can now use byte 0 of dres_bot as the result.
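For readers working in C rather than assembly, the shift-and-insert sequence above might be expressed with intrinsics along these lines. This is a sketch under assumptions: the name `gt_bitmap_sli` and the float-array interface are made up for illustration, a little-endian lane layout is assumed, and the scalar branch is only a model of the same bit manipulation so the code builds on non-ARM hosts.

```c
#include <assert.h>
#include <stdint.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Bit i of the result is set when a[i] > b[i], for i = 0..3. */
uint8_t gt_bitmap_sli(const float a[4], const float b[4])
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    uint32x4_t cmp = vcgtq_f32(vld1q_f32(a), vld1q_f32(b));
    /* View as two 64-bit elements and shift right by 31:
       bit 0 = even lane's result, bit 1 = odd lane's result. */
    uint64x2_t s  = vshrq_n_u64(vreinterpretq_u64_u32(cmp), 31);
    /* Keep the low 2 bits of the bottom half, insert (top << 2) above them. */
    uint64x1_t lo = vget_low_u64(s);
    uint64x1_t hi = vget_high_u64(s);
    uint64x1_t packed = vsli_n_u64(lo, hi, 2);
    /* Bits above bit 3 may still carry copies of lane 3, so mask them off. */
    return (uint8_t)(vget_lane_u64(packed, 0) & 0x0F);
#else
    /* Scalar model of the same steps for hosts without NEON. */
    uint64_t lane[4];
    for (int i = 0; i < 4; ++i)
        lane[i] = (a[i] > b[i]) ? 0xFFFFFFFFu : 0u;
    uint64_t lo = (lane[0] | (lane[1] << 32)) >> 31;
    uint64_t hi = (lane[2] | (lane[3] << 32)) >> 31;
    uint64_t packed = (hi << 2) | (lo & 3u);
    return (uint8_t)(packed & 0x0F);
#endif
}

/* Usage check (hypothetical values): lanes 1 and 3 compare greater. */
void gt_bitmap_sli_example(void)
{
    unsigned char array[1] = {0};
    const float a[4] = {0.0f, 2.0f, 0.0f, 2.0f};
    const float b[4] = {1.0f, 1.0f, 1.0f, 1.0f};
    array[0] |= gt_bitmap_sli(a, b);
    assert(array[0] == 0x0A);
}
```

Masking with `0x0F` on the scalar result plays the role of the optional `vbic` step: the four low bits are already correct, and the mask only clears the leftover copies above them.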

edited Oct 17 '17 at 10:56

answered Oct 14 '17 at 12:49

Jake 'Alquimista' LEE


Related for packing *two* compare results into an 8-bit bitmap (if you mask them with `&1` first): https://stackoverflow.com/questions/49506114/how-to-or-all-lane-of-a-neon-vector – Peter Cordes Mar 27 '18 at 15:49
What about aarch64? In aarch64 v registers don't have overlapped d registers – daquexian Sep 10 '18 at 04:15
@daquexian in that case, you just do `and` with {1, 2, 4, 8} followed by `addv`. – Jake 'Alquimista' LEE Sep 16 '18 at 14:36
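The AArch64 suggestion in the last comment (AND with {1, 2, 4, 8}, then `addv`) could look roughly like this with intrinsics. The name `gt_bitmap_addv` and the float-array interface are hypothetical, and the scalar branch is only there so the sketch compiles on non-AArch64 hosts.

```c
#include <assert.h>
#include <stdint.h>
#if defined(__aarch64__)
#include <arm_neon.h>
#endif

/* Bit i of the result is set when a[i] > b[i], for i = 0..3. */
uint8_t gt_bitmap_addv(const float a[4], const float b[4])
{
#if defined(__aarch64__)
    static const uint32_t weights[4] = {1, 2, 4, 8};
    /* Each lane of cmp is all-ones (true) or zero (false). */
    uint32x4_t cmp  = vcgtq_f32(vld1q_f32(a), vld1q_f32(b));
    /* Keep only the bit that belongs to each lane. */
    uint32x4_t bits = vandq_u32(cmp, vld1q_u32(weights));
    /* Across-vector add (ADDV) sums the four lanes into one scalar. */
    return (uint8_t)vaddvq_u32(bits);
#else
    /* Scalar model of the same steps for other hosts. */
    uint32_t sum = 0;
    for (int i = 0; i < 4; ++i) {
        uint32_t lane = (a[i] > b[i]) ? 0xFFFFFFFFu : 0u;
        sum += lane & (1u << i);
    }
    return (uint8_t)sum;
#endif
}

/* Usage check (hypothetical values): lanes 0, 1 and 3 compare greater. */
void gt_bitmap_addv_example(void)
{
    unsigned char array[1] = {0};
    const float a[4] = {2.0f, 2.0f, 0.0f, 2.0f};
    const float b[4] = {1.0f, 1.0f, 1.0f, 1.0f};
    array[0] |= gt_bitmap_addv(a, b);
    assert(array[0] == 0x0B);
}
```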
