
I have the result of comparing two floating-point vectors, and based on that result I need to do the following:

neon_gt_res = vcgtq_f32(temp1, temp2);
if(neon_gt_res[0]) array[0] |= (unsigned char)0x01;
if(neon_gt_res[1]) array[0] |= (unsigned char)0x02;
if(neon_gt_res[2]) array[0] |= (unsigned char)0x04;
if(neon_gt_res[3]) array[0] |= (unsigned char)0x08;

But written like this, it amounts to multiple scalar comparisons again. How can I write this optimally with NEON C intrinsics?

On x86, this would be array[0] |= _mm_movemask_ps(cmp_gt_res);

Peter Cordes
Lakshmi
  • What type is `array[]`? I'm assuming it's an array of bytes, which your C sort of implies. You'd want a vector of `1 2 4 8` which you mask with the compare result. But then you'd need a shuffle to pack that result into the low 4 bytes of a NEON register, I think. I don't know NEON very well, but if you can do that, you'd then probably want to do a 32-bit load of the array, do a packed OR, and store. – Peter Cordes Oct 04 '17 at 16:31
  • array[] is of type unsigned char. The compare result is of type uint32x4_t, so I cannot use that mask directly. Is there any other way to do this? – Lakshmi Oct 04 '17 at 16:57
  • But doesn't NEON have any byte-shuffle instructions you could use to pack 1 byte from each element of the compare result? – Peter Cordes Oct 04 '17 at 17:03
  • I don't know how to do that; each lane in the result vector is 4 bytes. Too much unpacking would cancel out the benefit of the NEON optimisation. – Lakshmi Oct 04 '17 at 17:13
  • `neon_gt_result &= (appropriate_vectype){1,2,4,8}; neon_gt_result2 = vpadd_appropriate_size(neon_gt_result); neon_gt_result3 = vpadd_twice_appropriate_size(neon_gt_result2); appropriate_read_modify_write_sequence(neon_gt_result3, array);` That's two extra `vpadd()` instructions and a bitwise AND in addition to your comparison. I'd say that's pretty good, I'd be more worried about the memory interaction. – EOF Oct 04 '17 at 19:28
  • Oh, I missed that these are all going into the same byte, like x86 SSE `movmskps`. I was thinking @Lakshmi wanted to update `array[0]`, `array[1]`, ... `array[3]`. I updated the title to make it more specific. – Peter Cordes Oct 04 '17 at 21:36
  • Thanks @PeterCordes. Let me try it out. – Lakshmi Oct 05 '17 at 10:09
  • Related: [SSE _mm_movemask_epi8 equivalent method for ARM NEON](https://stackoverflow.com/questions/11870910/sse-mm-movemask-epi8-equivalent-method-for-arm-neon): the 8-bit element equivalent of this, producing a bitmap of 16 single-byte elements. – Peter Cordes Mar 26 '18 at 19:16
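The mask-and-pairwise-add idea from EOF's comment can be sketched with intrinsics roughly as follows. This is only an illustration under assumptions: the helper name `movemask_gt_padd`, the `weights` table, and the scalar fallback are not from the thread, and the inputs are assumed to be plain arrays of four floats.

```c
#include <assert.h>
#include <stdint.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper (name is not from the thread): returns a 4-bit mask
 * with bit i set when a[i] > b[i], like x86 _mm_movemask_ps on a compare. */
static inline uint8_t movemask_gt_padd(const float a[4], const float b[4])
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    static const uint32_t weights[4] = {1u, 2u, 4u, 8u};
    /* Each lane of cmp is 0xFFFFFFFF where a > b, else 0. */
    uint32x4_t cmp = vcgtq_f32(vld1q_f32(a), vld1q_f32(b));
    /* Keep only the single bit that lane i should contribute. */
    uint32x4_t bits = vandq_u32(cmp, vld1q_u32(weights));
    /* Two pairwise adds sum the four lanes; the weights are disjoint
     * bits, so the sum equals the bitwise OR. */
    uint32x2_t pair = vpadd_u32(vget_low_u32(bits), vget_high_u32(bits));
    uint32x2_t total = vpadd_u32(pair, pair);
    return (uint8_t)vget_lane_u32(total, 0);
#else
    /* Scalar fallback so the sketch also builds on non-NEON targets. */
    uint8_t mask = 0;
    for (int i = 0; i < 4; ++i)
        if (a[i] > b[i])
            mask |= (uint8_t)(1u << i);
    return mask;
#endif
}

/* Usage sketch: OR the mask into the byte, as in the question. */
static inline void example_padd_usage(void)
{
    const float t1[4] = {1.0f, 0.0f, 3.0f, 0.0f};
    const float t2[4] = {0.0f, 1.0f, 0.0f, 1.0f};
    unsigned char array[1] = {0};
    array[0] |= movemask_gt_padd(t1, t2);
    assert(array[0] == 0x05); /* lanes 0 and 2 compare greater */
}
```

In a real loop the compare result would normally already be in a register, so the two loads of `a` and `b` here exist only to make the sketch self-contained.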

1 Answer

vmov.i32 qmask, #1
vand qres, qmask, qres
vsra.u64 qres, qres, #31
vsli.64 dres_bottom, dres_top, #2

And you have the bits you need in the four least significant bits of qres.

Edit: an improved version of the above:

vshr.u64 qres, qres, #31
vsli.64 dres_bot, dres_top, #2
// the four LSBs already contain the bitmap, the rest is optional:
vbic.i16 dres_bot, #0xf0
// you can now use byte 0 of dres_bot as the result.
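Since the question asks for C intrinsics, here is one possible translation of the shift-and-insert sequence above. Treat it as a sketch under assumptions: the helper name `movemask_gt_shift` is made up for illustration, a little-endian target is assumed, the final `& 0x0f` in scalar code stands in for the optional `vbic`, and the non-NEON branch merely emulates the same bit movements in plain C.

```c
#include <assert.h>
#include <stdint.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper (name is not from the answer): bit i of the
 * result is set when a[i] > b[i]. Assumes a little-endian target. */
static inline uint8_t movemask_gt_shift(const float a[4], const float b[4])
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    /* Each 32-bit lane is 0xFFFFFFFF where a > b, else 0. */
    uint32x4_t cmp = vcgtq_f32(vld1q_f32(a), vld1q_f32(b));
    /* vshr.u64 #31: in each 64-bit half, bit 0 becomes the even lane's
     * flag and bit 1 the odd lane's flag (higher bits are leftovers). */
    uint64x2_t packed = vshrq_n_u64(vreinterpretq_u64_u32(cmp), 31);
    /* vsli.64 #2: keep the low two bits of the bottom half and insert
     * the top half shifted left by two above them. */
    uint64x1_t merged = vsli_n_u64(vget_low_u64(packed),
                                   vget_high_u64(packed), 2);
    /* Only the four LSBs are meaningful, so mask off the rest. */
    return (uint8_t)(vget_lane_u64(merged, 0) & 0x0f);
#else
    /* Plain-C emulation of the same shift and insert steps. */
    uint32_t m[4];
    for (int i = 0; i < 4; ++i)
        m[i] = (a[i] > b[i]) ? 0xFFFFFFFFu : 0u;
    uint64_t bot = (((uint64_t)m[1] << 32) | m[0]) >> 31;
    uint64_t top = (((uint64_t)m[3] << 32) | m[2]) >> 31;
    uint64_t merged = (top << 2) | (bot & 3u);
    return (uint8_t)(merged & 0x0f);
#endif
}

/* Usage sketch: OR the mask into the byte, as in the question. */
static inline void example_shift_usage(void)
{
    const float t1[4] = {0.0f, 5.0f, 0.0f, 5.0f};
    const float t2[4] = {1.0f, 1.0f, 1.0f, 1.0f};
    unsigned char array[1] = {0};
    array[0] |= movemask_gt_shift(t1, t2);
    assert(array[0] == 0x0A); /* lanes 1 and 3 compare greater */
}
```

Whether a compiler emits exactly the instructions shown in the answer for this code is not guaranteed; checking the generated assembly is the only way to be sure.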
Jake 'Alquimista' LEE