I'm interested in finding the fastest way (lowest cycle count) to compare the values stored in two NEON registers (say Q0 and Q3) for equality on a Cortex-A9 core (VFP instructions allowed).

So far I have the following:

(1) Using the VFP floating point comparison:

vcmp.f64        d0, d6
vmrs            APSR_nzcv, fpscr
vcmpeq.f64      d1, d7
vmrseq          APSR_nzcv, fpscr

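If either 64-bit half happens to hold a NaN bit pattern, this version will not work: VCMP treats a NaN operand as unordered, so even bitwise-identical values compare as not equal.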
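
The NaN failure can be reproduced off-target. Below is a minimal portable C sketch (not NEON code) that models `vcmp.f64` plus an EQ test as an IEEE 754 `==` on reinterpreted 64-bit patterns. The helper names `bits_to_double` and `vcmp_equal_model` are hypothetical, and the sketch assumes `double` is IEEE 754 binary64 and ignores FPSCR modes such as flush-to-zero.

```c
#include <stdint.h>
#include <string.h>

/* Reinterpret a 64-bit pattern as a double, the way VCMP.F64 sees a D register.
   Assumption: double is IEEE 754 binary64 on the host. */
static double bits_to_double(uint64_t bits)
{
    double d;
    memcpy(&d, &bits, sizeof d);
    return d;
}

/* Hypothetical model of "vcmp.f64 dA, dB" followed by an EQ condition test.
   Returns 1 when the floating-point comparison reports equality, else 0. */
static int vcmp_equal_model(uint64_t a, uint64_t b)
{
    return bits_to_double(a) == bits_to_double(b);
}
```

Calling `vcmp_equal_model(0xFFFFFFFFFFFFFFFFULL, 0xFFFFFFFFFFFFFFFFULL)` returns 0 even though the two patterns are identical, because all-ones is a NaN. The same model shows a second mismatch with bitwise equality: `vcmp_equal_model(0, 0x8000000000000000ULL)` returns 1, since +0.0 and -0.0 compare equal.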

(2) Using a NEON compare and narrow, followed by a single VFP comparison (NaN-safe this time):

vceq.i32        q15, q0, q3
vmovn.i32       d31, q15
vshl.s16        d31, d31, #8
vcmp.f64        d31, d29
vmrs            APSR_nzcv, fpscr

The D29 register is preloaded with the matching 16-bit pattern:

vmov.i16        d29, #65280     ; 0xff00
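
To check that the 0xff00 pattern really avoids NaNs, here is a small portable C sketch (not NEON code) that models the `vceq.i32` / `vmovn.i32` / `vshl #8` sequence lane by lane. The names `method2_mask` and `is_nan_pattern` are hypothetical, and lane 0 is assumed to sit in the least significant 16 bits of D31.

```c
#include <stdint.h>

/* Model of vceq.i32 + vmovn.i32 + vshl #8 on four 32-bit lanes:
   each resulting 16-bit lane is 0xFF00 (lanes equal) or 0x0000 (lanes differ). */
static uint64_t method2_mask(const uint32_t a[4], const uint32_t b[4])
{
    uint64_t d31 = 0;
    for (int lane = 0; lane < 4; lane++) {
        uint16_t narrowed = (a[lane] == b[lane]) ? 0xFFFFu : 0x0000u; /* vceq + vmovn */
        uint16_t shifted = (uint16_t)(narrowed << 8);                  /* vshl #8 */
        d31 |= (uint64_t)shifted << (16 * lane);
    }
    return d31;
}

/* 1 if bits encode an IEEE 754 binary64 NaN: exponent all ones, fraction nonzero. */
static int is_nan_pattern(uint64_t bits)
{
    return ((bits >> 52) & 0x7FFu) == 0x7FFu
        && (bits & 0x000FFFFFFFFFFFFFULL) != 0;
}
```

When all four lanes match, the model yields 0xFF00FF00FF00FF00, whose exponent field is 0x7F0 rather than 0x7FF, so it is an ordinary number and compares equal to the identical pattern in D29. Since the top 16-bit lane can only be 0xFF00 or 0x0000, none of the sixteen possible masks has an all-ones exponent.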

My question is: is there anything better than this? Am I overlooking some obvious way to do it?

Mircea

1 Answer

I believe you can reduce it by one instruction. By using shift left and insert (VSLI), you can combine the four 32-bit values of Q15 into four 16-bit values in D31. You can then compare with 0 and get the floating-point flags.

vceq.i32  q15, q0, q3
vsli.32   d31, d30, #16
vcmp.f64  d31, #0
vmrs      APSR_nzcv, fpscr
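
The merge performed by VSLI can be modelled in portable C. This is a minimal sketch (not NEON code) of `vsli.32 dd, dm, #shift` on one 64-bit D register; the function name `vsli32_model` is hypothetical, and lane 0 is assumed to be the least significant 32 bits.

```c
#include <stdint.h>

/* Model of "vsli.32 dd, dm, #shift" on one D register (two 32-bit lanes).
   Each lane of dm is shifted left by `shift` and inserted into dd; the low
   `shift` bits of each dd lane are preserved. Valid for 0 <= shift <= 31. */
static uint64_t vsli32_model(uint64_t dd, uint64_t dm, unsigned shift)
{
    uint64_t result = 0;
    uint32_t keep = (1u << shift) - 1u;            /* low bits of dd that survive */
    for (int lane = 0; lane < 2; lane++) {
        uint32_t d = (uint32_t)(dd >> (32 * lane));
        uint32_t m = (uint32_t)(dm >> (32 * lane));
        uint32_t merged = (m << shift) | (d & keep);
        result |= (uint64_t)merged << (32 * lane);
    }
    return result;
}
```

With a shift of 16, the destination is both read and written: the compare results held in D30 land in the upper halfword of each lane, while those already in D31 stay in the lower halfword. For example, `vsli32_model(0xFFFFFFFFFFFFFFFFULL, 0, 16)` gives 0x0000FFFF0000FFFF.
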
BitBank
  • The first instruction "overwrites" the whole Q15 (i.e. D30 and D31), while the second only has D31 as a _destination_, therefore some information is lost and the comparison will not always yield the right result. – Mircea Jan 31 '12 at 21:29
  • When you use vceq.i32, it places all 1's or all 0's into each of the 4 32-bit lanes. The first instruction combines the useful info from D30 and D31 into D31 (the lower 16 bits of all 4 compares). The second instruction compares the lower 64 bits, which HAS all of the useful info. – BitBank Jan 31 '12 at 21:38
  • The first instruction (i.e. vceq.i32) does not "combine" anything. Furthermore, the second one does not use D31 as an input... – Mircea Feb 01 '12 at 02:06
  • If you research the VSLI instruction, you will see that it does what you need. In this case, it takes the lower 16 bits of each 32-bit word from D30 & D31 (q15) and combines them into 4 16-bit halfwords in D31. This means that you have the comparison results of all 4 DWORDS, narrowed to 16 bits each and stored in D31. From there you can do the vcmp.f64 to set the flags. – BitBank Feb 01 '12 at 02:41
  • Page 1054 of the ARM Architecture Reference Manual (DDI0406C): "Vector Shift Left and Insert takes each element in the operand vector, left shifts them by an immediate value, and inserts the results in the destination vector. Bits shifted out of the left of each element are lost." The operand vector in your code is D30, and the destination vector is D31, so _nothing_ is read from D31. – Mircea Feb 01 '12 at 12:23
  • See this page (second diagram): http://blogs.arm.com/software-enablement/277-coding-for-neon-part-4-shifting-left-and-right/ – BitBank Feb 01 '12 at 14:08
  • OK, you're right, sorry for arguing. BUT the code still has the same NaN problem as my first implementation: if Q0 and Q3 are equal, D31 will end up with all bits set (and is therefore a NaN) => VCMP will not work. – Mircea Feb 01 '12 at 14:55