I want to test if two SSE registers are not both zero without destroying them.

This is the code I currently have:

uint8_t *src;  // Assume it is initialized and 16-byte aligned
__m128i xmm0, xmm1, xmm2;

xmm0 = _mm_load_si128((__m128i const*)&src[i]); // Need to preserve xmm0 & xmm1
xmm1 = _mm_load_si128((__m128i const*)&src[i+16]);
xmm2 = _mm_or_si128(xmm0, xmm1);
if (!_mm_testz_si128(xmm2, xmm2)) { // True when xmm0 and xmm1 are not both zero
}

Is this the best way (using up to SSE 4.2)?

Z boson
ChipK
  • What better can you hope for? At best you could spare the OR. –  Oct 29 '14 at 10:24
  • I would have used `_mm_movemask_epi8` instead of `_mm_testz_si128`, but actually `_mm_testz_si128` is better in general. `_mm_movemask_epi8` has a lower latency only on Nehalem and Westmere, and it's worse on Haswell. More importantly, it does not set the zero or carry flag in the FLAGS register, whereas `_mm_testz_si128` does. So what you have now is probably best. – Z boson Oct 29 '14 at 19:57
  • Actually this was the type of discussion I was looking for. I would mark it as the answer but it's a comment. – ChipK Oct 29 '14 at 23:31
  • Why does this question have two down-votes, I wonder? It looks like a pretty decent question to me. – Paul R Feb 06 '15 at 08:48
  • I wondered if PTEST could do something useful with two different unknown operands. [I posted a Q&A with my results](http://stackoverflow.com/questions/43712243/can-ptest-be-used-to-test-if-two-registers-are-both-zero-or-some-other-condition/43712244#43712244). (TL:DR, no, you have to OR your data down to one register and PTEST same,same for this kind of problem). – Peter Cordes Apr 30 '17 at 23:04

2 Answers

I learned something useful from this question. Let's first look at some scalar code:

extern void foo2(int x, int y);
void foo(int x, int y) {
    if((x || y)!=0) foo2(x,y);
}

Compile this with gcc -O3 -S -masm=intel test.c; the important assembly is

 mov       eax, edi   ; edi = x, esi = y -> copy x into eax
 or        eax, esi   ; eax = x | y and set zero flag in FLAGS if zero
 jne       .L4        ; jump not zero

Now let's look at testing SIMD registers for zero. Unlike scalar code there is no SIMD FLAGS register. However, with SSE4.1 there are SIMD test instructions which can set the zero flag (and carry flag) in the scalar FLAGS register.

extern void foo2(__m128i x, __m128i y);
void foo(__m128i x, __m128i y) {
    __m128i z = _mm_or_si128(x,y);
    if (!_mm_testz_si128(z,z)) foo2(x,y);
}

Compile with c99 -msse4.1 -O3 -masm=intel -S test_SSE.c; the important assembly is

movdqa      xmm2, xmm0 ; xmm0 = x, xmm1 = y, copy x into xmm2
por         xmm2, xmm1 ; xmm2 = x | y
ptest       xmm2, xmm2 ; set zero flag if zero
jne         .L4        ; jump not zero 

Notice that this takes one more instruction because the packed bit-wise OR does not set the zero flag. Notice also that both the scalar version and the SIMD version need to use an additional register (eax in the scalar case and xmm2 in the SIMD case). So to answer your question your current solution is the best you can do.
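For reference, the OR-then-PTEST idea can be packaged as a small helper. This is only a sketch: the names either_nonzero_sse41 and block32_nonzero are made up for illustration, and the target attribute is a GCC/Clang extension assumed here so that the file builds without passing -msse4.1 globally.

```c
#include <smmintrin.h>  /* SSE4.1: _mm_testz_si128 */
#include <stdint.h>

/* Hypothetical helper (name not from the post).
   Returns 1 if any bit is set in x or y, 0 if both are all-zero.
   x and y themselves are not modified; only the temporary z is. */
__attribute__((target("sse4.1")))
static inline int either_nonzero_sse41(__m128i x, __m128i y)
{
    __m128i z = _mm_or_si128(x, y);   /* por into a scratch register */
    return !_mm_testz_si128(z, z);    /* ptest z,z: ZF is set only when z == 0 */
}

/* Hypothetical wrapper: checks 32 bytes starting at src.
   Assumes src is 16-byte aligned, as stated in the question. */
__attribute__((target("sse4.1")))
static inline int block32_nonzero(const uint8_t *src)
{
    __m128i lo = _mm_load_si128((const __m128i *)src);
    __m128i hi = _mm_load_si128((const __m128i *)(src + 16));
    return either_nonzero_sse41(lo, hi);
}
```

Calling block32_nonzero(&src[i]) inside the loop from the question should give the same condition as the original if statement, with lo and hi still available afterwards.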

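However, if you do not have a processor with SSE4.1 or better, an alternative that only needs SSE2 is _mm_movemask_epi8. Be aware that pmovmskb collects only the most significant bit of each byte, so the version below detects a nonzero register only when some byte has its top bit set (see the comments below); a complete test first compares against zero with _mm_cmpeq_epi8.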

extern void foo2(__m128i x, __m128i y);
void foo(__m128i x, __m128i y) {
    if (_mm_movemask_epi8(_mm_or_si128(x,y))) foo2(x,y);   
}

The important assembly is

movdqa      xmm2, xmm0
por         xmm2, xmm1
pmovmskb    eax, xmm2
test        eax, eax
jne         .L4

Notice that this needs one more instruction than with the SSE4.1 ptest instruction.
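As the comments under this answer point out, pmovmskb on its own only looks at the top bit of each byte. A minimal SSE2-only sketch of the full test, following the por / pcmpeqb / pmovmskb sequence mentioned there, might look like this (the function names are hypothetical, and the sign-bit-only variant is included purely for comparison):

```c
#include <emmintrin.h>  /* SSE2 */

/* Hypothetical helper: returns 1 if any bit is set in x or y.
   Each byte of (x | y) is compared with zero, so all 8 bits of
   every byte contribute to the result. */
static inline int either_nonzero_sse2(__m128i x, __m128i y)
{
    __m128i z   = _mm_or_si128(x, y);
    __m128i eq0 = _mm_cmpeq_epi8(z, _mm_setzero_si128()); /* 0xFF where byte == 0 */
    return _mm_movemask_epi8(eq0) != 0xFFFF;  /* not every byte was zero */
}

/* For comparison: movemask without the compare only sees bit 7
   of each byte, so values such as 0x01 go undetected. */
static inline int either_signbit_set(__m128i x, __m128i y)
{
    return _mm_movemask_epi8(_mm_or_si128(x, y)) != 0;
}
```

With x set to all bytes 0x01 and y zero, the first function reports nonzero while the second does not, which is the failure mode described in the comments.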

Until now I have been using the pmovmskb instruction because its latency is better than that of ptest on pre-Sandy Bridge processors. However, I settled on that before Haswell. On Haswell the latency of pmovmskb is worse than the latency of ptest. They both have the same throughput, but in this case that is not really important. What's important (which I did not realize before) is that pmovmskb does not set the FLAGS register, so it requires another instruction. So now I'll be using ptest in my critical loop. Thank you for your question.

Edit: as suggested by the OP there is a way this can be done without using another SSE register.

extern void foo2(__m128i x, __m128i y);
void foo(__m128i x, __m128i y) {
    if (_mm_movemask_epi8(x) | _mm_movemask_epi8(y)) foo2(x,y);    
}

The relevant assembly from GCC is:

pmovmskb    eax, xmm0
pmovmskb    edx, xmm1
or          edx, eax
jne         .L4

Instead of using another xmm register this uses two scalar registers.

Note that fewer instructions does not necessarily mean better performance. Which of these solutions is best? You have to test each of them to find out.

Z boson
  • How about if (_mm_movemask_epi8(x)|_mm_movemask_epi8(y))? Wouldn't that create two pmovmskb and one 'or' instruction, for a total of 3? – ChipK Oct 31 '14 at 15:46
  • @ChipK, you're totally right (see my updated answer). Looks like you answered your own question. It is possible to do this without using another XMM register. – Z boson Nov 01 '14 at 16:12
  • The `pmovmskb` portion of this answer is bogus. You're only testing the sign bits: the result doesn't depend on bits 0..6 of any of the bytes! `por` / `ptest` / `jnz` is a pretty good choice. The other way is [the same number of uops](http://stackoverflow.com/questions/7989897/is-an-m128i-variable-zero/7991083#comment59454214_35890766). `por` / `pcmpeqb` (against all-zero) / `pmovmskb` / `test/jnz`. – Peter Cordes Mar 09 '16 at 16:17
If you use C/C++, you cannot control the individual CPU registers. If you want full control, you must use assembly.

ErmIg