5

I'm trying to use SSE instructions to do some image filtering. The image has one byte per pixel (8-bit greyscale), and I need to compare packed unsigned bytes using a greater-than comparison. I've looked in Intel's manual, and the comparison exists but only for signed bytes (PCMPGTB). How can I do this comparison for unsigned bytes? Thanks in advance.

Peter Cordes
Lautaro
    You may also be able to leverage the 'max' and 'min' operators, which are available for unsigned bytes [ but not for signed. Such is the outcome when an instruction set needs to spend dozens of bits distinguishing an instruction from legacy opcodes back to the 80's, and can only afford a few to encode the actual operation...] There are also saturating add & subtract for u8, which can sometimes be brought to bear on operations that might normally be described with unsigned comparisons. Programming SSE is definitely more interesting if you're fond of solving puzzles. – greggo Feb 02 '15 at 14:18

4 Answers

16

It is indeed not possible to make unsigned comparisons directly until AVX-512 (see footnote 1).

But you can add -128 to each value (or subtract 128, or XOR 0x80, or similar). That'll turn 0 into -128, 255 into 127, and other values into values in between; the result being that you get the correct results from the comparison.

Expanding it to words should work too, but sounds a fair bit slower, since you're getting half the work done per instruction.

_mm_cmpgt_epu8(a, b) = _mm_cmpgt_epi8(
        _mm_xor_si128(a, _mm_set1_epi8(-128)),  // range-shift unsigned to signed
        _mm_xor_si128(b, _mm_set1_epi8(-128)))

pxor can run on more execution ports than paddb on some CPUs, so it's normally the best option if you need to do this. XOR is add-without-carry, and the carry-out from adding or subtracting 0x80 goes out the top of each byte element.
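As a minimal sketch of the XOR trick in compilable C, assuming an x86 target with SSE2: the names cmpgt_epu8_xor and cmpgt_xor_mismatches are made up for this illustration and are not real intrinsics. The second function just checks the vector result against the scalar comparison for every pair of byte values.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: unsigned a > b per byte, built from the signed compare. */
static inline __m128i cmpgt_epu8_xor(__m128i a, __m128i b)
{
    const __m128i bias = _mm_set1_epi8(-128);      /* 0x80 in every byte */
    return _mm_cmpgt_epi8(_mm_xor_si128(a, bias),  /* flip the top bit: 0..255 -> -128..127 */
                          _mm_xor_si128(b, bias));
}

/* Hypothetical self-check: counts lanes that disagree with the scalar x > y. */
int cmpgt_xor_mismatches(void)
{
    int bad = 0;
    for (int x = 0; x < 256; ++x) {
        for (int y = 0; y < 256; ++y) {
            uint8_t out[16];
            __m128i m = cmpgt_epu8_xor(_mm_set1_epi8((char)x),
                                       _mm_set1_epi8((char)y));
            _mm_storeu_si128((__m128i *)out, m);
            uint8_t want = (x > y) ? 0xFF : 0x00;
            for (int i = 0; i < 16; ++i)
                bad += (out[i] != want);
        }
    }
    return bad;
}
```

If both operands are compared against many values, the biased copy of a loop-invariant operand can be computed once outside the loop.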


Footnote 1: With AVX-512BW:

There is vpcmpub, which takes a comparison predicate as an immediate, like cmpps. The _mm_cmp[eq|ge|gt|le|lt|neq]_epu8_mask intrinsics compare into a mask instead of into another vector, because that's how AVX-512 compare instructions work, e.g.
__mmask16 _mm_cmpgt_epu8_mask (__m128i a, __m128i b) in Intel's intrinsics guide

Peter Cordes
Alcaro
    This is the 'go to' method for leveraging signed compares for use as unsigned, or vice versa - definitely much better than the (currently accepted answer) approach of extending to 16 bits before comparing. But I think, given the availability of 'maxu' for bytes, mine edges it out, especially for a >= comparison as opposed to >. – greggo Feb 07 '15 at 14:32
5

The unsigned comparison (a >= b) is identical to maxu( a, b ) == a, so you can use

_mm_cmpeq_epi8( a, _mm_max_epu8(a,b))   -->   a >= b  "cmpge_epu8(a,b)"

If you need a < or > comparison, you need to invert the result, at which point Alcaro's approach may be as good (though that method needs a register to carry a constant for the inversion). But for a >= or <= comparison this is definitely better (since there's no _mm_cmple_epi8 or _mm_cmpge_epi8 to use, even after converting unsigned to signed range).
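A rough sketch of both cases in C, assuming SSE2: cmpge_epu8_max, cmpgt_epu8_max and max_compare_mismatches are illustrative names, not existing intrinsics. The strict comparison is obtained by swapping the operands and inverting with an all-ones constant, which is the extra cost mentioned above.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: unsigned a >= b, since max(a, b) == a exactly when a >= b. */
static inline __m128i cmpge_epu8_max(__m128i a, __m128i b)
{
    return _mm_cmpeq_epi8(a, _mm_max_epu8(a, b));
}

/* Hypothetical helper: unsigned a > b, computed as NOT (b >= a). */
static inline __m128i cmpgt_epu8_max(__m128i a, __m128i b)
{
    const __m128i all_ones = _mm_set1_epi8(-1);    /* constant needed for the inversion */
    return _mm_xor_si128(cmpge_epu8_max(b, a), all_ones);
}

/* Hypothetical self-check against scalar comparisons for every byte pair. */
int max_compare_mismatches(void)
{
    int bad = 0;
    for (int x = 0; x < 256; ++x) {
        for (int y = 0; y < 256; ++y) {
            uint8_t ge[16], gt[16];
            __m128i a = _mm_set1_epi8((char)x);
            __m128i b = _mm_set1_epi8((char)y);
            _mm_storeu_si128((__m128i *)ge, cmpge_epu8_max(a, b));
            _mm_storeu_si128((__m128i *)gt, cmpgt_epu8_max(a, b));
            for (int i = 0; i < 16; ++i) {
                bad += (ge[i] != ((x >= y) ? 0xFF : 0x00));
                bad += (gt[i] != ((x >  y) ? 0xFF : 0x00));
            }
        }
    }
    return bad;
}
```

When the mask only feeds a blend or a select, the inversion can often be avoided by swapping the two inputs of that later operation instead.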

greggo
2

Proposing a small, but important enhancement to @greggo’s solution: The

maxu( a, b ) == a

has a drawback: you have to back up “a” before the maxu operation, which costs an extra instruction, something like this:

movq    mmc, mma    ; save a copy of a
pmaxub  mma, mmb    ; mma = maxu(a, b)
pcmpeqb mma, mmc    ; mma = (maxu(a, b) == a)

The

minu( a, b ) == b

gives exactly the same result but preserves the operand needed for the equality check:

pminub  mma, mmb    ; mma = minu(a, b)
pcmpeqb mma, mmb    ; mma = (minu(a, b) == b)

The gain is significant: just 2 instructions instead of 3.
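The same idea with intrinsics might look like the sketch below, assuming SSE2. The names cmpge_epu8_min and min_compare_mismatches are invented for the example. With intrinsics the compiler chooses the registers and inserts any copies itself, so the saved move matters mostly in hand-written assembly; the code here only demonstrates that the min form gives the a >= b mask.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: unsigned a >= b, since min(a, b) == b exactly when b <= a. */
static inline __m128i cmpge_epu8_min(__m128i a, __m128i b)
{
    return _mm_cmpeq_epi8(_mm_min_epu8(a, b), b);
}

/* Hypothetical self-check: counts lanes that disagree with the scalar x >= y. */
int min_compare_mismatches(void)
{
    int bad = 0;
    for (int x = 0; x < 256; ++x) {
        for (int y = 0; y < 256; ++y) {
            uint8_t out[16];
            __m128i m = cmpge_epu8_min(_mm_set1_epi8((char)x),
                                       _mm_set1_epi8((char)y));
            _mm_storeu_si128((__m128i *)out, m);
            uint8_t want = (x >= y) ? 0xFF : 0x00;
            for (int i = 0; i < 16; ++i)
                bad += (out[i] != want);
        }
    }
    return bad;
}
```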

Zoltán Bíró
1

It's not possible to do a greater-than compare on packed unsigned bytes directly, so I unpacked the bytes into words (zero-extending, so each value stays positive: in effect a conversion from unsigned to signed plus an extension from byte to word) and compared them using PCMPGTW.
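For comparison, here is a sketch of what the unpack approach could look like with SSE2 intrinsics. The names cmpgt_epu8_unpack and unpack_compare_mismatches are hypothetical, and packing the word masks back to bytes is an assumption about how the result would be used.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: unsigned a > b per byte via zero-extension to 16-bit words. */
static inline __m128i cmpgt_epu8_unpack(__m128i a, __m128i b)
{
    const __m128i zero = _mm_setzero_si128();
    __m128i a_lo = _mm_unpacklo_epi8(a, zero);     /* low 8 bytes  -> 8 words */
    __m128i a_hi = _mm_unpackhi_epi8(a, zero);     /* high 8 bytes -> 8 words */
    __m128i b_lo = _mm_unpacklo_epi8(b, zero);
    __m128i b_hi = _mm_unpackhi_epi8(b, zero);
    __m128i gt_lo = _mm_cmpgt_epi16(a_lo, b_lo);   /* values are 0..255, so signed compare is safe */
    __m128i gt_hi = _mm_cmpgt_epi16(a_hi, b_hi);
    return _mm_packs_epi16(gt_lo, gt_hi);          /* 0xFFFF -> 0xFF, 0x0000 -> 0x00 */
}

/* Hypothetical self-check with different values in each lane. */
int unpack_compare_mismatches(void)
{
    int bad = 0;
    for (int x = 0; x < 256; ++x) {
        for (int y = 0; y < 256; ++y) {
            uint8_t a[16], b[16], out[16];
            for (int i = 0; i < 16; ++i) {
                a[i] = (uint8_t)(x + i);
                b[i] = (uint8_t)(y + 3 * i);
            }
            __m128i m = cmpgt_epu8_unpack(_mm_loadu_si128((const __m128i *)a),
                                          _mm_loadu_si128((const __m128i *)b));
            _mm_storeu_si128((__m128i *)out, m);
            for (int i = 0; i < 16; ++i)
                bad += (out[i] != ((a[i] > b[i]) ? 0xFF : 0x00));
        }
    }
    return bad;
}
```

This version uses four unpacks, two word compares and one pack per 16 bytes, so it does noticeably more work than a single byte-wide compare.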

Lautaro
  • As I see it, to do a cmp**_epu8, this takes 7 ops: four 'unpack', two cmp**_epi16 and then a pack, and uses far more registers than the other answers. – greggo Feb 07 '15 at 14:16