AND + CMP or SHR + CMP?

Question

I’m wondering what’ll result in overall “better” code (if speed’s equal then compactness): AND-and-CMP…

#define is_foo(someuint) (((someuint) & (unsigned int)~0x7FU) == 0x001B0080U)

… or SHR-and-CMP:

#define is_foo(someuint) (((someuint) >> 7) == (0x001B0080U >> 7))

Either way it’s going to be two “interlocked” (dependent) operations: the CMP has to wait for the preceding mask or shift. This check is very likely to be the first operation done on the value after loading it, most of the time while looping over an array of uint. Immediate operands are one thing to consider. I’d assume the first form generates an immediate 0xFFFFFF80 and the second 0x00003601. The first is probably loadable as a signed 8-bit immediate on architectures that have one (questionable); the second unfortunately has too many bits (14) for most RISC immediate-load opcodes (which tend to have 10 or 12 bits). I’m not sure whether I’ve overlooked something else relevant.
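As a sanity check on the constants guessed above, here is a minimal C sketch. It assumes a 32-bit unsigned int, and the function names (is_foo_and, is_foo_shr, count_mismatches, check_constants) are made up for illustration. Note that the two forms agree only because the low 7 bits of 0x001B0080U are zero.

```c
#include <assert.h>

#define FOO_BASE 0x001B0080U  /* constant from the question */

/* Function forms of the two macros; the names are hypothetical. */
int is_foo_and(unsigned int x) { return (x & ~0x7FU) == FOO_BASE; }
int is_foo_shr(unsigned int x) { return (x >> 7) == (FOO_BASE >> 7); }

/* Counts inputs around the 128-value matching window on which the forms disagree. */
unsigned int count_mismatches(void)
{
    unsigned int x, bad = 0;

    for (x = FOO_BASE - 256; x < FOO_BASE + 256; x++)
        if (is_foo_and(x) != is_foo_shr(x))
            bad++;
    return bad;
}

/* Checks the immediates guessed in the question (assumes 32-bit unsigned int). */
void check_constants(void)
{
    assert(~0x7FU == 0xFFFFFF80U);        /* AND mask */
    assert((FOO_BASE >> 7) == 0x3601U);   /* SHR comparison constant, 14 bits */
    assert((FOO_BASE & 0x7FU) == 0);      /* precondition for equivalence */
    assert(count_mismatches() == 0);
}
```

Calling check_constants() from a test program should pass silently on such a platform. With a comparison constant that has any of its low 7 bits set, the AND form could never match while the SHR form still would, so the two macros would no longer be interchangeable.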

If you try both options, which one results in better code and why do you think it's better? — mkrieger1, Apr 03 '23 at 15:30
Depends on the processor architecture. The 7-shift might be encoded in the opcode, whereas the AND needs an immediate operand the size of the register. — Weather Vane, Apr 03 '23 at 15:35
@mkrieger1 “try” for two dozen or so CPUs multiplied by several compilers per CPU is going to be unwieldy, plus it’d involve knowing the assembly format and opcodes for all these. • @Weather Vane I know. I was looking for a cross-processor average best. — mirabilos, Apr 03 '23 at 15:40
There's a good chance that optimizing compilers will generate the same code for both. — Barmar, Apr 03 '23 at 15:44
@Barmar: Indeed, clang compiles them both the same way across multiple architectures, doing whatever its cost model says is best on that architecture. https://godbolt.org/z/co75oETzM shows clang using `and edi, -128` on x86-64, where that's a 3-byte instruction since the `imm8` sign-extended immediate is cheap. But on AArch64, both compile to `cmp w8, w0, lsr #7` against a constant loaded into `w8` (then `cset` to materialize a boolean; I did that instead of making the compiler branch). GCC does make different asm, including for RISC-V, although neither one seems worse with these constants — Peter Cordes, Apr 03 '23 at 21:00
You might see a difference on MIPS where `0x3601` can fit in a 16-bit immediate but larger values can't. Indeed, yeah, the 2nd one is great on MIPS gcc, but clang makes them both bad. (https://godbolt.org/z/qsavYbcKz). Fortunately MIPS is basically obsolete, so I wondered if clang would make the same mistake for a constant that could be 12-bit on RV32. It does not, uses a shift for both, while GCC for RV32 actually does the AND and has to materialize `0x123<<7` with 2 instructions. — Peter Cordes, Apr 03 '23 at 21:08

score 1 · Accepted Answer · answered Apr 03 '23 at 21:28

As Barmar guessed, clang compiles them both the same way across multiple architectures, doing whatever its cost model says is best on that architecture. Godbolt (https://godbolt.org/z/co75oETzM) shows clang using `and edi, -128` on x86-64, where that's a 3-byte instruction since the constant fits in a sign-extended `imm8`. `shr edi, 7` is also 3 bytes, but can't run on as many execution ports on typical modern x86 CPUs.

But on AArch64, clang compiles both to `cmp w8, w0, lsr #7` against a constant loaded into `w8` (then `cset` to materialize a boolean; I did that instead of making the compiler branch). GCC does make different asm that basically does the source operations, leading to worse asm on AArch64 where it does

# GCC12.2 -O3 for AArch64 for the & version
        and     w0, w0, -128
        sub     w0, w0, #1769472
        subs    w0, w0, #128
        cset    w0, eq
        ret

instead of

# GCC12.2 -O3 for AArch64 for the >> version
# also clang for both versions does this
        mov     w1, 13825
        cmp     w1, w0, lsr 7    // AArch64 can use shifted source operands for some insns
        cset    w0, eq
        ret
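To see why the two listings compute the same result, the sketch below models each instruction sequence in C. The function names are hypothetical, and the explanation for the split subtraction is an interpretation rather than something GCC states: AArch64 add/sub immediates are 12 bits, optionally shifted left by 12, so 0x001B0080 cannot be subtracted in one instruction, but 0x1B0 << 12 followed by 0x80 can.

```c
#include <assert.h>
#include <stdint.h>

/* Models the GCC output for the & version, one statement per instruction. */
int gcc_and_sequence(uint32_t x)
{
    uint32_t w0 = x & (uint32_t)-128;   /* and  w0, w0, -128     */
    w0 -= 1769472u;                     /* sub  w0, w0, #1769472 */
    w0 -= 128u;                         /* subs w0, w0, #128     */
    return w0 == 0;                     /* cset w0, eq           */
}

/* Models the output for the >> version. */
int shift_sequence(uint32_t x)
{
    uint32_t w1 = 13825u;               /* mov  w1, 13825        */
    return w1 == (x >> 7);              /* cmp  w1, w0, lsr 7 ; cset w0, eq */
}

/* Checks the decimal constants against the hex ones from the question. */
void check_sequences(void)
{
    uint32_t x;

    assert(1769472u == (0x1B0u << 12));      /* 12-bit immediate, LSL #12 */
    assert(1769472u + 128u == 0x001B0080u);  /* the two subtractions combined */
    assert(13825u == (0x001B0080u >> 7));    /* 0x3601 */

    for (x = 0x001B0000u; x < 0x001B0200u; x++)
        assert(gcc_and_sequence(x) == shift_sequence(x));
}
```

The model only covers the arithmetic; it says nothing about latency or instruction count, which is where the two listings actually differ (four instructions versus three before the `ret`).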

With GCC for RISC-V, neither one seems worse with these constants, since neither `0x001B0080U` nor `0x001B0080U>>7` fit in 12 bits. The AND constant, `~0x7Fu` aka `-0x80u`, does fit in a sign-extended immediate for `andi`.

RISC-V always sign-extends immediates, unlike MIPS, which zero-extends immediates for bitwise booleans like `andi`.
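The immediate-range reasoning can be sketched as a few range checks in C. The helper names are hypothetical, and the ranges are the ones discussed in this answer: a sign-extended 8-bit immediate (x86 `imm8`), a sign-extended 12-bit immediate (RISC-V `andi`), and a zero-extended 16-bit immediate (MIPS `andi`/`xori`).

```c
#include <assert.h>
#include <stdint.h>

/* x86 sign-extended imm8: -128..127 */
int fits_simm8(int32_t v)   { return v >= -128 && v <= 127; }

/* RISC-V I-type immediate: 12 bits, sign-extended: -2048..2047 */
int fits_simm12(int32_t v)  { return v >= -2048 && v <= 2047; }

/* MIPS bitwise-logical immediate: 16 bits, zero-extended: 0..0xFFFF */
int fits_uimm16(uint32_t v) { return v <= 0xFFFFu; }

/* Applies the checks to the constants from the question. */
void check_immediates(void)
{
    /* AND mask ~0x7F, i.e. -128 as a signed value */
    assert(fits_simm8(-128));
    assert(fits_simm12(-128));
    assert(!fits_uimm16(0xFFFFFF80u));   /* zero-extension cannot produce it */

    /* shifted comparison constant 0x3601 (14 bits) */
    assert(!fits_simm12(0x3601));
    assert(fits_uimm16(0x3601u));

    /* unshifted comparison constant */
    assert(!fits_simm12(0x001B0080));
    assert(!fits_uimm16(0x001B0080u));
}
```

This matches the pattern in the compiler output: the mask is cheap wherever immediates are sign-extended, while the shifted comparison constant is cheap only where a 16-bit zero-extended immediate is available.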

You might see a difference on MIPS where your `0x3601` can fit in a 16-bit immediate but larger values can't. Indeed, yeah, the 2nd one is great on MIPS gcc (`srl` / `xori` to get a 0 or non-zero value, and `sltu` to booleanize), but clang makes them both bad. (Godbolt: https://godbolt.org/z/qsavYbcKz).

Fortunately MIPS is basically obsolete, so I wondered if clang would make the same mistake for a constant that could be 12-bit on RV32. (Also in the last Godbolt link). Clang makes good asm for RV32, using a shift for both, while GCC for RV32 actually does the AND and has to materialize `0x123<<7` with 2 instructions.

BTW, even with the smaller immediates, clang for MIPS is so bad it misses the fact that it could have used `xori` instead of `ori $2, $zero, 0x9180` / `xor reg,reg,reg`; maybe someone forgot to teach clang that `xori` zero-extends its immediate? It does know it can't use `xori` for `x ^ (0xffffffff<<7)`.

So tl;dr use shift-and-compare. I would like very much to say MIPS is not obsolete, I specifically want to consider portability across a vast range of CPUs and systems. I’ll accept this for now though. — mirabilos, Apr 05 '23 at 18:02
@mirabilos: Maybe more like "use clang", which can make the right choice for most relevant architectures, depending on the exact values of the constants and which of them can be immediates. e.g. 32-bit ARM has 8-bit immediates rotated by an even-numbered count, or something like that. But yeah, probably using a shift in the source code is better in this case. — Peter Cordes, Apr 05 '23 at 18:57
not an option; things like SUNWcc and HP aCC and DEC ucode on ULTRIX and Minix ACK and even hobbyist ones like nwcc are also valid choices here… which is why I was wondering about an overall one. — mirabilos, Apr 05 '23 at 19:16
@mirabilos: I'd have to guess that many RISC machines would prefer a shift, especially if they can't do an immediate AND with a sign-extended `-0x80u`, although your other constant is still 14 bits when its lowest set bit is shifted to the bottom. But for smaller constants yes, that could enable an immediate `cmp` or `xori`. — Peter Cordes, Apr 05 '23 at 19:22
