2

Today I discovered that GCC does some amazing magic for optimizing switches in this code:

StairsType GetStairsType(uint8_t tileId, uint8_t dlvl)
{
    if (dlvl == 0)
        return StairsType::Part;
    if (tileId == 48) {
        return dlvl >= 21 ? /* Crypt */ StairsType::Down : /* Caves */ StairsType::Part;
    }
    switch (tileId) {
    case 57: return StairsType::Down;
    // many more tile IDs go here
    }
}

Have a look at this :

https://godbolt.org/z/snY3jv8Wz

(gcc is on the left, clang is on the right)

Somehow GCC manages to compile this to 1/4 of the code compared to Clang.

  1. What does this asm do, conceptually?
  2. How does GCC do this?
ShadowRanger
  • 143,180
  • 12
  • 188
  • 271
glebm
  • 20,282
  • 8
  • 51
  • 67
  • I'm not certain what you mean by "1/4 of the code compared to Clang." The clang code is 96 bytes. (Or very close; I'm probably subtracting slightly wrong). The GCC is ~416. https://godbolt.org/z/oW3KT7ncx – Rob Napier Nov 28 '22 at 19:04
  • 2
    Clang just decided to use a jump table implementation for the `switch` instead of a sequence of conditions. This trades of some binary size but it isn't obvious (at least to me) which one is going to be faster (and that might also depend on the distribution of inputs). – user17732522 Nov 28 '22 at 19:08
  • 2
    @RobNapier - Your metrics don't appear to be including the jump table that clang produces though. glebm - It looks like clang is taking the straightforward approach of creating a jump table for the possible values in the data section and indexing into that, while GCC might be creating a bit table to determine result values instead. – owacoder Nov 28 '22 at 19:09
  • use yourself as a human debugger – 0___________ Nov 28 '22 at 19:12
  • @RobNapier I was measuring the output without "Output to binary" (13931B vs 46507B) – glebm Nov 28 '22 at 19:17
  • @user253751 Thanks! Please post this as an answer. Do you happen to know which compiler flag in GCC enables this optimization? – glebm Nov 28 '22 at 19:23
  • I edited the question to add a shortened version of the code. Ideally the question would include all of the code, and assembler output from both compilers, but I don't think it's practical. You are allowed to undo my edit, or edit it some more, if you don't like it. – user253751 Nov 28 '22 at 19:30
  • @user253751: Probably helps a lot that the `switch` applies to just 255 input values (256 since the switched on value `uint8_t`, minus one since one of them is handled before the `switch`), producing precisely five outputs; makes it a lot easier to find a mathematical equation that exhaustively covers all possibilities. – ShadowRanger Nov 28 '22 at 19:38

1 Answers1

8

Both compilers compile this part in the obvious way:

    if (dlvl == 0)
        return StairsType::Part;

When it comes to this part:

    if (tileId == 48) {
        // This ID is used by both Caves and Crypt.
        return dlvl >= 21 ? /* Crypt */ StairsType::Down : /* Caves */ StairsType::Part;
    }

gcc checks tileId==48 directly, while clang decides to merge it into the switch statement.

Both compilers decide to calculate dlvl >= 21 ? 1 : 4 branchlessly in more-or-less the same way, but with different exact instructions. GCC sneakily uses the carry flag to calculate ((dlvl >= 21 ? 0 : -1) & 3) + 1 while clang straightforwardly calculates ((dlvl < 21) * 3) + 1.

// GCC
cmp     sil, 21
sbb     eax, eax
and     eax, 3
add     eax, 1

// clang
xor     eax, eax
cmp     sil, 21
setb    al
lea     eax, [rax + 2*rax]
inc     eax

When it comes to the switch statement, clang implements it with a jump table. The lowest entry is 16 and the highest is 160, so it subtracts 16 and then checks whether the number is greater than 144. There's a well-known trick to save one check, where numbers smaller than 16 wrap around to very big numbers, so they're greater than 144. For whatever reason, it chooses to preload the number 1 (StairsType::Down) into the return value register before doing the jump.

Meanwhile on gcc, it first checks if the tile ID is between 16 and 78. If so, it subtracts 16 and checks some bitmasks:

if(tileID < 16) {
    return 0;
} else if(tileID <= 78) {
    mask = (1 << (tileID - 16));
    if(2307251517974380578 & mask) return 2;
    if(4611688220747366401 & mask) return 1;
    return ((421920768 & mask) != 0) << 2;
} else {
    tileID += 125; // The same as tileID -= 131
    // because it's being treated as a byte
    if(tileID > 29) { // values lower than 131 wrapped around
        return 0;
    }

    // notice that if we get here it means the tileID is >= 131 and <= 160
    // << only looks at the bottom 5 bits of tileID (i.e. tileID & 31)
    return -((541065279 & (1 << tileID)) != 0) & 3;
}

Let's try that again, but with the bitmasks in binary and let's figure out what those return statements return:

if(tileID < 16) {
    return StairsType::Invalid;
} else if(tileID <= 78) {
    mask = (1 << (tileID - 16));
    // tile 17, 21, 35, 36, 38, 51, 56, 64, 66, 77
    if(0010000000000101000000010000100000000000010110000000000000100010 & mask) return StairsType::Up;
    // tile 16, 39, 40, 46, 47, 57, 78
    if(0100000000000000000000100000000011000100100000000000000000000001 & mask) return StairsType::Down;
    // tile 33, 34, 37, 40, 43, 44
    return ((00011001001001100000000000000000 & mask) != 0) ? StairsType::Part : StairsType::Invalid;
} else {
    if(tileID < 131 || tileID > 160) {
        return 0;
    }

    // tile 131, 132, 133, 134, 135, 136, 153, 160
    return (00100000010000000000000000111111 & (1 << (tileID - 131))) ? StairsType::ShortcutToTown : StairsType::Invalid;
}

Seems that the compiler noticed that you grouped your tile IDs into somewhat logical groups. E.g. above 130 there are only shortcuts to town.

I have no idea how compiler writers come up with this stuff.

user253751
  • 57,427
  • 7
  • 48
  • 90