x86 decoding instruction opcode byte

Question

I'm creating an x86 decoder and I'm struggling to understand, and find an efficient way to determine, the mnemonic of an instruction.

I know that the 6 MSBs of the opcode byte are the opcode bits, but I can't find any reference that uses those 6 bits in a mnemonic table. The only mnemonic tables I can find are indexed by the whole opcode byte and not just the 6 MSBs.

What are some efficient ways to decode the mnemonics encoded in the opcode byte, and are there any table references that use the 6 MSBs rather than the whole opcode byte?

The bottom 2 bits are also part of the opcode... For example the [`jcc`](http://felixcloutier.com/x86/Jcc.html) instructions have different mnemonics for every opcode value from 0x70 to 0x7F. In fact, sometimes the `/r` field from the ModR/M byte is part of the opcode, too. (e.g. `shl` vs. `shr`). — Peter Cordes, Sep 15 '17 at 03:36
The problem with modern x86 machine code is that there *isn't* an efficient / simple way to decode it. For example, `rep nop` actually decodes as `pause`, or `rep bsf` decodes as `tzcnt` (if BMI1 is supported, otherwise it decodes as `bsf`). So you have to check for mandatory prefixes of other instructions. — Peter Cordes, Sep 15 '17 at 03:39
@PeterCordes One of the resources I was using is http://www.c-jump.com/CIS77/CPU/x86/X77_0050_add_opcode.htm I know there're exceptions as to when not the only 6 MSBs of the opcode byte represent the mnemonic but for a regular instruction it seems this way according to what they're saying. I'm asking on how for these regular cases could I use these 6 MSBs to determine the mnemonic like they did in their example. — Jorayen, Sep 15 '17 at 09:06
You mean like `const char *mnemonic = table[(uint8_t)opcode>>2];` in C? You do it like that. Although really you probably want a 256-entry table of `struct`s, where one of the members is an `enum` of what kind of instruction it is (or a function pointer to a function that will decode the rest of the bytes). — Peter Cordes, Sep 15 '17 at 09:09
@PeterCordes Yeah, but when looking at a mnemonic table online I couldn't find a way to decide which mnemonic is really used. For example, looking at this table http://sparksandflames.com/files/x86InstructionChart.html and using `const char *mnemonic = table[(uint8_t)opcode>>2];`, opcode 0x6 (Push) could easily be mistaken for 0x5 (Add) if I right-shift by 2 bits – Jorayen, Sep 15 '17 at 09:15
Then obviously you can't use a 6-bit table index. Or if you do, some entries need to disambiguate. Maybe use a special byte in the first position of a string, so the string you actually print is `mnemonic+1` for entries that don't need special handling. You need to tell the difference between `add r/m, r` and `add r, r/m` anyway, so you should probably just use 256-entry tables with a struct. Or a table of pointers to structs, anyway, if your struct is big enough to be worth extra indirection to save duplication for all the `push`/`pop`/`inc`/`dec`/`xchg eax` encodings. — Peter Cordes, Sep 15 '17 at 09:24
Thanks for the link to http://www.c-jump.com/CIS77/CPU/x86/index.html, BTW. That's a pretty nice tutorial-style intro to instruction encoding. Much more beginner-friendly than Intel's manuals. Adding a link to it in [the x86 tag wiki](https://stackoverflow.com/tags/x86/info) — Peter Cordes, Sep 15 '17 at 09:27
I differentiate between those reading the direction flag in the opcode byte. But isn't there an efficient way to store a table for the mnemonics without duplicates? Also I'm still trying to figure the whole point of these 6 bits, if I can't decide what mnemonic is used with them, why should I decode them anyway? The only reliable way to get the mnemonics for now seems to be making a table of 256 entries for each byte value with duplicated mnemonics string, I think it's really inefficient and I would like to know if there's a better way of achieving it — Jorayen, Sep 15 '17 at 09:31

Peter Cordes · Accepted Answer · 2017-09-15T11:27:41.343


> But isn't there an efficient way to store a table for the mnemonics without duplicates?

This has become an algorithms and data structures question.

As you point out, many of the opcode table entries (at least in the table for opcodes without a `0f` escape byte: http://sparksandflames.com/files/x86InstructionChart.html) repeat in groups of 4 or 2, i.e. the same 6-bit or 7-bit prefix selects the same mnemonic.

Obviously a 256-entry table of structs is simple, but duplicates things. It's very fast and easy to use, since it's probably still small enough not to cache-miss very often. (Especially since the common entries will stay hot in cache; x86 code uses the same opcodes a lot.)
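As a rough illustration of the 256-entry approach, here is a minimal sketch in C. The names `operand_kind`, `opcode_info`, `one_byte_table` and `lookup_opcode` are made up for this example, only a handful of one-byte opcodes are filled in, and 32-bit mode is assumed.

```c
#include <stdint.h>

/* Hypothetical tags for "how to decode the rest"; a real decoder needs more. */
enum operand_kind {
    OPK_UNKNOWN = 0,   /* entry not filled in by this sketch */
    OPK_MODRM,         /* ModR/M byte follows: r/m,reg or reg,r/m */
    OPK_ACC_IMM,       /* accumulator with an immediate */
    OPK_NONE,          /* no further operand bytes */
    OPK_REL8           /* 8-bit relative branch displacement */
};

struct opcode_info {
    const char *mnemonic;      /* NULL for entries this sketch leaves out */
    enum operand_kind kind;
};

/* Indexed by the whole opcode byte, so duplicated mnemonics are expected. */
static const struct opcode_info one_byte_table[256] = {
    [0x00] = { "add",  OPK_MODRM },
    [0x01] = { "add",  OPK_MODRM },
    [0x02] = { "add",  OPK_MODRM },
    [0x03] = { "add",  OPK_MODRM },
    [0x04] = { "add",  OPK_ACC_IMM },  /* add al, imm8 */
    [0x05] = { "add",  OPK_ACC_IMM },  /* add eax, imm32 */
    [0x06] = { "push", OPK_NONE },     /* push es */
    [0x07] = { "pop",  OPK_NONE },     /* pop es */
    [0x74] = { "je",   OPK_REL8 },
    [0x75] = { "jne",  OPK_REL8 },
    [0x90] = { "nop",  OPK_NONE },
};

/* One array access, no branching on the opcode bits. */
static const struct opcode_info *lookup_opcode(uint8_t opcode)
{
    return &one_byte_table[opcode];
}
```

With this layout, `lookup_opcode(0x05)` and `lookup_opcode(0x06)` give `add` and `push` directly, which is exactly the case a 6-bit index can't tell apart.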

You can trade simplicity / performance for space.

You could have a 64-entry table of structs where one member is a pointer to a secondary table to be indexed with the low 2 bits. If the pointer is `NULL`, the instruction follows the pattern of `add` / `and` / `xor` / etc., where the low 2 bits give the operand width (8-bit vs. the current operand-size) and the direction (`r/m, reg` or `reg, r/m`).
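A minimal sketch of that two-level layout, under the same assumptions (32-bit mode, only a few groups filled in). The names `group_info`, `group_table`, `decoded` and `classify` are hypothetical, not taken from any existing disassembler.

```c
#include <stddef.h>
#include <stdint.h>

struct group_info {
    const char *mnemonic;       /* used when sub == NULL (regular pattern) */
    const char *const *sub;     /* optional 4-entry table indexed by opcode & 3 */
};

/* 0x04..0x07 break the pattern: add al,imm8 / add eax,imm32 / push es / pop es */
static const char *const sub_04_07[4] = { "add", "add", "push", "pop" };

/* Indexed by opcode >> 2 (the 6 MSBs). Unlisted groups stay all-NULL. */
static const struct group_info group_table[64] = {
    [0x00 >> 2] = { "add", NULL },
    [0x04 >> 2] = { NULL,  sub_04_07 },
    [0x08 >> 2] = { "or",  NULL },
    [0x20 >> 2] = { "and", NULL },
    [0x30 >> 2] = { "xor", NULL },
};

struct decoded {
    const char *mnemonic;   /* NULL if this sketch has no entry */
    int is_regular;         /* 1 if the low 2 bits are width/direction */
    int wide;               /* bit 0: 0 = 8-bit, 1 = operand-size */
    int reg_is_dest;        /* bit 1: 0 = r/m,reg  1 = reg,r/m */
};

static struct decoded classify(uint8_t opcode)
{
    const struct group_info *g = &group_table[opcode >> 2];
    struct decoded d = { NULL, 0, 0, 0 };

    if (g->sub != NULL) {
        /* Irregular group: the low 2 bits pick the mnemonic. */
        d.mnemonic = g->sub[opcode & 3];
    } else if (g->mnemonic != NULL) {
        /* Regular group: the low 2 bits are width and direction. */
        d.mnemonic = g->mnemonic;
        d.is_regular = 1;
        d.wide = opcode & 1;
        d.reg_is_dest = (opcode >> 1) & 1;
    }
    return d;
}
```

The extra branch on `g->sub` is the price for storing each regular mnemonic once instead of four times.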

Your struct would also need entries for turning into other instructions when certain prefixes are present (e.g. `rep nop` is `pause`). Also, AVX VEX prefixes use what used to be an invalid encoding of another instruction. x86 is pretty crazy to decode if you want to do a complete job for all the current extensions.
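One way the prefix case could be handled is sketched below: collect prefixes into a bitmap, then let the table entry name an alternate mnemonic for when `F3` is present. The `PFX_*` flags, the `mnemonic_if_rep` member and `mnemonic_for` are assumptions made for this example, and only three prefixes are recognised.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical prefix flags; a full decoder tracks many more. */
enum {
    PFX_REP    = 1u << 0,   /* F3 */
    PFX_REPNE  = 1u << 1,   /* F2 */
    PFX_OPSIZE = 1u << 2    /* 66 */
};

struct opcode_info {
    const char *mnemonic;          /* normal mnemonic */
    const char *mnemonic_if_rep;   /* replacement when F3 is present, or NULL */
};

static const struct opcode_info table[256] = {
    [0x90] = { "nop", "pause" },   /* F3 90 is pause */
    [0xC3] = { "ret", NULL },      /* F3 C3 is still ret */
};

/* Skips leading prefixes, then picks the mnemonic for the opcode byte.
 * Returns NULL if the buffer ends early or the opcode isn't in the sketch. */
static const char *mnemonic_for(const uint8_t *bytes, size_t len)
{
    unsigned prefixes = 0;
    size_t i = 0;

    for (; i < len; i++) {
        if (bytes[i] == 0xF3)      prefixes |= PFX_REP;
        else if (bytes[i] == 0xF2) prefixes |= PFX_REPNE;
        else if (bytes[i] == 0x66) prefixes |= PFX_OPSIZE;
        else break;                /* first non-prefix byte is the opcode */
    }
    if (i == len)
        return NULL;

    const struct opcode_info *e = &table[bytes[i]];
    if ((prefixes & PFX_REP) && e->mnemonic_if_rep != NULL)
        return e->mnemonic_if_rep;
    return e->mnemonic;
}
```

Cases like `rep bsf` vs. `tzcnt` would fit the same shape, but they live in the `0f` table and depend on which extensions you decide to assume.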

Actually, it might be simplest (and also efficient) to just use a table of function pointers. Or a struct with a `const char *mnemonic` and an `int (*decode)(const char *mnemonic, const char *insn_bytes, unsigned prefix_bitmap)` function, so lots of opcodes can point to the same decode function but still get different mnemonics. Sometimes the function will ignore the passed mnemonic, but other times that's all it needs. You'd have a common function for decoding addressing modes that many of the decode functions would call.
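Here is a minimal sketch of that dispatch idea. It deviates from the signature above in two ways that are purely my assumptions: the bytes are `uint8_t`, and an output buffer is passed in so the result can be inspected. `opcode_entry`, `dispatch`, `decode_one` and the three decode functions are hypothetical names; 32-bit mode and no prefixes are assumed, and each function returns the instruction length in bytes.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

typedef int (*decode_fn)(const char *mnemonic, const uint8_t *insn_bytes,
                         unsigned prefix_bitmap, char *out, size_t out_size);

/* Opcodes with no operand bytes: the mnemonic is all that's needed. */
static int decode_no_operands(const char *mnemonic, const uint8_t *insn_bytes,
                              unsigned prefix_bitmap, char *out, size_t out_size)
{
    (void)insn_bytes; (void)prefix_bitmap;
    snprintf(out, out_size, "%s", mnemonic);
    return 1;
}

/* push/pop r32: the register number is the low 3 bits of the opcode. */
static int decode_reg_in_opcode(const char *mnemonic, const uint8_t *insn_bytes,
                                unsigned prefix_bitmap, char *out, size_t out_size)
{
    static const char *const reg32[8] = {
        "eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"
    };
    (void)prefix_bitmap;
    snprintf(out, out_size, "%s %s", mnemonic, reg32[insn_bytes[0] & 7]);
    return 1;
}

/* jcc rel8: one signed displacement byte follows the opcode. */
static int decode_rel8(const char *mnemonic, const uint8_t *insn_bytes,
                       unsigned prefix_bitmap, char *out, size_t out_size)
{
    (void)prefix_bitmap;
    snprintf(out, out_size, "%s %+d", mnemonic, (int)(int8_t)insn_bytes[1]);
    return 2;
}

struct opcode_entry {
    const char *mnemonic;
    decode_fn decode;       /* NULL: not handled by this sketch */
};

/* Many opcodes share one decode function but keep their own mnemonic. */
static const struct opcode_entry dispatch[256] = {
    [0x50] = { "push", decode_reg_in_opcode },
    [0x53] = { "push", decode_reg_in_opcode },
    [0x58] = { "pop",  decode_reg_in_opcode },
    [0x5B] = { "pop",  decode_reg_in_opcode },
    [0x74] = { "je",   decode_rel8 },
    [0x75] = { "jne",  decode_rel8 },
    [0x90] = { "nop",  decode_no_operands },
    [0xC3] = { "ret",  decode_no_operands },
};

/* Decodes one instruction into `out`; returns its length in bytes. */
static int decode_one(const uint8_t *bytes, char *out, size_t out_size)
{
    const struct opcode_entry *e = &dispatch[bytes[0]];
    if (e->decode == NULL) {
        snprintf(out, out_size, "(unhandled 0x%02x)", bytes[0]);
        return 1;
    }
    return e->decode(e->mnemonic, bytes, 0, out, out_size);
}
```

A decode function for the ModR/M patterns would slot in the same way and call the shared addressing-mode helper.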

This is fairly similar to how you might implement an x86 emulator that interprets, instead of doing dynamic recompilation. A common decode loop and then dispatching through function pointers.

An even more complicated data structure you might use is a radix trie aka prefix tree. See also https://en.wikipedia.org/wiki/Trie#Bitwise_tries.

This is getting into silly season, because the density is so high that a lookup table makes much more sense. (There are very few undefined opcodes.)

edited Sep 15 '17 at 11:27

answered Sep 15 '17 at 10:00

Peter Cordes


So if I aim for performance (speed-wise), would you say just storing a 256-entry table would be the best choice? Also, why would I need a struct? I thought maybe creating an enum of all mnemonics and then creating the 256-entry table as an index table into this enum. What are your thoughts? – Jorayen Sep 15 '17 at 10:29
Yeah, I'd guess that a 256 entry table would perform the best. Extra branching to choose extra decode steps is unlikely to be worth it. – Peter Cordes Sep 15 '17 at 10:32
@Jorayen: Unless you have a separate table for other special cases, you'd use a struct to hold the mnemonic and tell you how to decode the remaining bytes into operands. (e.g. `jcc`/`call` vs. `add` vs. `mul r/m32` vs. `imul r,r/m32,imm32` vs. other special cases). And for special cases where 3 more opcode bits come from the `/r` field in ModR/M, to indicate that (e.g. with a pointer to another table). Having one struct where you use most of the members on every access gives good spatial locality, so it caches well. – Peter Cordes Sep 15 '17 at 10:34
Yeah, I wanted to ask you about the extra branching, since not all opcodes seem to follow a similar pattern. There are alternate encodings for instructions that use the accumulator register, for example, or that push/pop segment registers. Take the `0x04` opcode, which is `add AL, imm8`: following the normal decoding steps, one would expect `0x04` to be an `add` to a memory location, since the `direction` flag is 0, and not an immediate-to-register add, since the MSB isn't set to 1. What are your thoughts on dealing with these? – Jorayen Sep 15 '17 at 10:37
@Jorayen: Well there's another pattern for immediate operand instructions, using the `/r` field in modr/m to pack them into only a few opcodes. I'd use a 256-entry table indexed by opcode byte (and another for opcodes that follow a `0f` escape byte). For a given opcode, first check if prefixes make it into another instruction. Then check if a function-pointer is non-NULL. If so, call it (with the instruction bytes, mnemonic and `struct*` as args, taking instruction length as return value). If not, check a flag in the struct for the /r = opcode bits pattern. – Peter Cordes Sep 15 '17 at 10:46
Actually probably using function pointers is a good way to dispatch this. All opcodes with the same pattern can share the same decode function. This is pretty similar to how you'd implement an x86 interpreter / emulator. (http://www.emulators.com/docs/nx25_nostradamus.htm is interesting reading. Darek Mihocka has been optimizing CPU emulators for a long time, and has some interesting things to say about it.) – Peter Cordes Sep 15 '17 at 10:47
So just to make sure I got you right, basically a 256 entry table with a struct holding the mnemonic enum, and a function pointer which is responsible for decoding the rest of the bytes of the instruction depending on the opcode byte. The function is responsible for reading prefixes deciding if we're dealing with another instruction or maybe AVX/VEX etc.. instruction, and keep reading other stuff like modRM, SIB, DISPL, IMM as needed? Also make more tables for the extended instructions with the escape characters. – Jorayen Sep 15 '17 at 10:58
@Jorayen: More or less. I'd look for VEX/REX prefixes in the main decode loop, but otherwise mostly hand off to a decode function for different patterns. Many of those functions would want to call a shared function that decodes addressing modes and returns how many bytes that consumed. – Peter Cordes Sep 15 '17 at 11:07
@Jorayen, there are several open-source disassemblers. You could look at them to see how they're implemented. (`objdump` in GNU binutils, `ndisasm` from NASM, and Agner Fog's `objconv` (http://agner.org/optimize/) are the first 3 I can think of.) – Peter Cordes Sep 15 '17 at 11:09
