Instruction decoding when instructions are length-variable

Question

Here are some instructions and their corresponding encodings:

55                      push   %ebp
89 e5                   mov    %esp,%ebp
83 ec 18                sub    $0x18,%esp
a1 0c 9f 04 08          mov    0x8049f0c,%eax
85 c0                   test   %eax,%eax
74 12                   je     80484b1 <frame_dummy+0x21>
b8 00 00 00 00          mov    $0x0,%eax
85 c0                   test   %eax,%eax
74 09                   je     80484b1 <frame_dummy+0x21>
c7 04 24 0c 9f 04 08    movl   $0x8049f0c,(%esp)

Today's microprocessors are often 32- or 64-bit and I guess that they usually read data from memory in 4 byte or 8 byte chunks. However, instructions can have variable length. How does a microprocessor decode these instructions and why are they not of constant length to ease implementation?

FYI: they actually commonly fetch 16 (Core2) or 32 (K10) bytes at once. — harold, Nov 20 '11 at 19:50
@harold, isn't instruction length restricted by word size of the processor (processor architecture) in which case max would be 32 or 64 bit (not byte) length on modern ones? — Denys S., Feb 10 '13 at 17:19
@DenysS. not necessarily and not in x86. That's more of a RISC feature. x86 instructions are currently limited to 15 bytes (which is an odd number and hard to actually reach), that limit used to be lower (but it was always larger than the word size). — harold, Feb 10 '13 at 17:33
@harold, ok but then word size must restrict instruction operands and register sizes, right? (what I got from wiki) otherwise what does it mean then? — Denys S., Feb 10 '13 at 17:40
@DenysS. it does, kind of, the general purpose registers are as wide as the word size (or,well, that's True Enough anyway) and the pointer size and "maximum immediate argument anywhere" still match those (though in 64bit mode the only 64bit immediate anywhere is in a simple load to a general purpose register and in a special load to/from offset if you want to count addresses as immediates, the immediates in ALU ops are stuck at 32bit). There are also 256bit vector registers these days though, and you can use 32bit registers in 16bit mode, so x86 is a terrible example of what "word size" means. — harold, Feb 10 '13 at 17:50
@harold, yeah, thanks though, I'll check other architectures. :-) — Denys S., Feb 10 '13 at 17:55
See also Paul Clayton's answer on [With variable length instructions how does the computer know the length of the instruction being fetched?](https://stackoverflow.com/q/24269368) for a more cpu-internals discussion of fetching in (aligned) chunks of 16 bytes for example, and then finding instruction boundaries in that internal buffer. And various superscalar (multiple instructions per clock cycle) issues. — Peter Cordes, Mar 21 '20 at 00:05

Mackie Messer · Answer 1 · 2011-11-21T01:03:12.463

There are very good reasons to have a fixed instruction length, implementation simplicity being the most important. That's why many processors do indeed have a fixed instruction length, like RISC processors and many early computers.

CISC instruction sets like x86 are desinged to be decoded sequentially (step by step) by microcode. (you can think of microcode as a kind of interpreter for CISC instructions) That was the state of the art in the early 80s when x86 was designed.

Nowadays this is a problem, because microcode is dead. x86 instructions are now broken into smaller µ-ops, not unlike RISC instructions. But to do so, the x86 instructions have to be decoded first. And current CPUs decode up to 4 instructions each cycle. Because there is no time to decode sequentially one instruction after another, this works simply by brute force. When a line is brought in from the instruction cache, many decoders decode the line in parallel. One instruction decoder at each possible byte offset. After the deocoding, the length of each instruction is known, and the processor decides which decoders actually provide valid instructions. This is wasteful, but very fast.

Variable instruction sizes introduce more headakes, e.g. an instruction can span two cache lines or even two pages in memory. So your observation is spot on. Today nobody would design a CISC instruction set like x86. However, some RISCs have recently introduced a second instruction size to get more compact code: MIPS16, ARM-Thumb, etc.

That description of x86 is slightly off. AMD CPUs annotate L1i cache with instruction *length* (aka boundaries) when the line is brought in, but actual fetch / decode starts with aligned 16 or 32-byte fetch blocks (depending on the microarchitecture). Caching decode results in a decoded-uop cache (Sandybridge and Zen families) only happens as execution reaches not-yet-decoded instructions, building uop-cache lines on the fly during regular decode. See https://www.realworldtech.com/sandy-bridge/ and https://agner.org/optimize/ (specifically Agner's microarch pdf) — Peter Cordes, Mar 11 '20 at 06:38

old_timer · Accepted Answer · 2011-11-21T18:41:48.317

EDIT: hoping to make it more readable.

The hardware does not look at the memory as a long list of unorganized bytes. All processors, fixed or variable word length, have a specific boot method. Usually a known address in the processors memory/address space with either an address to the first instruction of the boot code or the first instruction itself. From there and for every instruction the address of the current instruction is where to start decoding.

For an x86 for example it has to look at the first byte. Depending on the decoding of that byte it may need to read more opcode bytes. If the instruction required an address, offset or otherwise some form of immediate value those bytes are there as well. Very quickly the processor knows exactly how many bytes are in this instruction. If the decoding shows the instruction contains 5 bytes and it started at address 0x10 then the next instruction is at 0x10+5 or 0x15. This continues forever. Unconditional branches, which depending on the processor can come in various flavors, you dont assume the bytes that follow the instruction are another instruction. Branches, conditional or unconditional give you a clue where another instruction or series of instructions start in memory.

Note the X86 today definitely does not fetch one byte at a time when it decodes an instruction, sensible sized reads occur, probably 64 bits at a time, and the processor will pull the bytes out of that as needed. When reading a single byte from a modern processor the memory bus still does a full sized read and either presents all of those bits on the bus where the memory controller only pulls the bits it was after or it may go so far as to keep that data. You will see some processors where you may have two 32 bit read instructions at back to back addresses but only one 64 bit read happens on the memory interface.

I highly recommend you write a disassembler and/or an emulator. For fixed length instructions it is pretty easy, you just start at the beginning and decode as you go through memory. A fixed word length disassembler may help learning about decoding instructions, which is part of this process, but it wont help your understanding of following variable word length instructions and how to separate them without getting out of alignment.

The MSP430 is a good choice as a first disassembler. There are gnu tools asm and C, etc (and llvm for that matter). Start with assembler then C or take some pre-made binaries. They key is you have to walk the code like the processor, start with the reset vector and walk your way through. When you decode one instruction you know its length and know where the next instruction is until you hit an unconditional branch. Unless the programmer has intentionally left a trap to fool the disassembler, assume that all branches conditional or unconditional point at valid instructions. An afternoon or evening is all that is required to bang out the whole thing or at least get the concept. You dont necessarily need to completely decode the instruction, dont have to make this a full blown disassembler, only need to decode enough to determine the length of the instruction and determine if it is a branch and if so where. Being a 16 bit instruction you can, if you choose, one time build a table of all possible instruction bit combinations and what their lengths are, that may save some time. You still have to decode your way through branches.

Some folks might use recursion, instead I use a memory map showing me what bytes are the start of an instruction, what bytes/words are part of an instruction but not the first byte/word and what bytes I have not decoded yet. I start by taking the interrupt and reset vectors and using those to mark the starting point for instructions. and then go into a loop that decodes the instructions looking for more starting points. If a pass happens with no other starting points then I have finished that phase. If at any point I find an instruction starting point that lands in the middle of an instruction there is a problem that will require human intervention to solve. Disassembling old video game roms for example you are likely to see this, hand written assembler. Compiler generated instructions tend to be very clean and predicable. If I get through this with a clean memory map of instructions and what is left over, (assume data) I can make one pass knowing where instructions are and decode and print those out. What a disassembler for variable word length instructions sets can never do is find every instruction. If the instruction set has for example a jump table or otherwise some sort of runtime computed address for executing, you wont find all of those without actually executing the code.

There are a number of existing emulators and disassemblers out there, if you want to try to follow along instead of writing your own, I have a few myself http://github.com/dwelch67.

There are pros and cons for and against variable and fixed word length. Fixed has advantages sure, easy to read easy to decode, everything is nice and proper, but think about ram, the cache in particular, you can cram a lot more x86 instructions in the same cache as an ARM. On the other hand an ARM can decode much easier, far less logic, power, etc more bang for your buck. Historically memory was expensive logic was expensive, and a byte as you go was how it worked. a single byte opcode limited you to 256 instructions, so that expanded into some opcodes needing more bytes not to mention the immediates and addresses that made it variable word length anyway. Keep reverse compatibility for decades and you end up where you are now.

To add to all of this confusion ARM for example now has a variable word length instruction set. Thumb had a single variable word instruction, the branch, but you can easily decode this as fixed length. But they created thumb2 which really does resemble a variable word length instruction set. Also many/most of the processors that support the 32 bit ARM istructions also support 16 bit thumb instructions, so even with an ARM processor you cannot simply align the data by words and decode as you go, you have to use variable word length. What is worse the ARM to/from thumb transitions are decoded by executing, you normally cannot simply disassemble and figure out arm from thumb. A mixed mode compiler generated branch often involves loading a register with the address to branch to then using a bx instruction to branch to it, so the disassembler would need to look at the bx, look backwards in execution for the register used in the branch and hope you find a load there and hope that is the .text segment it is loading from.

Mysticial · Answer 3 · 2011-11-20T19:47:40.640

I won't be able to answer how exactly they are decoded, but I can answer why they are of variable length.

The reason for variable length is both due to the desire to keep code size small as well as unforeseen instruction set extensions.

Reducing Instruction Size

Some instructions (by nature), need a lot more space to encode. If all instructions were set at a sufficiently large fixed length to accommodate these, there would be a lot of wasted space in the instruction code. Variable length instructions allows instructions to be "compressed" down to a smaller size.

(Unforeseen) Instruction Set Extensions

The other reason is instruction set extensions. Originally, x86 only had 256 opcodes. (1 byte) Then there was a need to add more instructions, so they threw out one instruction and used its opcode as an escape character for new opcodes. The result is that the newer instructions were longer. But it was the only way to extend the instruction set and maintain backward compatibility.

As for how the processor decodes these, it's a complicated process. For each instruction, the processor needs to find the length and decode from there. This leads to an inherently sequential decoding process that is a common performance bottleneck.

Modern x86 processors have what is called a uop (micro-op) cache that caches the decoded instructions into something that's more manageable (and RISC-like) by the processor.

It's only sort of true that 8086 had 256 opcodes. It had no #UD illegal-instruction faults, so every byte sequence decoded as *something*, but some of them were undocumented. (And often decoded the same as another opcode with a similar bit-pattern, although some like SALC are unique and still undocumented.) So there weren't any *documented* instructions or encodings to throw out. 186 was even able to introduce some new 1-byte opcode instruction like `push imm8` and `push imm16` that didn't need a `0F` escape byte; that wasn't needed until later after 186 used up more of the coding space. — Peter Cordes, Mar 11 '20 at 06:43

DigitalRoss · Answer 4 · 2011-11-21T04:05:24.977

You have reinvented RISC

Hmm, the objection you are making to the classic x86 (see CISC) is exactly what motivated the designers of the RISC CPU architectures to create simple, aligned, fixed-size instruction set architectures.

It turns out that x86 these days does in fact translate the user-visible ISA to a more RISC-like micro-op stream that lives in an internal cache.

Good observation.

^{Notes.
1. Micro-ops are just one available technique. In the general case, as long as the decoding and alignment of instructions takes place in one or more pipeline stages, the actual time taken will not be added to the average instruction execution time. If branch prediction is working and the pipeline is kept full, the extra time it takes to decode and align the instructions is handled by logic executing in parallel with the actual instruction operations. With millions of gates available to today's designers they can dedicate a lot of logic to decoding the complex x86 ISA.
2. You mentioned the memory bus width; it turns out that the memory path is typically greater than either 32 or 64-bits, also. The architectural word size simply refers to the ALU and pointer size. The actual width of memory and cache interfaces is often 2x or 4x the architectural word size.}

+1 with a caveat: You mention RISC without mentioning CISC. Toss in some Wiki-links for the uninformed. — Dave, Nov 20 '11 at 20:00
modern RISC CPUs like ARM also [decode instructions into micro-ops](https://www.extremetech.com/computing/139393-arm-cortex-a15-explained-intels-atom-is-down-but-not-out). And most modern RISC architectures have alternative variable-length instruction formats to increase code density — phuclv, Sep 20 '18 at 01:22

Instruction decoding when instructions are length-variable

4 Answers4

You have reinvented RISC

Linked

Related