10

In architectures where not all instructions are the same length, how does the computer know how much to read for one instruction? For example, in Intel IA-32 some instructions are 4 bytes and some are 8 bytes, so how does it know whether to read 4 or 8 bytes? Is it that the first instruction read when the machine is powered on has a known size and each instruction contains the size of the next one?

didierc
Celeritas

3 Answers

13

First, the processor does not need to know how many bytes to fetch; it can fetch a convenient number of bytes sufficient to provide the targeted throughput for typical or average instruction lengths. Any extra bytes can be placed in a buffer to be used in the next group of bytes to be decoded. There are tradeoffs in the width and alignment of fetch relative to the supported width of instruction decode and even with respect to the width of later parts of the pipeline. Fetching more bytes than average can reduce the impact of variability in instruction length and of the fetch bandwidth lost to taken control flow instructions.
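As a rough sketch of that decoupling (in C, with made-up widths; the per-cycle chunk is assumed to come from the instruction cache), the point is simply that bytes the decoders did not consume are carried over to the next group:

```c
#include <stdint.h>
#include <string.h>

#define FETCH_WIDTH 16   /* bytes fetched per cycle (illustrative) */
#define BUF_SIZE    32   /* carry-over buffer between fetch and decode */

typedef struct {
    uint8_t bytes[BUF_SIZE];
    int     len;         /* valid, not-yet-decoded bytes */
} fetch_buffer;

/* Append one fetched chunk to the buffer, or stall if it would overflow. */
void fetch_cycle(fetch_buffer *fb, const uint8_t chunk[FETCH_WIDTH]) {
    if (fb->len + FETCH_WIDTH <= BUF_SIZE) {
        memcpy(fb->bytes + fb->len, chunk, FETCH_WIDTH);
        fb->len += FETCH_WIDTH;
    }
    /* else: buffer full, fetch skips this cycle */
}

/* Decode consumed some number of bytes; whatever is left stays buffered
   and becomes the start of the next group of bytes to be decoded. */
void consume(fetch_buffer *fb, int decoded_bytes) {
    memmove(fb->bytes, fb->bytes + decoded_bytes, fb->len - decoded_bytes);
    fb->len -= decoded_bytes;
}
```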

(Taken control flow instructions may introduce a fetch bubble if the [predicted] target is not available until a cycle after the next fetch and reduce effective fetch bandwidth with targets that are less aligned than the instruction fetch. E.g., if instruction fetch is 16-byte aligned—as is common for high performance x86—a taken branch that targets the 16th [last] byte in a chunk will result in effectively only one byte of code being fetched as the other 15 bytes are discarded.)

Even for fixed-length instructions, fetching multiple instructions per cycle introduces similar issues. Some implementations (e.g., MIPS R10000) would fetch as many instructions as could be decoded even if they are not aligned, as long as the group of instructions does not cross a cache line boundary. (I seem to recall that one RISC implementation used two banks of Icache tags to allow fetch to cross a cache block—but not page—boundary.) Other implementations (e.g., POWER4) would fetch aligned chunks of code even for a branch targeting the last instruction in such a chunk. (For POWER4, 32-byte chunks containing 8 instructions were used, but at most five instructions could pass decode per cycle. This excess fetch width could be exploited to save energy via cycles where no fetch is performed and to give spare Icache cycles for cache block filling after a miss while having only one read/write port to the Icache.)

For decoding multiple instructions per cycle, there are effectively two strategies: speculatively decode in parallel, or wait for the lengths to be determined and use that information to parse the instruction stream into separate instructions. For an ISA like IBM's zArchitecture (S/360 descendant), the length in 16-bit parcels is trivially determined by two bits in the first parcel, so waiting for the lengths to be determined makes more sense. (RISC-V's slightly more complex length indication mechanism would still be friendly to non-speculative decode.) For an encoding like that of microMIPS or Thumb-2, which has only two lengths determinable from the major opcode and in which instructions of different lengths are encoded substantially differently, non-speculative decode may be preferred, especially given the likely narrow decode width and emphasis on energy efficiency, though with only two lengths some speculation may be reasonable at small decode width.
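To make the "trivially determined" part concrete, here is a small sketch of both length rules in C (my own illustration based on the published encodings, not code from either vendor):

```c
#include <stdint.h>

/* zArchitecture: the two high-order bits of the first byte give the length
   in halfwords: 00 -> 2 bytes, 01/10 -> 4 bytes, 11 -> 6 bytes. */
int zarch_insn_length(uint8_t first_byte) {
    unsigned ilc = first_byte >> 6;
    return (ilc == 0) ? 2 : (ilc == 3) ? 6 : 4;
}

/* RISC-V: length in bytes from the first 16-bit parcel (the reserved
   longer-than-64-bit formats are omitted here). */
int riscv_insn_length(uint16_t parcel) {
    if ((parcel & 0x03) != 0x03) return 2;   /* compressed 16-bit */
    if ((parcel & 0x1c) != 0x1c) return 4;   /* standard 32-bit */
    if ((parcel & 0x3f) == 0x1f) return 6;   /* 48-bit */
    if ((parcel & 0x7f) == 0x3f) return 8;   /* 64-bit */
    return -1;                               /* reserved */
}
```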

For x86, one strategy used by AMD to avoid excessive decode energy use is to use marker bits in the instruction cache indicating which byte ends an instruction. With such marker bits, it is simple to find the start of each instruction. This technique has the disadvantage that it adds to the latency of an instruction cache miss (the instructions must be predecoded) and it still requires the decoders to check that the lengths are correct (e.g., in case a jump is made into what was previously the middle of an instruction).
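As a sketch of how such marker bits might be consumed (the storage layout here is purely illustrative, not AMD's actual predecode format), finding instruction starts becomes a simple scan for end-of-instruction markers:

```c
#include <stdint.h>

#define LINE_BYTES 64

typedef struct {
    uint8_t bytes[LINE_BYTES];       /* the cached instruction bytes */
    uint8_t ends_insn[LINE_BYTES];   /* predecode bit: byte ends an instruction */
} icache_line;

/* Record the starting offsets of instructions from `start` onward.
   Returns how many starts were found (at most `max`). */
int find_starts(const icache_line *line, int start, int starts[], int max) {
    int n = 0, pos = start;
    while (pos < LINE_BYTES && n < max) {
        starts[n++] = pos;
        while (pos < LINE_BYTES && !line->ends_insn[pos])
            pos++;                   /* skip to the last byte of this instruction */
        pos++;                       /* next instruction starts on the next byte */
    }
    return n;
}
```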

Intel seems to prefer the speculative parallel decode approach. Since the length of a previous instruction in a chunk to be decoded will be available after only modest delay, the second and later decoders may not need to fully decode the instruction for all starting points.

Since x86 instructions can be relatively complex, there are also often decode template constraints and at least one earlier design restricted the number of prefixes that could be used while maintaining full decode bandwidth. E.g., Haswell limits the second through fourth instructions decoded to producing only one µop while the first instruction can decode into up to four µops (with longer µop sequences using a microcode engine). Basically, this is an optimization for the common case (relatively simple instructions) at the expense of the less common case.
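Illustratively, a template constraint of that kind amounts to a grouping rule like the one below (the numbers follow the Haswell example above; the code is only a guess at the flavor of the logic, not Intel's implementation):

```c
/* Assign consecutive instructions to one cycle's four decoders:
   decoder 0 accepts an instruction producing up to 4 uops (anything longer
   goes to the microcode engine, not modeled here); decoders 1-3 accept only
   single-uop instructions. Returns how many instructions decode this cycle;
   the rest wait and restart at decoder 0 next cycle. */
int decode_group(const int uops_per_insn[], int count) {
    int accepted = 0;
    for (int slot = 0; slot < 4 && accepted < count; slot++) {
        int uops = uops_per_insn[accepted];
        int fits = (slot == 0) ? (uops <= 4) : (uops == 1);
        if (!fits)
            break;
        accepted++;
    }
    return accepted;
}
```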

In more recent performance-oriented x86 designs, Intel has used a µop cache which stores instructions in decoded format avoiding template and fetch width constraints and reducing energy use associated with decoding.

8

The first bytes of each instruction indicate its length. If things were simple, the first byte would indicate the length, but there are prefixes that indicate that the next byte is the real instruction, in addition to variable-length suffixes that contain instruction operands.

The real question is, since a modern Out-Of-Order processor decodes 3 or 4 instructions each cycle, how does it know where the 2nd, 3rd, … instructions start?

The answer is that it decodes all possible starting points in the current 16-byte line of code in parallel, brute-force style. I am pretty sure that the source for this remark/guess is Agner Fog, but I can't find the reference. I googled for “Agner Fog instruction decoding suspect” but apparently he spends his time suspecting things in relation to instruction decoding.
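In pseudo-hardware terms, the brute-force approach amounts to something like this sketch (purely illustrative; `length_at` stands in for a full x86 length decoder and is assumed to always return at least 1):

```c
#include <stdint.h>

#define WINDOW 16   /* the 16-byte line of code mentioned above */

typedef int (*length_fn)(const uint8_t *window, int off);

/* Speculatively compute an instruction length at every possible starting
   offset (done in parallel in hardware), then walk the chain of real
   starting points from offset 0; every other tentative decode is discarded. */
int pick_real_starts(const uint8_t window[WINDOW], length_fn length_at,
                     int starts[WINDOW]) {
    int len_at[WINDOW];
    for (int off = 0; off < WINDOW; off++)
        len_at[off] = length_at(window, off);

    int n = 0;
    for (int pos = 0; pos < WINDOW; pos += len_at[pos])
        starts[n++] = pos;   /* the last instruction may spill into the next line */
    return n;
}
```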

Pascal Cuoq
  • Do all PC architectures do this by decoding the first byte? Is there a standard? I find this hard to believe considering the size of a byte technically isn't a standard. – Celeritas Jun 17 '14 at 17:04
  • 2
    @Celeritas What do you mean by “PC”? (and “standard”? and “technically”?) The IA-32 and x86-64 are pretty well standardized. Other architectures do things differently, but if you are going to have a non-IA32-non-x86-64 architecture, you might as well have a reasonable one with fixed-length instructions of, say, 32-bit each. – Pascal Cuoq Jun 17 '14 at 17:06
5

Expanding on Pascal's answer, on the x86 architecture the very first byte indicates which category the instruction being decoded belongs to:

  • a 1-byte instruction, which means that it has already been read in full and can be processed further,

  • a 1-byte opcode followed by a few more bytes (the so-called ModRM and SIB bytes) that indicate which operands follow (registers, memory addresses) and how they are addressed,

  • an instruction prefix, which either:

    • modifies the meaning of the instruction (repetition - REP, locking semantics - LOCK), or
    • indicates that the next bytes encode an instruction introduced in later iterations of the original 8086 CPU, either to extend the size of its operands to 32 or 64 bits, or to redefine the opcode meaning completely.

Furthermore, depending on the mode the CPU runs in, some prefixes may or may not be valid: for instance, the REX and VEX prefixes were introduced to implement 64-bit and vector instructions respectively, but REX is interpreted as a prefix in 64-bit mode only. Because of its format, REX takes over byte values that encoded a number of instructions in the original instruction set, which therefore can no longer be used in 64-bit mode (I suppose the VEX prefix works similarly, though I don't know much about it). Its fields indicate the operand size of the following instruction, or select the extra registers only available in 64-bit mode (R8 to R15 and XMM8 to XMM15).

If you study the internal patterns of the opcodes, you'll notice that certain bits consistently indicate which category the instruction belongs to, which makes for reasonably fast decoding.
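To give a flavor of how those first bytes drive the process, here is a deliberately tiny 32-bit-mode length decoder covering only a handful of prefixes and opcodes (it ignores two-byte opcode maps, the effect of the operand-size prefix on immediates, and much else; a real decoder is far larger):

```c
#include <stdint.h>

/* Legacy prefixes recognized by this sketch. */
static int is_prefix(uint8_t b) {
    return b == 0x66 || b == 0x67 ||               /* operand/address size */
           b == 0xF0 || b == 0xF2 || b == 0xF3;    /* LOCK, REPNE, REP */
}

/* Bytes taken by ModRM + optional SIB + displacement in 32-bit mode. */
static int modrm_length(const uint8_t *p) {
    uint8_t mod = p[0] >> 6, rm = p[0] & 7;
    int len = 1;                                   /* the ModRM byte itself */
    int has_sib = (mod != 3 && rm == 4);
    if (has_sib) len += 1;
    if (mod == 1)      len += 1;                   /* disp8 */
    else if (mod == 2) len += 4;                   /* disp32 */
    else if (mod == 0) {
        if (rm == 5) len += 4;                     /* disp32, no base register */
        else if (has_sib && (p[1] & 7) == 5) len += 4;  /* SIB with no base */
    }
    return len;
}

/* Total instruction length, or -1 for anything this sketch does not cover. */
int insn_length(const uint8_t *code) {
    int len = 0;
    while (is_prefix(code[len])) len++;            /* prefixes come first */
    uint8_t op = code[len++];                      /* primary opcode byte */

    if (op == 0x90 || op == 0xC3)                  /* NOP, RET: opcode only */
        return len;
    if (op >= 0xB8 && op <= 0xBF)                  /* MOV r32, imm32 */
        return len + 4;
    if (op == 0x01 || op == 0x89 || op == 0x8B)    /* ADD/MOV r/m32 forms */
        return len + modrm_length(code + len);
    return -1;
}
```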

VAX is another architecture (popular from the end of the 1970s up to the late 1980s) which sported variable-length instructions, based on similar principles. For its first iterations, instructions were probably decoded sequentially, so the end of one instruction indicated the start of the next on the following byte. As you may know, the company which made these (DEC) also produced its polar opposite, the RISC Alpha CPU, which became one of the (if not the) fastest CPUs of its time, with fixed-length instructions, a choice certainly made in reaction to the requirements of the pipelined, superscalar techniques burgeoning at the time.

didierc