It doesn't have to predict. Logically, it decodes one byte at a time until it has seen a complete instruction (or dword chunks for disp32 or imm32, or other multi-byte parts of an instruction whose presence is implied by an earlier byte). The length of an instruction is determined by its prefixes plus the opcode, ModRM, and SIB bytes; after looking at those, the CPU knows for sure how many more instruction bytes it needs to fetch.
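To make that concrete, here's a minimal sketch in C of that length-determination logic for a tiny subset of 32-bit-mode encodings (assumptions: no prefixes, no 0F escape opcodes; `insn_length` is just an illustrative name, nowhere near a complete decoder):

```c
#include <stddef.h>
#include <stdint.h>

/* Toy length pre-decode for a tiny subset of 32-bit x86 (no prefixes, no
 * 0F escape opcodes).  A real decoder handles far more cases, but the shape
 * is the same: the opcode says whether a ModRM byte follows, ModRM says
 * whether a SIB byte and/or a displacement follow, and the opcode also
 * implies any immediate.  Returns 0 for bytes this sketch doesn't know. */
static size_t insn_length(const uint8_t *p)
{
    uint8_t op = p[0];

    if (op == 0x90)                         /* nop */
        return 1;
    if (op == 0xEB)                         /* jmp rel8: length is fixed even though
                                               the bytes after it may never execute */
        return 2;
    if (op >= 0xB8 && op <= 0xBF)           /* mov r32, imm32: opcode + imm32 */
        return 1 + 4;
    if (op == 0x89 || op == 0x8B) {         /* mov r/m32, r32  or  mov r32, r/m32 */
        uint8_t modrm = p[1];
        uint8_t mod = modrm >> 6, rm = modrm & 7;
        size_t len = 2;                     /* opcode + ModRM */
        if (mod != 3 && rm == 4) {          /* memory operand with a SIB byte */
            uint8_t sib = p[len++];
            if (mod == 0 && (sib & 7) == 5)
                len += 4;                   /* no base register: disp32 follows */
        } else if (mod == 0 && rm == 5) {
            len += 4;                       /* [disp32] absolute */
        }
        if (mod == 1) len += 1;             /* disp8 */
        if (mod == 2) len += 4;             /* disp32 */
        return len;
    }
    return 0;                               /* not handled by this toy decoder */
}
```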
But a real CPU only has to give the illusion of doing that: physically there's no problem speculatively fetching and looking at later bytes, as long as it eventually does the right thing if they turn out not to be part of an instruction that should execute.
e.g. the L1 instruction cache uses 64-byte lines, so execution logically reaching any byte means the whole 64-byte chunk of memory will be in the I-cache, even if that same line is also in the L1 D-cache because you did some data-load instructions on other bytes in it.
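For instance, nothing stops a program from loading its own machine-code bytes as data (a sketch, assuming readable code pages as on normal x86 user space, and a mainstream compiler that tolerates casting a function pointer to a data pointer):

```c
#include <stdio.h>

static int foo(int x) { return x + 1; }

int main(void)
{
    int r = foo(41);    /* executing foo pulls its code bytes into L1i */

    /* Loading the same bytes as data pulls a copy of that 64-byte line
     * into L1d too.  Both uses are reads here, so the two copies never
     * disagree.  The cast isn't strictly portable C, but works on
     * mainstream x86 toolchains. */
    const unsigned char *code = (const unsigned char *)(void *)foo;
    printf("foo(41) = %d, first opcode byte of foo: 0x%02x\n", r, code[0]);
    return 0;
}
```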
And fetch from the L1I cache isn't single bytes at a time either, of course. On modern x86, decode looks at blocks of 16 or 32 bytes to find instruction boundaries. e.g. let's look at the P6 family, which doesn't have a uop cache, so it always fetches/decodes from L1I.
From Agner Fog's microarch PDF, in the PPro/PII/PIII section:
> 6.2 Instruction fetch
>
> Instruction codes are fetched from the code cache in aligned 16-byte chunks into a double buffer that can hold two 16-byte chunks. The purpose of the double buffer is to make it possible to decode an instruction that crosses a 16-byte boundary (i.e. an address divisible by 16). The code is passed on from the double buffer to the decoders in blocks which I will call IFETCH blocks (instruction fetch blocks). The IFETCH blocks are up to 16 bytes long. In most cases, the instruction fetch unit makes each IFETCH block start at an instruction boundary rather than a 16-byte boundary. However, the instruction fetch unit needs information from the instruction length decoder in order to know where the instruction boundaries are. If this information is not available in time then it may start an IFETCH block at a 16-byte boundary. This complication will be discussed in more detail below.
The pre-decode pipeline stage then finds instruction boundaries (assuming they're all valid instructions), and then (following the predictions made by the branch-prediction unit) the machine code for up to 3 instructions is sent to the 3 decoders in parallel. (Core 2 widened that to 4 decoders, and Skylake to 5 decoders, even though the pipeline width stays at 4 uops wide.)
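Purely as a mental model, pre-decode over one 16-byte chunk could be pictured like this, reusing the toy `insn_length()` sketch from above (`mark_boundaries` is another made-up name; real hardware uses parallel length-decode logic, not a sequential loop):

```c
#include <stdio.h>

/* Illustrative only: conceptually pre-decode one aligned 16-byte chunk by
 * marking where each instruction starts, reusing insn_length() from the
 * earlier sketch.  `buf` should also contain the following chunk, which
 * plays the role of the double buffer: an instruction starting near the
 * end of this chunk may need bytes from the next one. */
static void mark_boundaries(const uint8_t *buf)
{
    size_t pos = 0;
    while (pos < 16) {                      /* only starts within this chunk */
        size_t len = insn_length(&buf[pos]);
        if (len == 0)                       /* byte pattern this sketch can't decode */
            break;
        printf("instruction starts at offset %2zu, length %zu\n", pos, len);
        pos += len;                         /* may step past offset 16, i.e. an
                                               instruction crossing into the next chunk */
    }
}
```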
If there's an illegal instruction in there somewhere (or an unconditional jmp, or a jcc that happens to be taken), then later "instructions" are meaningless and get discarded upon discovery of that fact.
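To illustrate with some made-up bytes (32-bit mode), a single fetch chunk might contain:

```c
/* Hypothetical 16-byte chunk.  Pre-decode may mark boundaries all the way
 * across it, but the unconditional jmp means nothing at or past offset 7
 * is ever architecturally executed, so whatever those bytes "decode" as is
 * simply thrown away once the jump is discovered. */
static const uint8_t chunk[16] = {
    0xB8, 0x01, 0x00, 0x00, 0x00,   /* mov eax, 1                (offsets 0-4)  */
    0xEB, 0x09,                     /* jmp short, rel8=+9 -> offset 16  (5-6)   */
    0x0F, 0x0B,                     /* ud2: would fault if executed, never is   */
    0xDE, 0xAD, 0xBE, 0xEF,         /* arbitrary data bytes, not real code      */
    0x90, 0x90, 0x90                /* filler                                   */
};
```

The front end may well look at (and even pre-decode) the ud2 and the data bytes, but since the jmp is unconditionally taken, none of that ever executes or retires.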
https://www.realworldtech.com/nehalem/5/ talks about the decode stages in Nehalem, the last generation of P6-family microarchitectures. But Agner Fog's description is probably more useful for understanding how the CPU can look at a bunch of bytes and then end up only using the ones that should logically be executed as instructions.