BTW, operand data embedded right into an instruction is called "immediate" data.
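For example, in NASM syntax (comments show the resulting machine-code bytes; `bits 16` so these match the 8086 encodings):

```nasm
bits 16
mov  ax, 0x1234    ; B8 34 12  opcode, then the imm16 stored little-endian
add  bx, 8         ; 83 C3 08  small immediates can use a sign-extended imm8 form
```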
That's not how modern CPUs fetch code, but having a data bus narrower than the longest instruction is not actually a problem.
The 8086, for example, had to deal with instruction encodings wider than its 16-bit data bus, without any L1 cache to hide that effect.
As I understand it, 8086 just keeps reading words (16 bits) into a decode buffer until the decoder sees a whole instruction at once. If there's a leftover byte, it's moved to the front of the decode buffer. Fetch of the next insn actually happens in parallel with execution of the just-decoded one, but code fetch was still the major bottleneck on 8086.
So the CPU just needs a buffer as large as the largest allowed instruction (excluding prefixes). That's 6 bytes for 8086, and this is exactly the size of 8086's prefetch buffer.
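For illustration, one of the maximal 6-byte encodings is a 16-bit immediate store through a 16-bit displacement (NASM syntax; encoding in the comment):

```nasm
bits 16
; opcode + ModRM + disp16 + imm16 = 1+1+2+2 = 6 bytes: fills the whole queue
mov  word [bx+0x1234], 0x5678   ; C7 87 34 12 78 56
```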
The "until the decoder sees a whole instruction" is a simplification: 8086 decodes prefixes separately, and "remembers" them as modifiers. 8086 lacks the 15-byte max total insn length limitation of later CPUs, so you could fill a 64k CS segment with repeated prefixes on one instruction).
Modern CPUs (like the Intel P6 and SnB families) fetch code from L1 I-cache in chunks of at least 16 bytes, and actually decode multiple instructions in parallel. @Harold's answer nicely covers the rest of your question.
See also Agner Fog's microarch guide, and other links from the x86 tag wiki to learn more about how modern x86 CPUs work, in detail.
Also, David Kanter's Sandy Bridge writeup has details of the front-end for that microarchitecture family.
