Let's recall the classic RISC pipeline, which is the one usually studied: http://en.wikipedia.org/wiki/Classic_RISC_pipeline. Here are its stages:
- IF = Instruction Fetch
- ID = Instruction Decode
- EX = Execute
- MEM = Memory access
- WB = Register write back
In RISC, only loads and stores can work with memory. For a memory access instruction, the EX stage computes the address in memory (it takes the address from the register file, then scales it or adds an offset). The address is then passed to the MEM stage.
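As a rough illustration of what the EX stage does for a load or store, here is a minimal sketch of the address computation. The function name `ex_stage_address`, the register names and the values are made up for this example; real hardware does this in the ALU, not in software.

```python
def ex_stage_address(regs, base, offset=0, index=None, scale=1):
    """Hypothetical model of EX-stage address computation for a load/store."""
    addr = regs[base] + offset          # base register plus constant offset
    if index is not None:
        addr += regs[index] * scale     # optional scaled index register
    return addr & 0xFFFFFFFF            # wrap to 32 bits, as a 32-bit ALU would


regs = {"r1": 0x1000, "r2": 3}          # illustrative register file contents
print(hex(ex_stage_address(regs, "r1", offset=8)))             # 0x1008
print(hex(ex_stage_address(regs, "r1", index="r2", scale=4)))  # 0x100c
```

Whatever value comes out of this step is all the MEM stage needs; it does no arithmetic of its own.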
Your example, mov (%eax), %ebx, is actually a load from memory without any additional computation, so it can be represented even in the RISC pipeline:
- IF: get the instruction from instruction memory;
- ID: decode the instruction, pass the "eax" register to the ALU as an operand, and remember "ebx" as the output for WB (in the control unit);
- EX: compute "eax+0" in the ALU and pass the result to the next stage, MEM, as the address in memory;
- MEM: take the address from the EX stage (from the ALU), go to memory and read the value (this stage can take several ticks to reach memory, blocking the pipeline meanwhile), then pass the value to WB;
- WB: take the value from MEM and write it back to the register file. The control unit should set the register file into the mode "Writing" + "EBX selected".
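The walk through the five stages can be sketched in software. This is only a toy model under simple assumptions: the register file and memory are plain dictionaries, the instruction is a ready-made tuple instead of real machine code, and `run_load` is a name invented for this example.

```python
def run_load(regs, memory, src="eax", dst="ebx"):
    """Walk `mov (%src), %dst` through IF/ID/EX/MEM/WB; return a per-stage trace."""
    trace = []
    # IF: fetch the instruction (modelled here as a ready-made tuple).
    instr = ("load", src, dst)
    trace.append(("IF", instr))
    # ID: decode; read the address register, remember the destination for WB.
    _, src_reg, dst_reg = instr
    operand = regs[src_reg]
    trace.append(("ID", operand))
    # EX: the ALU computes operand + 0, which becomes the memory address.
    address = operand + 0
    trace.append(("EX", address))
    # MEM: read the value at that address (may take several ticks in hardware).
    value = memory[address]
    trace.append(("MEM", value))
    # WB: write the loaded value into the destination register.
    regs[dst_reg] = value
    trace.append(("WB", dst_reg))
    return trace


regs = {"eax": 0x2000, "ebx": 0}        # illustrative values
memory = {0x2000: 42}
for stage, data in run_load(regs, memory):
    print(stage, data)
print(regs["ebx"])  # 42
```

Each stage only consumes what the previous one produced, which is why this instruction fits the classic pipeline without trouble.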
The situation is more complex for a true CISC instruction, e.g. add (%eax), %ebx (load word T from memory at [%eax], then store T+%ebx to %ebx). This instruction needs both an address computation and an addition in the ALU, and the addition can only happen after the memory access. This can't be easily represented in the simplest RISC (MIPS) pipelines, where EX comes before MEM.
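To make the problem concrete, here is a small sketch of what add (%eax), %ebx has to do, recording which unit each step needs. The function `add_mem_to_reg` and the step labels are invented for this example; it models the semantics only, not any real pipeline.

```python
def add_mem_to_reg(regs, memory, addr_reg="eax", dst="ebx"):
    """Model `add (%addr_reg), %dst` and record which unit each step needs."""
    steps = []
    address = regs[addr_reg] + 0                # ALU use 1: address computation
    steps.append("ALU: address")
    t = memory[address]                         # memory access: load word T
    steps.append("MEM: load T")
    regs[dst] = (t + regs[dst]) & 0xFFFFFFFF    # ALU use 2: T + %dst, after the load
    steps.append("ALU: add")
    return steps


regs = {"eax": 0x2000, "ebx": 5}                # illustrative values
memory = {0x2000: 37}
print(add_mem_to_reg(regs, memory))  # ['ALU: address', 'MEM: load T', 'ALU: add']
print(regs["ebx"])                   # 42
```

The ALU is needed both before and after the memory access, while the five-stage pipeline offers only one EX slot, placed before MEM.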
The first x86 CPU (8086) was not pipelined; it executed only a single instruction at any moment. But starting with the 80386 there is a pipeline with 6 stages, which is more complex than the RISC one. There is a presentation about this pipeline that compares it with MIPS: http://www.academic.marist.edu/~jzbv/architecture/Projects/projects2004/INTEL%20X86%20PIPELINING.ppt
Slide 17 says:
- Intel combines the MEM and EX stages to avoid loads and stalls, but does create stalls for address computation
- All stages in MIPS take one cycle, whereas Intel may take more than one for certain stages. This creates asymmetric performance
In my example, add will be executed in that combined "MEM+EX" stage for several CPU ticks, generating many stalls.
Modern x86 CPUs have very long pipelines (16 stages is typical), and they are RISC-like CPUs internally. The decoder stages (3 stages or more) break most complex x86 instructions into a series of internal RISC-like micro-operations. Typically an instruction becomes 2-3 micro-operations, though with the help of microcode a single instruction can sometimes generate up to 450. For complex ALU/MEM operations, there will be a micro-op for the address computation, then a micro-op for the memory load, and then a micro-op for the ALU action. The micro-operations have dependencies between them and are scheduled onto different execution ports.
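The three-step split described above can be sketched as a tiny dependency model. Everything here is an assumption made for illustration: the `MicroOp` class, the port labels, the micro-op names and the latencies (1 cycle, with 3 for the load) do not describe any particular CPU, and real cores may split or fuse the instruction differently.

```python
from dataclasses import dataclass, field


@dataclass
class MicroOp:
    name: str
    port: str                                  # hypothetical execution port label
    deps: list = field(default_factory=list)   # names of micro-ops it waits on
    latency: int = 1                           # assumed latency in cycles


def decode_add_mem_reg():
    """Hypothetical decode of `add (%eax), %ebx` into three micro-ops."""
    return [
        MicroOp("agu", port="AGU"),                             # tmp_addr = eax + 0
        MicroOp("load", port="LOAD", deps=["agu"], latency=3),  # tmp = [tmp_addr]
        MicroOp("add", port="ALU", deps=["load"]),              # ebx = tmp + ebx
    ]


def schedule(uops):
    """Earliest start cycle of each micro-op, honouring dependencies only."""
    start, finish = {}, {}
    for uop in uops:                           # uops are listed in dependency order
        start[uop.name] = max((finish[d] for d in uop.deps), default=0)
        finish[uop.name] = start[uop.name] + uop.latency
    return start


print(schedule(decode_add_mem_reg()))  # {'agu': 0, 'load': 1, 'add': 4}
```

The chain of dependencies forces the three micro-ops to run one after another, but micro-ops from other, independent instructions could use the idle ports in the meantime.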