
I am learning rocketchip these days, and I have noticed that the IFU (Instruction Fetch Unit) fetches instructions from the ibuf instead of main memory. But I have not seen any code showing how instructions get from main memory into the ibuf. I consulted some experts and got terms like icache, dcache and prefetch. I want to dig into the process.

Can anyone explain the instruction fetch process in modern CPUs? Or are there any books that provide a detailed explanation of how instruction fetching works in modern processors?

Thank you so much for your assistance!

I have found some information online, but I suspect that what I obtained may not be systematic. Therefore, I would like to learn the entire process systematically.

Teng Wu
  • Does this answer your question? [How is a 15 bytes instruction transferred form memory to CPU?](https://stackoverflow.com/questions/54917136/how-is-a-15-bytes-instruction-transferred-form-memory-to-cpu) It's about modern *x86* CPUs in particular, which are somewhat of a special case since most other superscalar pipelined CPUs use fixed-width instructions. (Or only 2 different lengths, like ARM Thumb.) – Peter Cordes Jun 20 '23 at 08:24
  • Thanks for your comment. I learned a lot from your link! And I want to find more detailed tutorials such as talkings about how prefetching unit works. – Teng Wu Jun 20 '23 at 08:45
  • Depends on the CPU. For Intel CPUs specifically, their [optimization manual](https://www.intel.com/content/www/us/en/developer/articles/technical/intel-sdm.html#inpage-nav-5) documents some stuff about how the HW prefetchers work, especially the L2 streamer. For more computer-architecture background in general, see [Modern Microprocessors A 90-Minute Guide!](https://www.lighterra.com/papers/modernmicroprocessors/). For more x86 stuff, see https://stackoverflow.com/tags/x86/info – Peter Cordes Jun 20 '23 at 12:00
  • Thanks! I'll take a look at the resources. – Teng Wu Jun 20 '23 at 12:21

1 Answer


The exact details of how a particular CPU fetches its instructions would probably be behind an NDA, as each processor manufacturer has its own circuit for the fetch unit. So it's not possible for me to comment on a particular CPU. However, at a very high level, the front-end (the stages responsible for instruction fetch and decode) of modern processors consists of pre-fetchers, instruction caches (I-cache) and branch predictors.

Various CPUs may or may not have these three components depending on the types of applications they are designed for. For example, a simple processor for a toy may not need these structures and may directly access the memory to fetch instructions. On the other hand, a processor made for high-performance computing tasks may have multiple pre-fetchers and branch predictors along with a potentially multi-level I-cache. So the exact architecture of the front-end depends on what the processor is designed for. For the rest of this answer, I'm assuming that you are talking about a processor designed for high-performance or desktop computing. Moreover, please keep in mind that the following explanation may not hold for every processor; it is just a high-level view of things.

Modern processors, on the outside, follow the Von Neumann architecture, which means that they expect the data for a program and its instructions to be stored in a single memory. The RAM in your computer acts as this memory. The CPU asks the RAM for instructions/data by providing an address, and the RAM returns the binary values stored at the specified address. Note that the RAM does not distinguish between instructions and data. To the RAM, everything is just a bunch of binary values. Once these instructions/data reach the CPU, they end up in the last level cache (LLC). The LLC serves as a small but fast storage for the CPU. Next, the instructions/data are forwarded to the next level of the cache hierarchy, which is typically the level 2 (L2) cache. Up to and including the L2 cache, there is no distinction between data and instructions. The L2 cache then forwards the data to the level 1 (L1) cache. The L1 cache, however, is divided into two sub-parts called the data cache (D-Cache) and the instruction cache (I-cache). From the L1 cache onwards, the processor follows the Harvard architecture. Once the data reaches the D-Cache and the instructions reach the I-cache, the execution unit of the CPU can start accessing the instructions and the data.
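The fill path described above can be sketched with a toy model (Python; all names and structure here are mine, and real caches track whole cache lines with associativity, replacement policies and coherence, none of which is modelled):

```python
# Toy model of an inclusive cache hierarchy fill path:
# a fetch walks L1i -> L2 -> LLC -> RAM and fills every level on the way back.
RAM = {0x1000: "add x1, x2, x3"}  # address -> instruction (illustrative)

class Cache:
    def __init__(self, name):
        self.name = name
        self.lines = {}               # address -> cached value

    def lookup(self, addr):
        return self.lines.get(addr)

    def fill(self, addr, value):
        self.lines[addr] = value

def fetch(addr, l1i, l2, llc):
    """Return (value, level that served the request)."""
    for cache in (l1i, l2, llc):      # probe caches from fastest to slowest
        value = cache.lookup(addr)
        if value is not None:
            return value, cache.name  # hit at this level
    value = RAM[addr]                 # missed everywhere: go to memory
    for cache in (llc, l2, l1i):      # fill each level on the return path
        cache.fill(addr, value)
    return value, "RAM"

l1i, l2, llc = Cache("L1i"), Cache("L2"), Cache("LLC")
print(fetch(0x1000, l1i, l2, llc))   # first access is served by RAM
print(fetch(0x1000, l1i, l2, llc))   # second access hits in L1i
```

The second fetch hits in the L1 I-cache because the first one filled every level on its way back from memory, which is the whole point of the hierarchy.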

The instructions are accessed by querying the I-cache. The I-cache takes the address of an instruction as input and returns the instruction supposed to be present at that address. However, even though the I-cache is pretty fast (relative to other kinds of memory in a system), it may still take tens of cycles to respond to the execution unit when the requested instruction is not present (a cache miss). If every fetch missed, and fetch were not pipelined, the CPU would only be able to execute an instruction every few tens of cycles.
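The cost asymmetry between a hit and a miss can be illustrated with a minimal latency sketch (the cycle counts are purely illustrative, not taken from any real CPU):

```python
# Hypothetical fetch-latency model: an L1i hit is cheap,
# a miss pays the round trip to the next cache level.
L1I_HIT_CYCLES = 4
L2_HIT_CYCLES = 20

def fetch_latency(addr, l1i_contents):
    """Return the cycle cost of fetching addr; fill the L1i on a miss."""
    if addr in l1i_contents:
        return L1I_HIT_CYCLES
    l1i_contents.add(addr)    # line is filled, so later fetches will hit
    return L2_HIT_CYCLES

l1i = set()
print(fetch_latency(0x40, l1i))  # 20: cold miss, served from L2
print(fetch_latency(0x40, l1i))  # 4:  hit in the L1 I-cache
```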

Thus, to mitigate this issue, computer architects devised pre-fetchers. As the name suggests, a pre-fetcher will fetch an instruction and store it in the I-cache before it is even required. This means that even though the execution unit has not accessed a particular address, the pre-fetcher will still make a request for that address to the I-cache. To put it simply, the pre-fetcher tries to predict which instruction will be executed next and tries to get it into the I-cache. However, due to their limitations, pre-fetchers are often very bad at predicting certain kinds of instructions.
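A minimal sketch of the simplest such scheme, a sequential ("next-line") prefetcher: on each demand fetch it also requests the following cache line, so straight-line code usually finds its next line already resident. (This is a teaching model of my own, not any particular CPU's prefetcher.)

```python
# Next-line instruction prefetcher sketch.
LINE_SIZE = 64  # bytes per cache line (illustrative)

def demand_fetch(addr, icache):
    """Fetch addr; return True on an I-cache hit. Prefetch the next line."""
    hit = addr in icache
    icache.add(addr)              # fill the demanded line
    icache.add(addr + LINE_SIZE)  # speculatively fetch the next line too
    return hit

icache = set()
print(demand_fetch(0x000, icache))  # False: cold miss
print(demand_fetch(0x040, icache))  # True: the prefetcher already brought it in
```

Sequential prefetching works well for straight-line code but, as the next paragraph explains, it cannot help when the program branches somewhere else.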

One example of such instructions are those that follow a branch instruction. When the execution unit encounters a branch instruction, it must first resolve the branch, i.e. evaluate the branch condition, to figure out which direction the program flow will go before it can determine the address of the next instruction. For example, if you have an if condition in your code, until the condition is computed you don't know which instruction will be executed next. Due to the deeply pipelined nature of modern processors, resolving a branch may take tens of cycles, or even hundreds if the branch condition depends on a value that misses in the cache. This is called the branch penalty. During these cycles, the front-end of the processor would be stalled, i.e. unable to fetch any instruction, as it would not know where to fetch the next instruction from. This makes the performance of the processor much worse for programs with lots of branches. As it turns out, in most programs 5-10% of instructions are branch instructions.

Therefore, to handle this issue, computer architects designed branch predictors. As the name suggests, these structures try to predict the outcome and the target of branches before they are resolved. Modern branch predictors are more than 99% accurate for many applications. Thus modern processors only have to pay the large branch penalty for around 1% of all branch instructions in most programs.

Thus, with the help of branch predictors and pre-fetchers, modern processors are able to ensure that, for most of the execution flow, the instructions are already in the I-cache. This, in turn, speeds up the instruction fetch stage, improving the overall performance of the processor.

Note that I've skipped over a lot of very fascinating details in this explanation to keep it short. If you are interested in this sort of stuff, you may want to look at courses that teach computer architecture. A good book for this subject is Computer Architecture: A Quantitative Approach by John L. Hennessy and David A. Patterson.

Setu
  • Internals of modern x86 CPUs are surprisingly well documented, including in Intel's official optimization manual, not just 3rd-party stuff from reverse engineering and patents (a lot of public patents from Intel shed light on how their CPUs work). The secret sauce is in stuff like branch prediction and prefetch algorithms, and how they design the logic that can get the work done fast with low power. The rest is mostly known; see https://www.realworldtech.com/sandy-bridge/ and [Agner Fog's](https://agner.org/optimize/) microarch pdf, and https://en.wikichip.org/wiki/amd/microarchitectures/zen_2 – Peter Cordes Jul 09 '23 at 17:48
  • *This means that the CPU will only be able to execute instruction every 10s of cycles.* - Modern high-performance desktop CPUs can and do run an average of more than 1 instruction per cycle. I think you meant to say it *would* be this slow if every instruction fetch was an L1i miss, and if fetch wasn't pipelined so you were bottlenecked on the latency instead of throughput of L2 hits. And you'll still get L1i hits within basic blocks, and fetching 16 bytes at a time gets multiple instructions. – Peter Cordes Jul 09 '23 at 17:57
  • Also [Modern Microprocessors A 90-Minute Guide!](https://www.lighterra.com/papers/modernmicroprocessors/) is excellent in general, although not much focus on fetch. – Peter Cordes Jul 09 '23 at 17:57
  • @PeterCordes Yes, that's what I wanted to say. I do recognise that modern processors are superscalar in nature and have a high IPC value, but to keep the answer digestible, I didn't want to go into those details. – Setu Jul 09 '23 at 21:07
  • The fact that instruction-fetch happens in blocks of more than 1 instruction, with parallel decode, is pretty important for an answer to a question specifically about modern (2023) CPUs, IMO. Also, the decoded-uop cache is *hugely* important if talking about modern x86. – Peter Cordes Jul 09 '23 at 23:43
  • And BTW, 100s of cycles to resolve a branch is usually an over-estimate, unless the branch condition is a cache-miss load or something. Even on fairly deep pipelines like Skylake, https://www.7-cpu.com/cpu/Skylake.html, the actual branch-miss latency is about 16 cycles on uop-cache hit, 20 on L1i hit. Over 100 *instructions* could be in flight in that window on Ice Lake or Alder Lake, but hundreds (plural) of *cycles* is pretty pessimistic and kind of over-states things if you don't mention why it's so high. – Peter Cordes Jul 09 '23 at 23:44
  • Thank you for your insightful response. I have realized that instruction-fetch in modern CPUs is a complex process. I will take some time to study the entire process based on the information you provided. Once again, thank you very much! – Teng Wu Jul 12 '23 at 03:38