
Is a TLB used at all in the instruction fetching pipeline?

Is this architecture/microarchitecture-dependent?

Kay
    https://www.realworldtech.com/sandy-bridge/3/ shows the instruction-fetch stages of Sandy Bridge, Nehalem, and Bulldozer. Note the L1 iTLB shown right next to the L1 I-cache; it's checked for each fetch block (at least logically; the CPU may optimize by not re-checking while fetch stays within the same 4k page). Anyway, yes: all architectures with instructions in virtual address space use a TLB at some point, either once when fetching *into* a virtually tagged cache or every time when fetching *from* a physically tagged cache. – Peter Cordes Apr 10 '18 at 21:02

1 Answer


Typically, a processor that supports paging (which usually includes a mechanism for excluding execute permission, even if execute permission is not managed separately from read permission) will access a TLB as part of instruction fetch.

A virtually tagged instruction cache would not require a TLB access even for permission checks, provided that: permissions are checked when a block is inserted into the instruction cache (insertion typically involves a TLB access, though a permission cache could be used with a virtually tagged L2 cache; this includes prefetches into the instruction cache); the permission domain is included with the virtual tag (typically the same as an address space identifier, which is useful anyway to avoid cache flushing); and system software ensures that blocks are removed when execute permission is revoked (or when the permission domain/address space identifier is reused for a different permission domain/address space).

(In general, virtually tagged caches do not need a translation lookaside buffer; a cache of permission mappings is sufficient, or permissions can be cached with the tag along with an indication of the permission domain. Before accessing memory a TLB would be used, but cache hits would not require translation. Permission caching is less expensive than translation caching, both because the granularity can be larger and because fewer bits are needed to express permission information.)
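To make the idea concrete, here is a minimal Python sketch (hypothetical structures, not any real microarchitecture) of a virtually tagged I-cache whose tags carry an address-space identifier: hits need no TLB lookup at all, execute permission is checked once at fill time, and system software must invalidate an ASID's blocks when permissions change or the ASID is reused.

```python
# Hypothetical model: a virtually tagged I-cache tagged with (ASID, line),
# so a hit requires neither translation nor a permission re-check.

LINE = 64  # illustrative cache line size in bytes

class ToyTLB:
    def __init__(self, mappings):
        # mappings: (asid, virtual page) -> (physical page, executable?)
        self.mappings = mappings

    def translate(self, asid, vaddr):
        ppage, executable = self.mappings[(asid, vaddr // 4096)]
        return ppage * 4096 + vaddr % 4096, executable

class VirtuallyTaggedICache:
    def __init__(self, tlb, memory):
        self.tlb, self.memory = tlb, memory
        self.lines = {}        # (asid, line number) -> data; fully associative for simplicity
        self.tlb_lookups = 0

    def fetch(self, asid, vaddr):
        key = (asid, vaddr // LINE)
        if key in self.lines:              # hit: no translation, no re-check
            return self.lines[key]
        self.tlb_lookups += 1              # miss: translate + permission check at fill
        paddr, executable = self.tlb.translate(asid, vaddr)
        if not executable:
            raise PermissionError("execute permission denied")
        data = self.memory.get(paddr // LINE, b"\x90" * LINE)
        self.lines[key] = data
        return data

    def invalidate_asid(self, asid):
        # System software must do this when execute permission is revoked
        # or the ASID is reused; otherwise stale executable blocks remain.
        self.lines = {k: v for k, v in self.lines.items() if k[0] != asid}
```

Repeated fetches within cached lines never touch the TLB; only fills (and the invalidation duty) involve translation and permission state.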

A physically tagged instruction cache would require address translation for hit determination, but this determination can be delayed significantly by speculating that the access is a hit (likely using way prediction). Hit determination can be delayed even to the time of instruction commit/result writeback, though earlier handling is typically better.
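A rough sketch of that speculation, with entirely hypothetical structures: the predicted way's data is consumed immediately, while translation and the physical-tag compare only confirm (or refute) the guess afterwards.

```python
# Hypothetical sketch: fetch from a way-predicted, physically tagged cache.
# Data from the predicted way is used speculatively; the late tag compare
# decides hit/miss, and a wrong guess forces a squash and refetch.

from collections import namedtuple

Line = namedtuple("Line", "tag data")

class FlatTLB:
    # Identity translation, standing in for a real TLB in this sketch.
    def translate(self, vaddr):
        return vaddr

def speculative_fetch(cache_set, predicted_way, vaddr, tlb):
    data = cache_set[predicted_way].data       # consumed right away, speculatively
    ptag = tlb.translate(vaddr) // 4096        # translation proceeds in parallel
    hit = cache_set[predicted_way].tag == ptag  # late hit determination
    return data, hit                            # if not hit: squash and refetch
```

The later the `hit` signal is resolved, the more speculative work must be squashed on a mispredict, which is why earlier handling is typically better.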

Because instruction accesses typically have substantial spatial locality, a very small TLB can provide decent hit rates and a reasonably fast, larger back-up TLB can reduce miss costs. Such a microTLB can facilitate sharing a TLB between data and instruction accesses by filtering out most instruction accesses.
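The filtering effect is easy to see in a toy model (hypothetical sizes and FIFO eviction, chosen for brevity): sequential fetch touches thousands of instructions but only a handful of pages, so almost nothing reaches the backing TLB.

```python
# Hypothetical two-level lookup: a tiny microTLB in front of a larger
# backing TLB (modeled as a dict). Spatial locality in instruction fetch
# means nearly all accesses hit the microTLB.

PAGE = 4096

class MicroTLB:
    def __init__(self, entries, backup):
        self.entries = entries          # tiny capacity (a few entries)
        self.backup = backup            # larger second-level TLB
        self.cache = {}                 # virtual page -> physical page
        self.order = []                 # FIFO eviction order, oldest first
        self.backup_lookups = 0

    def translate(self, vaddr):
        vpage = vaddr // PAGE
        if vpage not in self.cache:     # microTLB miss: consult the backing TLB
            self.backup_lookups += 1
            if len(self.cache) >= self.entries:
                del self.cache[self.order.pop(0)]
            self.cache[vpage] = self.backup[vpage]
            self.order.append(vpage)
        return self.cache[vpage] * PAGE + vaddr % PAGE

backup = {vp: vp + 100 for vp in range(16)}
itlb = MicroTLB(entries=4, backup=backup)
for pc in range(0, 2 * PAGE, 4):        # 2048 sequential 4-byte fetches
    itlb.translate(pc)
# Two pages touched, so only two lookups ever reach the backing TLB.
```

With the microTLB absorbing nearly every instruction-side lookup, the backing TLB's ports and capacity are left free for data accesses, which is the sharing benefit described above.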

Obviously, an architecture that does not support paging would not use a TLB (though it might use a memory protection unit to check that an access is permitted or use a different translation mechanism such as adding an offset possibly with a bounds check). An architecture oriented toward single address space operating systems would probably use virtually tagged caches and so access a TLB only on cache misses.
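The offset-plus-bounds-check alternative mentioned above amounts to classic base-and-bounds relocation; a minimal sketch (hypothetical numbers):

```python
# Hypothetical base-and-bounds translation for a non-paged architecture:
# every access is checked against a limit, then relocated by a base offset.

def translate(vaddr, base, limit):
    if vaddr >= limit:                  # bounds check before relocation
        raise MemoryError("bounds violation")
    return base + vaddr                 # simple additive translation; no TLB
```

No TLB is involved because there is no per-page mapping to cache; the whole translation is two registers and an adder.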

  • The split TLB design is particularly efficient. A TLB miss on an instruction fetch is rather disastrous, but a separate iTLB exhibits a much lower miss rate than a dTLB or unified TLB. A low iTLB miss rate is so important that L1 iTLBs are made larger than L1 dTLBs, and even with higher associativity (especially when there are multiple hardware threads, in which case the iTLB can be statically or dynamically partitioned among the threads). Memory management at the OS or runtime level can also have a significant impact on the efficiency of the split TLB design. – Hadi Brais Apr 10 '18 at 16:40
  • For example, if a runtime (like JVM or CLR) uses a single memory manager to allocate sub-page chunks to store dynamically generated code and heap objects, page mappings can be cached in both the iTLB and dTLB, thereby reducing the effectiveness of the split TLB design and making it more like a unified TLB. Fortunately, that is not the case. – Hadi Brais Apr 10 '18 at 16:45