Questions tagged [cpu-cache]

A CPU cache is a hardware structure used by the CPU to reduce the average memory access time.

Caching is beneficial once some data elements get re-used.


Caching is a general policy
aimed at hiding the latency of repeated accesses
to an already-visited but otherwise "expensive" ( read: slow )
resource ( storage )


Caching does not speed up memory access itself; the underlying DRAM remains just as slow.

The most a professional programmer can achieve is to exercise due care to allow some latency-masking in a concurrent mode of code execution: carefully issue instructions well before the forthcoming memory data is actually consumed, so that the cache management can evict a least-recently-used (LRU) line and pre-fetch the requested data from slow DRAM in time.
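As a minimal sketch of that latency-masking idea, the GCC/Clang builtin `__builtin_prefetch` can request a line a fixed distance ahead of the current access; the distance `AHEAD` here is an illustrative value that would need tuning per machine, and the CPU is free to ignore the hint entirely.

```c
#include <stddef.h>

/* Sum an array while hinting the prefetcher a few cache lines ahead.
 * __builtin_prefetch(addr, rw, locality) is only a hint: rw = 0 means
 * the data will be read, locality = 0 means low temporal locality. */
#define AHEAD 16  /* illustrative distance; tune per line size and latency */

long sum_with_prefetch(const long *data, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + AHEAD < n)
            __builtin_prefetch(&data[i + AHEAD], 0, 0);
        total += data[i];  /* by now the line is (hopefully) already in cache */
    }
    return total;
}
```

The result is identical to a plain loop; only the memory-stall profile may differ, and on hardware whose automatic prefetcher already detects the sequential stride, the explicit hint often buys nothing.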


How does it work?

Main memory is usually built with DRAM technology, which allows for big, dense, and cheap storage structures. But DRAM access is much slower than the cycle time of a modern CPU (the so-called memory wall). A CPU cache is a smaller memory, usually built with SRAM technology (expensive, but fast), that reduces the number of accesses to main memory by storing the main-memory contents that are likely to be referenced in the near future.

Caches exploit a property of programs: the principle of locality. Adjacent memory addresses are likely to be referenced close together in time (spatial locality), and an address that is referenced once is likely to be referenced again soon (temporal locality).
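A classic demonstration of spatial locality is matrix traversal order. In the sketch below (matrix size `N` is arbitrary), the row-major loop walks consecutive addresses, so each fetched cache line is fully consumed; the column-major loop jumps `N * sizeof(int)` bytes per access, so with a large `N` nearly every access can touch a new line:

```c
#include <stddef.h>

#define N 512  /* illustrative size: one int row is 2 KiB */

/* Row-major traversal: stride of sizeof(int) bytes, cache-friendly,
 * every byte of a fetched cache line gets used. */
long sum_row_major(int m[N][N])
{
    long total = 0;
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            total += m[i][j];
    return total;
}

/* Column-major traversal: stride of N * sizeof(int) bytes, so each
 * access may land on a different cache line and miss far more often. */
long sum_col_major(int m[N][N])
{
    long total = 0;
    for (size_t j = 0; j < N; j++)
        for (size_t i = 0; i < N; i++)
            total += m[i][j];
    return total;
}
```

Both functions compute the same sum; only the access pattern, and therefore the miss rate and runtime, differs.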

Each cache line is tagged with an address, held in extra SRAM cells. These tag cells record which memory address a line currently holds: since the cache can never mirror the entire system memory, this address must be stored alongside the data. Another part of the address, the index, selects a set within the cache array. The index and the tag can each use either physical or virtual (MMU-translated) addresses, leading to the three types PIPT, VIVT, and VIPT.
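The offset/index/tag split can be sketched as plain bit arithmetic. The geometry below (64-byte lines, 64 sets) is purely illustrative; real caches differ, and the bit widths follow directly from the line size and set count:

```c
#include <stdint.h>

/* Address decomposition for a hypothetical cache with 64-byte lines
 * and 64 sets. Low bits select the byte within a line, the next bits
 * select the set, and everything above is the tag stored in SRAM. */
#define LINE_BITS 6   /* 2^6 = 64-byte line  -> offset = addr[5:0]  */
#define SET_BITS  6   /* 2^6 = 64 sets       -> index  = addr[11:6] */

static inline uint64_t cache_offset(uint64_t addr)
{
    return addr & ((1u << LINE_BITS) - 1);
}

static inline uint64_t cache_index(uint64_t addr)
{
    return (addr >> LINE_BITS) & ((1u << SET_BITS) - 1);
}

static inline uint64_t cache_tag(uint64_t addr)
{
    return addr >> (LINE_BITS + SET_BITS);
}
```

Whether `addr` here is a physical or a virtual address is exactly the PIPT/VIVT/VIPT distinction: VIPT designs index with virtual bits so the set lookup can start before the TLB finishes translating, then compare the tag against the physical address.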

Modern CPUs contain multiple levels of cache. In SMP systems, a cache level may be private to a single CPU, shared by a cluster of CPUs, or shared by the whole system. Because caching can result in multiple copies of the same data being present in an SMP system, cache-coherence protocols are used to keep the copies consistent. VIVT- and VIPT-type caches can also interact with the MMU (and its own cache, commonly called the TLB).
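Coherence protocols track ownership at cache-line granularity, which is why two threads writing to *different* variables that happen to share one line still generate coherence traffic ("false sharing"). A common mitigation, sketched here under the assumption of a 64-byte line size, is to pad per-thread data onto separate lines with C11 `alignas`:

```c
#include <stdalign.h>
#include <stdint.h>

/* Two counters updated by different threads. Packed together they could
 * share one 64-byte cache line, so each write by one core would force
 * the line out of the other core's cache (false sharing).
 * alignas(64) places each counter at the start of its own line.
 * 64 is an assumed line size; query the hardware for the real value. */
struct padded_counters {
    alignas(64) uint64_t a;   /* updated only by thread 1 */
    alignas(64) uint64_t b;   /* updated only by thread 2 */
};
```

The cost is memory: the struct grows from 16 bytes to at least 128, a trade worth making only for data that is genuinely write-hot on different cores.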

Questions regarding CPU cache inconsistencies, profiling or under-utilization are on-topic.

For more information see Wikipedia's CPU-cache article.


1011 questions
17
votes
1 answer

Difference between PREFETCH and PREFETCHNTA instructions

The PREFETCHNTA instruction is basically used to bring the data from main memory to caches by the prefetcher, but instructions with the NT suffix are known to skip caches and avoid cache pollution. So what does PREFETCHNTA do which is different from…
Abhishek Nikam
  • 618
  • 7
  • 15
17
votes
4 answers

Does the Java Memory Model (JSR-133) imply that entering a monitor flushes the CPU data cache(s)?

There is something that bugs me with the Java memory model (if i even understand everything correctly). If there are two threads A and B, there are no guarantees that B will ever see a value written by A, unless both A and B synchronize on the same…
Durandal
  • 19,919
  • 4
  • 36
  • 70
17
votes
3 answers

What cache invalidation algorithms are used in actual CPU caches?

I came to the topic caching and mapping and cache misses and how the cache blocks get replaced in what order when all blocks are already full. There is the least recently used algorithm or the fifo algorithm or the least frequently algorithm and…
fedab
  • 978
  • 11
  • 38
17
votes
1 answer

prefetching data at L1 and L2

In Agner Fog's manual Optimizing software in C++ in section 9.10 "Cache contentions in large data structures" he describes a problem transposing a matrix when the matrix width is equal to something called the critical stride. In his test the cost…
Z boson
  • 32,619
  • 11
  • 123
  • 226
17
votes
2 answers

Can "non-native" pointers hurt cache performance?

As far as I can tell, hardware prefetchers will at the very least detect and fetch constant strides through memory. Additionally it can monitor data access patterns, whatever that really means. Which led me to wonder, do hardware prefetchers ever…
porgarmingduod
  • 7,668
  • 10
  • 50
  • 83
17
votes
3 answers

How would you generically detect cache line associativity from user mode code?

I'm putting together a small patch for the cachegrind/callgrind tool in valgrind which will auto-detect, using completely generic code, CPU instruction and cache configuration (right now only x86/x64 auto-configures, and other architectures don't…
Niall Douglas
  • 9,212
  • 2
  • 44
  • 54
16
votes
7 answers

Design code to fit in CPU Cache?

When writing simulations my buddy says he likes to try to write the program small enough to fit into cache. Does this have any real meaning? I understand that cache is faster than RAM and the main memory. Is it possible to specify that you want…
Nope
  • 34,682
  • 42
  • 94
  • 119
16
votes
3 answers

CPU cache critical stride test giving unexpected results based on access type

Inspired by this recent question on SO and the answers given, which made me feel very ignorant, I decided I'd spend some time to learn more about CPU caching and wrote a small program to verify whether I am getting this whole thing right (most…
Andy Prowl
  • 124,023
  • 23
  • 387
  • 451
15
votes
3 answers

Where is the Write-Combining Buffer located? x86

How is the Write-Combine buffer physically hooked up? I have seen block diagrams illustrating a number of variants: Between L1 and Memory controller Between CPU's store buffer and Memory controller Between CPU's AGUs and/or store units Is it…
Kay
  • 745
  • 5
  • 15
15
votes
1 answer

How can the L1, L2, L3 CPU caches be turned off on modern x86/amd64 chips?

Every modern high-performance CPU of the x86/x86_64 architecture has some hierarchy of data caches: L1, L2, and sometimes L3 (and L4 in very rare cases), and data loaded from/to main RAM is cached in some of them. Sometimes the programmer may want…
osgx
  • 90,338
  • 53
  • 357
  • 513
15
votes
4 answers

What is the difference in cache memory and tightly coupled memory

Due to being embedded inside the CPU The TCM has a Harvard-architecture, so there is an ITCM (instruction TCM) and a DTCM (data TCM). The DTCM can not contain any instructions, but the ITCM can actually contain data. The size of DTCM or ITCM is…
mrigendra
  • 1,472
  • 3
  • 19
  • 33
15
votes
4 answers

How to get the size of the CPU cache in Linux

I have executed the following query: free -m And output of this command is: total used free shared buffers cached Mem: 2048 2018 29 5 0 595 I want to get the size of…
Naveen Kumar Mishra
  • 321
  • 1
  • 3
  • 16
15
votes
4 answers

Measure size and way-order of L1 and L2 caches

How can I programmatically measure (not query the OS) the size and order of associativity of L1 and L2 caches (data caches)? Assumptions about system: It has L1 and L2 cache (may be L3 too, may be cache sharing), It may have a hardware prefetch…
osgx
  • 90,338
  • 53
  • 357
  • 513
15
votes
2 answers

C optimization: conditional store to avoid dirtying a cache line

In the libuv source, I found this code: /* The if statement lets the compiler compile it to a conditional store. * Avoids dirtying a cache line. */ if (loop->stop_flag != 0) loop->stop_flag = 0; Can someone explain this a bit? What…
Albert
  • 65,406
  • 61
  • 242
  • 386
15
votes
2 answers

CPU cache behaviour/policy for file-backed memory mappings?

Does anyone know which type of CPU cache behaviour or policy (e.g. uncacheable write-combining) is assigned to memory mapped file-backed regions on modern x86 systems? Is there any way to detect which is the case, and possibly override the default…
awdz9nld
  • 1,663
  • 15
  • 26