Questions tagged [cpu-cache]

A CPU cache is a hardware structure used by the CPU to reduce the average memory access time.

Caching is beneficial once some data elements are re-used.


Caching is a general strategy for hiding the latency of repeated accesses to an already-visited but otherwise "expensive" (read: slow) resource, such as storage.


Caching does not speed up the memory itself.

The most a programmer can achieve is to exercise due care so that latency can be masked during concurrent execution: issue prefetch instructions well before the data is actually consumed, so that the cache controller can evict an LRU line and fetch the requested data from slow DRAM in the meantime.
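As a minimal sketch of such latency masking, the loop below prefetches a few cache lines ahead of the element currently being summed, using GCC/Clang's `__builtin_prefetch`. The prefetch distance of 16 elements is an illustrative assumption, not a universal constant; the right value depends on the machine and must be tuned.

```c
#include <stddef.h>

/* Tuning parameter (an assumption): how far ahead to prefetch. */
#define PREFETCH_DISTANCE 16

/* Sum an array while prefetching ahead, so the DRAM fetch of future
 * elements overlaps with arithmetic on already-cached ones. */
long sum_with_prefetch(const long *a, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; ++i) {
#if defined(__GNUC__)
        if (i + PREFETCH_DISTANCE < n)
            /* rw = 0 (read), locality = 3 (keep in all cache levels) */
            __builtin_prefetch(&a[i + PREFETCH_DISTANCE], 0, 3);
#endif
        sum += a[i];
    }
    return sum;
}
```

The result is identical with or without the prefetch hint; only the timing may differ, and hardware prefetchers often make the explicit hint redundant for simple linear scans like this one.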


How does it work?

Main memory is usually built with DRAM technology, which allows for big, dense and cheap storage. But DRAM access is much slower than the cycle time of a modern CPU (the so-called memory wall). A CPU cache is a smaller memory, usually built with SRAM technology (expensive, but fast), that reduces the number of accesses to main memory by storing the main-memory contents that are likely to be referenced in the near future. Caches exploit a property of programs, the principle of locality: adjacent memory addresses are likely to be referenced close together in time (spatial locality), and an address that has been referenced once is likely to be referenced again soon (temporal locality).
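Spatial locality is easy to see with a 2D array in C, which is stored row-major. The two functions below compute the same sum; the row-major traversal touches adjacent addresses and uses every byte of each fetched cache line, while the column-major one strides a full row's width per access and typically hits a different line each time. The 256x256 size is just an illustrative choice.

```c
#include <stddef.h>

#define ROWS 256
#define COLS 256

/* Row-major traversal: consecutive iterations touch adjacent addresses,
 * so each fetched cache line is fully used (good spatial locality). */
long sum_row_major(const int m[ROWS][COLS])
{
    long sum = 0;
    for (size_t r = 0; r < ROWS; ++r)
        for (size_t c = 0; c < COLS; ++c)
            sum += m[r][c];
    return sum;
}

/* Column-major traversal computes the same sum, but strides
 * COLS * sizeof(int) bytes per access, touching a different cache
 * line on almost every iteration (poor spatial locality). */
long sum_col_major(const int m[ROWS][COLS])
{
    long sum = 0;
    for (size_t c = 0; c < COLS; ++c)
        for (size_t r = 0; r < ROWS; ++r)
            sum += m[r][c];
    return sum;
}
```

On arrays larger than the cache, the row-major version is usually noticeably faster, even though both perform exactly the same number of additions.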

Each cache line is tagged with an address, held in extra SRAM cells. These tag cells record which specific address the cached data belongs to; since the cache can never mirror the entire system memory, this address must be stored alongside the data. The index into the array selects a set. The index and the tag can each use either physical or virtual (MMU-translated) addresses, leading to the three cache types PIPT, VIVT and VIPT.
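For a concrete picture of the tag/index/offset split, here is a sketch for a hypothetical cache with 64-byte lines and 64 sets (both illustrative assumptions; real caches vary). The low bits of an address select the byte within a line, the next bits select the set, and everything above is stored as the tag.

```c
#include <stdint.h>

/* Hypothetical cache geometry (assumptions, not a real CPU's values). */
#define LINE_BITS 6   /* 64-byte line -> 6 offset bits */
#define SET_BITS  6   /* 64 sets      -> 6 index bits  */

/* Byte position within the cache line. */
static uint32_t cache_offset(uint32_t addr)
{
    return addr & ((1u << LINE_BITS) - 1);
}

/* Which set of the cache array this address maps to. */
static uint32_t cache_index(uint32_t addr)
{
    return (addr >> LINE_BITS) & ((1u << SET_BITS) - 1);
}

/* The remaining high bits, stored in the tag SRAM for comparison. */
static uint32_t cache_tag(uint32_t addr)
{
    return addr >> (LINE_BITS + SET_BITS);
}
```

Whether `addr` here is a physical or a virtual address is exactly what distinguishes PIPT, VIVT and VIPT: in a VIPT cache, for example, the index is taken from the virtual address (so set selection can start before translation) while the tag is compared against the physical address.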

Modern CPUs contain multiple levels of cache. In SMP systems a given cache level may be private to a single CPU, shared by a cluster of CPUs, or shared by the whole system. Because caching can result in multiple copies of the same data being present in an SMP system, cache-coherence protocols are used to keep those copies consistent. The VIVT and VIPT cache types also interact with the MMU (and its own cache, commonly called the TLB).
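One practical consequence of coherence working at cache-line granularity is "false sharing": independent variables that happen to share a line force the coherence protocol to bounce that line between cores. A common mitigation, sketched below under the assumption of a 64-byte line size (typical on x86, but not guaranteed), is to pad per-thread data to a full line using C11 `alignas`.

```c
#include <stdalign.h>

/* One counter per thread. Aligning each slot to a 64-byte boundary
 * (an assumed cache-line size) pads the struct to a full line, so two
 * threads incrementing different counters never contend for the same
 * line under the coherence protocol. */
struct padded_counter {
    alignas(64) long value;
};

/* e.g. one slot per core; each lives on its own cache line. */
struct padded_counter counters[4];
```

C17 also provides `hardware_destructive_interference_size` only in C++ (`<new>`); in plain C the 64-byte figure remains a per-platform assumption that should be verified (e.g. via `sysconf` or CPUID) rather than hard-coded in portable code.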

Questions regarding CPU cache inconsistencies, profiling or under-utilization are on-topic.

For more information see Wikipedia's CPU-cache article.


1011 questions
55
votes
2 answers

Do current x86 architectures support non-temporal loads (from "normal" memory)?

I am aware of multiple questions on this topic, however, I haven't seen any clear answers nor any benchmark measurements. I thus created a simple program that works with two arrays of integers. The first array a is very large (64 MB) and the second…
Daniel Langr
  • 22,196
  • 3
  • 50
  • 93
53
votes
5 answers

Why is linear read-shuffled write not faster than shuffled read-linear write?

I'm currently trying to get a better understanding of memory/cache related performance issues. I read somewhere that memory locality is more important for reading than for writing, because in the former case the CPU has to actually wait for the data…
Paul Panzer
  • 51,835
  • 3
  • 54
  • 99
52
votes
7 answers

Where is the L1 memory cache of Intel x86 processors documented?

I am trying to profile and optimize algorithms and I would like to understand the specific impact of the caches on various processors. For recent Intel x86 processors (e.g. Q9300), it is very hard to find detailed information about cache structure.…
Brent Bradburn
  • 51,587
  • 17
  • 154
  • 173
52
votes
3 answers

What's the difference between conflict miss and capacity miss

Capacity miss occurs because blocks are being discarded from cache because cache cannot contain all blocks needed for program execution (program working set is much larger than cache capacity). Conflict miss occurs in the case of set associative or…
xiaodong
  • 952
  • 1
  • 7
  • 19
52
votes
4 answers

How can I do a CPU cache flush in x86 Windows?

I am interested in forcing a CPU cache flush in Windows (for benchmarking reasons, I want to emulate starting with no data in CPU cache), preferably a basic C implementation or Win32 call. Is there a known way to do this with a system call or even…
user183135
  • 3,095
  • 2
  • 18
  • 7
40
votes
7 answers

Why is the size of L1 cache smaller than that of the L2 cache in most of the processors?

Why is the size of L1 cache smaller than that of the L2 cache in most of the processors ?
Karthik Balaguru
  • 7,424
  • 7
  • 48
  • 65
38
votes
9 answers

Can I force cache coherency on a multicore x86 CPU?

The other week, I wrote a little thread class and a one-way message pipe to allow communication between threads (two pipes per thread, obviously, for bidirectional communication). Everything worked fine on my Athlon 64 X2, but I was wondering if…
Furious Coder
  • 1,160
  • 1
  • 11
  • 17
38
votes
4 answers

What is meant by data cache and instruction cache?

From here: Instructions and data have different access patterns, and access different regions of memory. Thus, having the same cache for both instructions and data may not always work out. Thus, it's rather common to have two caches: an…
Celeritas
  • 14,489
  • 36
  • 113
  • 194
36
votes
6 answers

Temporal vs Spatial Locality with arrays

I am a little confused on the meanings of spatial and temporal locality. I'm hoping by looking at it with an array example it will help me understand it better. In an example like this: A[0][1], A[0][2], A[0][3].... etc Does this demonstrate…
Eric Smith
  • 1,336
  • 4
  • 17
  • 32
35
votes
2 answers

What use is the INVD instruction?

The x86 INVD invalidates the cache hierarchy without writing the contents back to memory, apparently. I'm curious, what use is such an instruction? Given how one has very little control over what data may be in the various cache levels and even less…
Dolda2000
  • 25,216
  • 4
  • 51
  • 92
35
votes
4 answers

Cycles/cost for L1 Cache hit vs. Register on x86?

I remember assuming that an L1 cache hit is 1 cycle (i.e. identical to register access time) in my architecture class, but is that actually true on modern x86 processors? How many cycles does an L1 cache hit take? How does it compare to register…
user541686
  • 205,094
  • 128
  • 528
  • 886
33
votes
4 answers

What is locality of reference?

I am having problem in understanding locality of reference. Can anyone please help me out in understanding what it means and what is, Spatial Locality of reference Temporal Locality of reference
user379888
31
votes
1 answer

What are _mm_prefetch() locality hints?

The intrinsics guide says only this much about void _mm_prefetch (char const* p, int i) : Fetch the line of data from memory that contains address p to a location in the cache hierarchy specified by the locality hint i. Could you list the…
Serge Rogatch
  • 13,865
  • 7
  • 86
  • 158
30
votes
4 answers

Does a memory barrier ensure that the cache coherence has been completed?

Say I have two threads that manipulate the global variable x. Each thread (or each core I suppose) will have a cached copy of x. Now say that Thread A executes the following instructions: set x to 5 some other instruction Now when set x to 5 is…
30
votes
8 answers

How to programmatically get the CPU cache line size in C++?

I'd like my program to read the cache line size of the CPU it's running on in C++. I know that this can't be done portably, so I will need a solution for Linux and another for Windows (Solutions for other systems could be useful to others, so post…
Mathieu Pagé
  • 10,764
  • 13
  • 48
  • 71