Questions tagged [cpu-cache]

A CPU cache is a hardware structure used by the CPU to reduce the average memory access time.

Caching pays off once some data elements get re-used.

Caching is a general policy aimed at hiding the latency of repeatedly re-accessing some already visited but otherwise "expensive" (read: slow) resource, such as storage: the latency is paid once, on the first access, and subsequent re-accesses are served from the faster cache.


Caching does not, by itself, speed up memory access; DRAM stays slow.

The most a professional programmer can achieve is to exercise due care to allow some latency masking in a concurrent mode of code execution: issue instructions (or prefetch hints) well before the moment the memory data is actually consumed, so that the cache management can evict an LRU line and pre-fetch the requested data from slow DRAM while other work proceeds.
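As a minimal sketch of that latency-masking idea, assuming GCC or Clang (whose real `__builtin_prefetch` builtin emits a prefetch hint; `PREFETCH_DIST` and `sum_with_prefetch` are made-up names for this illustration):

```c
#include <stddef.h>

/* Walk an array while hinting the hardware a few iterations ahead, so the
   DRAM fetch overlaps useful work. PREFETCH_DIST is a made-up tuning
   constant; useful values depend on the loop body and the hardware. */
enum { PREFETCH_DIST = 16 };

double sum_with_prefetch(const double *a, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
#if defined(__GNUC__) || defined(__clang__)
        if (i + PREFETCH_DIST < n)  /* stay inside the array */
            __builtin_prefetch(&a[i + PREFETCH_DIST], /* rw = */ 0, /* locality = */ 1);
#endif
        s += a[i];
    }
    return s;
}
```

Whether this helps at all must be measured: the hardware prefetcher often already covers simple sequential walks like this one, and a useless hint just wastes an instruction slot.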


How does it work?

Main memory is usually built with DRAM technology, which allows for big, dense and cheap storage structures. But DRAM access is much slower than the cycle time of a modern CPU (the so-called memory wall). A CPU cache is a smaller memory, usually built with SRAM technology (expensive, but fast), that reduces the number of accesses to main memory by storing the main-memory contents that are likely to be referenced in the near future. Caches exploit a property of programs, the principle of locality: adjacent memory addresses are likely to be referenced close together in time (spatial locality), and an address that has been referenced once is likely to be referenced again soon (temporal locality). See also: typical latency numbers for memory, disk, network, etc.
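Spatial locality is easy to demonstrate with a two-dimensional array, which C stores row by row. Both functions below compute the same sum, but the column-order walk jumps `N * sizeof(int)` bytes per step and so touches a different cache line on almost every access (`N`, `sum_rows` and `sum_cols` are names chosen for this sketch):

```c
#include <stddef.h>

#define N 512  /* matrix dimension; arbitrary size for the example */

/* Row-major traversal touches consecutive addresses, so each 64-byte
   cache line fetched from DRAM serves 16 ints: good spatial locality. */
long sum_rows(int (*m)[N]) {
    long s = 0;
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            s += m[i][j];
    return s;
}

/* Column-major traversal strides N*sizeof(int) bytes per step, touching
   a different cache line almost every access: poor spatial locality. */
long sum_cols(int (*m)[N]) {
    long s = 0;
    for (size_t j = 0; j < N; j++)
        for (size_t i = 0; i < N; i++)
            s += m[i][j];
    return s;
}
```

On typical hardware the row-order version runs several times faster once the matrix outgrows the cache, even though the two loops perform exactly the same arithmetic.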

Each entry in the CPU cache is tagged with an address, held in extra SRAM cells. These tag cells record which memory address the cached data comes from; since the cache can never mirror the entire system memory, the address must be stored alongside the data. The index into the cache array selects a set. The index and the tag can each use either physical or virtual (MMU-translated) addresses, leading to the three cache types PIPT, VIVT and VIPT.
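The tag/index/offset split can be worked through for a concrete geometry. The parameters below are an assumption for illustration (they happen to match many L1 data caches), and the helper names are made up:

```c
#include <stdint.h>

/* Split an address into tag / set index / line offset for a hypothetical
   32 KiB, 8-way set-associative cache with 64-byte lines. */
enum {
    LINE_BYTES  = 64,                                /* => 6 offset bits */
    WAYS        = 8,
    CACHE_BYTES = 32 * 1024,
    SETS        = CACHE_BYTES / (LINE_BYTES * WAYS), /* 64 sets => 6 index bits */
};

static uint64_t line_offset(uint64_t addr) { return addr % LINE_BYTES; }
static uint64_t set_index(uint64_t addr)   { return (addr / LINE_BYTES) % SETS; }
static uint64_t tag_bits(uint64_t addr)    { return addr / ((uint64_t)LINE_BYTES * SETS); }
```

Everything above the index bits is the tag that must be stored; all addresses whose index bits match compete for the same set of 8 ways.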

Modern CPUs contain multiple levels of cache. In SMP systems a given cache level may be private to a single CPU, shared by a cluster of CPUs, or shared by the whole system. Because caching can leave multiple copies of the same data present in an SMP system, cache coherence protocols are used to keep those copies consistent. The VIVT and VIPT cache types also interact with the MMU (and its own cache, commonly called the TLB).
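Coherence operates on whole cache lines, which is why unrelated variables that merely sit in the same line can slow each other down ("false sharing"). A minimal sketch of the problem and the usual padding fix, assuming a 64-byte line (the struct names are invented for this example; the real line size can be queried at run time):

```c
#include <stddef.h>

/* Two counters updated by two different threads. Packed together they live
   in one cache line, so every write forces the coherence protocol to bounce
   the line between the cores ("false sharing"). Giving each counter its own
   line-aligned slot avoids the ping-pong at the cost of some memory. */
#define LINE 64  /* assumed line size */

struct counters_shared {   /* both fields fit in one 64-byte line */
    long a;                /* written by thread 1 */
    long b;                /* written by thread 2 */
};

struct counters_padded {   /* each field owns a full cache line (C11 _Alignas) */
    _Alignas(LINE) long a;
    _Alignas(LINE) long b;
};
```

The padded layout trades memory for independence: each thread's writes now stay in a line no other thread touches, so no coherence traffic is generated between them.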

Questions regarding CPU cache inconsistencies, profiling or under-utilization are on-topic.

For more information see Wikipedia's CPU-cache article.


1011 questions
12
votes
1 answer

Why are these matrix transposition times so counter-intuitive?

The following example code generates a matrix of size N, and transposes it SAMPLES number of times. When N = 512 the average execution time of the transposition operation is 2144 μs (coliru link). At first look there is nothing special, right?…
Narek Atayan
  • 1,479
  • 13
  • 27
12
votes
2 answers

Set Associative Cache: Calculate size of tag?

I'm struggling to solve this question, I've looked around but all of the similar questions are more advanced than mine, making use of logs, it's more advanced than we've done in our class. Here's the question: Suppose you have a 4-way set…
janderson
  • 963
  • 4
  • 14
  • 26
12
votes
4 answers

read CPU cache contents

Is there any way to read the CPU cache contents? Architecture is for ARM. I m invalidating a range of addresses and then want to make sure whether it is invalidated or not. Although I can do read and write of the range of addresses with and without…
kumar
  • 2,530
  • 6
  • 33
  • 57
11
votes
2 answers

In C++11 threads, what guarantees does a std::mutex have about memory visibility?

I am currently trying to learn the C++11 threading api, and I am finding that the various resources don't provide an essential piece of information: how the CPU cache is handled. Modern CPUs have a cache for each core (meaning different threads may…
john01dav
  • 1,842
  • 1
  • 21
  • 40
11
votes
3 answers

How can I pinpoint if the slowness in my program is a CPU cache issue (on Linux)?

I'm currently trying to understand some very very strange behavior in one of my C programs. Apparently, adding or removing a seemingly inconsequential line at the end of it drastically affects the performance in the rest of the program. My program…
hugomg
  • 68,213
  • 24
  • 160
  • 246
11
votes
4 answers

CPU Cache disadvantages of using linked lists in C

I was wondering what were the advantages and disadvantages of linked-list compared to contiguous arrays in C. Therefore I read a wikipedia article about linked-lists. https://en.wikipedia.org/wiki/Linked_list#Disadvantages According to this article,…
ouphi
  • 232
  • 2
  • 9
11
votes
1 answer

Skylake L2 cache enhanced by reducing associativity?

In Intel's optimization guide, section 2.1.3, they list a number of enhancements to the caches and memory subsystem in Skylake (emphasis mine): The cache hierarchy of the Skylake microarchitecture has the following enhancements: Higher Cache…
BeeOnRope
  • 60,350
  • 16
  • 207
  • 386
11
votes
4 answers

Non-temporal loads and the hardware prefetcher, do they work together?

When executing a series of _mm_stream_load_si128() calls (MOVNTDQA) from consecutive memory locations, will the hardware pre-fetcher still kick-in, or should I use explicit software prefetching (with NTA hint) in order to obtain the benefits of…
BlueStrat
  • 2,202
  • 17
  • 27
10
votes
0 answers

Is there a special benefit to consuming whole cache lines between iterations of a loop?

My program adds float arrays and is unrolled 4x when compiled with max optimizations by MSVC and G++. I didn't understand why both compilers chose to unroll 4x so I did some testing and found only occasionally a t-test on runtimes for manually…
Matt
  • 179
  • 6
10
votes
2 answers

How do the store buffer and Line Fill Buffer interact with each other?

I was reading the MDS attack paper RIDL: Rogue In-Flight Data Load. They discuss how the Line Fill Buffer can cause leakage of data. There is the About the RIDL vulnerabilities and the "replaying" of loads question that discusses the…
Daniel Näslund
  • 2,300
  • 3
  • 22
  • 27
10
votes
3 answers

Is there a way to check whether the processor cache has been flushed recently?

On i386 linux. Preferably in c/(c/posix std libs)/proc if possible. If not is there any piece of assembly or third party library that can do this? Edit: I'm trying to develop test whether a kernel module clear a cache line or the whole…
Roman A. Taycher
  • 18,619
  • 19
  • 86
  • 141
10
votes
2 answers

Are there any modern CPUs where a cached byte store is actually slower than a word store?

It's a common claim that a byte store into cache may result in an internal read-modify-write cycle, or otherwise hurt throughput or latency vs. storing a full register. But I've never seen any examples. No x86 CPUs are like this, and I think all…
Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
10
votes
3 answers

Objective-C cpu cache behavior

Apple provides some documentation about synchronizing variables and even order of execution. What I don't see is any documentation on CPU cache behavior. What guarantees and control does the Objective-C developer have to ensure cache coherence…
John K
  • 859
  • 1
  • 8
  • 16
10
votes
1 answer

Avoiding cache pollution while loading a stream of numbers

On x86 processors is there a way to load data from regular write back memory into registers without going through the cache hierarchy? My use case is that I have a big look up structure (Hash map or B-Tree). I am working through a large stream of…
Rajiv
  • 2,587
  • 2
  • 22
  • 33
10
votes
5 answers

Do memory allocation functions indicate that the memory content is no longer used?

When processing some stream of data, e.g., requests from a network, it is quite common that some temporary memory is used. For example, a URL may be split into multiple strings, each one possibly allocating memory from the heap. The use of these…