Questions tagged [cpu-cache]

A CPU cache is a hardware structure used by the CPU to reduce the average memory access time.

Caching pays off once some data elements get re-used.

Caching is a general policy aimed at hiding the latency of repeatedly re-accessing some already visited but otherwise "expensive" (read: slow) resource, such as storage: the latency is paid once, on the first access, and subsequent re-accesses are served from the faster cache.


Caching does not, by itself, speed up memory access; DRAM stays slow.

The most a professional programmer can achieve is to exercise due care to allow some latency masking in a concurrent mode of code execution: issue instructions (or prefetch hints) well before the moment the memory data is actually consumed, so that the cache management can evict an LRU line and pre-fetch the requested data from slow DRAM while other work proceeds.
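As a minimal sketch of that latency-masking idea, assuming GCC or Clang (whose real `__builtin_prefetch` builtin emits a prefetch hint; `PREFETCH_DIST` and `sum_with_prefetch` are made-up names for this illustration):

```c
#include <stddef.h>

/* Walk an array while hinting the hardware a few iterations ahead, so the
   DRAM fetch overlaps useful work. PREFETCH_DIST is a made-up tuning
   constant; useful values depend on the loop body and the hardware. */
enum { PREFETCH_DIST = 16 };

double sum_with_prefetch(const double *a, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
#if defined(__GNUC__) || defined(__clang__)
        if (i + PREFETCH_DIST < n)  /* stay inside the array */
            __builtin_prefetch(&a[i + PREFETCH_DIST], /* rw = */ 0, /* locality = */ 1);
#endif
        s += a[i];
    }
    return s;
}
```

Whether this helps at all must be measured: the hardware prefetcher often already covers simple sequential walks like this one, and a useless hint just wastes an instruction slot.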


How does it work?

Main memory is usually built with DRAM technology, which allows for big, dense and cheap storage structures. But DRAM access is much slower than the cycle time of a modern CPU (the so-called memory wall). A CPU cache is a smaller memory, usually built with SRAM technology (expensive, but fast), that reduces the number of accesses to main memory by storing the main-memory contents that are likely to be referenced in the near future. Caches exploit a property of programs, the principle of locality: adjacent memory addresses are likely to be referenced close together in time (spatial locality), and an address that has been referenced once is likely to be referenced again soon (temporal locality). See also: typical latency numbers for memory, disk, network, etc.
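Spatial locality is easy to demonstrate with a two-dimensional array, which C stores row by row. Both functions below compute the same sum, but the column-order walk jumps `N * sizeof(int)` bytes per step and so touches a different cache line on almost every access (`N`, `sum_rows` and `sum_cols` are names chosen for this sketch):

```c
#include <stddef.h>

#define N 512  /* matrix dimension; arbitrary size for the example */

/* Row-major traversal touches consecutive addresses, so each 64-byte
   cache line fetched from DRAM serves 16 ints: good spatial locality. */
long sum_rows(int (*m)[N]) {
    long s = 0;
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            s += m[i][j];
    return s;
}

/* Column-major traversal strides N*sizeof(int) bytes per step, touching
   a different cache line almost every access: poor spatial locality. */
long sum_cols(int (*m)[N]) {
    long s = 0;
    for (size_t j = 0; j < N; j++)
        for (size_t i = 0; i < N; i++)
            s += m[i][j];
    return s;
}
```

On typical hardware the row-order version runs several times faster once the matrix outgrows the cache, even though the two loops perform exactly the same arithmetic.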

Each entry in the CPU cache is tagged with an address, held in extra SRAM cells. These tag cells record which memory address the cached data comes from; since the cache can never mirror the entire system memory, the address must be stored alongside the data. The index into the cache array selects a set. The index and the tag can each use either physical or virtual (MMU-translated) addresses, leading to the three cache types PIPT, VIVT and VIPT.
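The tag/index/offset split can be worked through for a concrete geometry. The parameters below are an assumption for illustration (they happen to match many L1 data caches), and the helper names are made up:

```c
#include <stdint.h>

/* Split an address into tag / set index / line offset for a hypothetical
   32 KiB, 8-way set-associative cache with 64-byte lines. */
enum {
    LINE_BYTES  = 64,                                /* => 6 offset bits */
    WAYS        = 8,
    CACHE_BYTES = 32 * 1024,
    SETS        = CACHE_BYTES / (LINE_BYTES * WAYS), /* 64 sets => 6 index bits */
};

static uint64_t line_offset(uint64_t addr) { return addr % LINE_BYTES; }
static uint64_t set_index(uint64_t addr)   { return (addr / LINE_BYTES) % SETS; }
static uint64_t tag_bits(uint64_t addr)    { return addr / ((uint64_t)LINE_BYTES * SETS); }
```

Everything above the index bits is the tag that must be stored; all addresses whose index bits match compete for the same set of 8 ways.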

Modern CPUs contain multiple levels of cache. In SMP systems a given cache level may be private to a single CPU, shared by a cluster of CPUs, or shared by the whole system. Because caching can leave multiple copies of the same data present in an SMP system, cache coherence protocols are used to keep those copies consistent. The VIVT and VIPT cache types also interact with the MMU (and its own cache, commonly called the TLB).
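Coherence operates on whole cache lines, which is why unrelated variables that merely sit in the same line can slow each other down ("false sharing"). A minimal sketch of the problem and the usual padding fix, assuming a 64-byte line (the struct names are invented for this example; the real line size can be queried at run time):

```c
#include <stddef.h>

/* Two counters updated by two different threads. Packed together they live
   in one cache line, so every write forces the coherence protocol to bounce
   the line between the cores ("false sharing"). Giving each counter its own
   line-aligned slot avoids the ping-pong at the cost of some memory. */
#define LINE 64  /* assumed line size */

struct counters_shared {   /* both fields fit in one 64-byte line */
    long a;                /* written by thread 1 */
    long b;                /* written by thread 2 */
};

struct counters_padded {   /* each field owns a full cache line (C11 _Alignas) */
    _Alignas(LINE) long a;
    _Alignas(LINE) long b;
};
```

The padded layout trades memory for independence: each thread's writes now stay in a line no other thread touches, so no coherence traffic is generated between them.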

Questions regarding CPU cache inconsistencies, profiling or under-utilization are on-topic.

For more information see Wikipedia's CPU-cache article.


1011 questions
12
votes
1 answer

Why are these matrix transposition times so counter-intuitive?

The following example code generates a matrix of size N, and transposes it SAMPLES number of times. When N = 512 the average execution time of the transposition operation is 2144 μs (coliru link). At first look there is nothing special, right?…
Narek Atayan
  • 1,479
  • 13
  • 27
12
votes
2 answers

Set Associative Cache: Calculate size of tag?

I'm struggling to solve this question, I've looked around but all of the similar questions are more advanced than mine, making use of logs, it's more advanced than we've done in our class. Here's the question: Suppose you have a 4-way set…
janderson
  • 963
  • 4
  • 14
  • 26
12
votes
4 answers

read CPU cache contents

Is there any way to read the CPU cache contents? Architecture is for ARM. I m invalidating a range of addresses and then want to make sure whether it is invalidated or not. Although I can do read and write of the range of addresses with and without…
kumar
  • 2,530
  • 6
  • 33
  • 57
11
votes
2 answers

In C++11 threads, what guarantees does a std::mutex have about memory visibility?

I am currently trying to learn the C++11 threading api, and I am finding that the various resources don't provide an essential piece of information: how the CPU cache is handled. Modern CPUs have a cache for each core (meaning different threads may…
john01dav
  • 1,842
  • 1
  • 21
  • 40
11
votes
3 answers

How can I pinpoint if the slowness in my program is a CPU cache issue (on Linux)?

I'm currently trying to understand some very very strange behavior in one of my C programs. Apparently, adding or removing a seemingly inconsequential line at the end of it drastically affects the performance in the rest of the program. My program…
hugomg
  • 68,213
  • 24
  • 160
  • 246
11
votes
4 answers

CPU Cache disadvantages of using linked lists in C

I was wondering what were the advantages and disadvantages of linked-list compared to contiguous arrays in C. Therefore I read a wikipedia article about linked-lists. https://en.wikipedia.org/wiki/Linked_list#Disadvantages According to this article,…
ouphi
  • 232
  • 2
  • 9
11
votes
1 answer

Skylake L2 cache enhanced by reducing associativity?

In Intel's optimization guide, section 2.1.3, they list a number of enhancements to the caches and memory subsystem in Skylake (emphasis mine): The cache hierarchy of the Skylake microarchitecture has the following enhancements: Higher Cache…
BeeOnRope
  • 60,350
  • 16
  • 207
  • 386
11
votes
4 answers

Non-temporal loads and the hardware prefetcher, do they work together?

When executing a series of _mm_stream_load_si128() calls (MOVNTDQA) from consecutive memory locations, will the hardware pre-fetcher still kick-in, or should I use explicit software prefetching (with NTA hint) in order to obtain the benefits of…
BlueStrat
  • 2,202
  • 17
  • 27
10
votes
0 answers

Is there a special benefit to consuming whole cache lines between iterations of a loop?

My program adds float arrays and is unrolled 4x when compiled with max optimizations by MSVC and G++. I didn't understand why both compilers chose to unroll 4x so I did some testing and found only occasionally a t-test on runtimes for manually…
Matt
  • 179
  • 6
10
votes
2 answers

How do the store buffer and Line Fill Buffer interact with each other?

I was reading the MDS attack paper RIDL: Rogue In-Flight Data Load. They discuss how the Line Fill Buffer can cause leakage of data. There is the About the RIDL vulnerabilities and the "replaying" of loads question that discusses the…
Daniel Näslund
  • 2,300
  • 3
  • 22
  • 27
10
votes
3 answers

Is there a way to check whether the processor cache has been flushed recently?

On i386 linux. Preferably in c/(c/posix std libs)/proc if possible. If not is there any piece of assembly or third party library that can do this? Edit: I'm trying to develop test whether a kernel module clear a cache line or the whole…
Roman A. Taycher
  • 18,619
  • 19
  • 86
  • 141
10
votes
2 answers

Are there any modern CPUs where a cached byte store is actually slower than a word store?

It's a common claim that a byte store into cache may result in an internal read-modify-write cycle, or otherwise hurt throughput or latency vs. storing a full register. But I've never seen any examples. No x86 CPUs are like this, and I think all…
Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
10
votes
3 answers

Objective-C cpu cache behavior

Apple provides some documentation about synchronizing variables and even order of execution. What I don't see is any documentation on CPU cache behavior. What guarantees and control does the Objective-C developer have to ensure cache coherence…
John K
  • 859
  • 1
  • 8
  • 16
10
votes
1 answer

Avoiding cache pollution while loading a stream of numbers

On x86 processors is there a way to load data from regular write back memory into registers without going through the cache hierarchy? My use case is that I have a big look up structure (Hash map or B-Tree). I am working through a large stream of…
Rajiv
  • 2,587
  • 2
  • 22
  • 33
10
votes
5 answers

Do memory allocation functions indicate that the memory content is no longer used?

When processing some stream of data, e.g., requests from a network, it is quite common that some temporary memory is used. For example, a URL may be split into multiple strings, each one possibly allocating memory from the heap. The use of these…