Questions tagged [cpu-cache]

A CPU cache is a hardware structure used by the CPU to reduce the average memory access time.

Caching is beneficial when data elements are re-used.


Caching is a general policy aimed at eliminating the latency of repeatedly re-accessing some already visited but otherwise "expensive" (read: slow) resource, such as storage.


Caching does not speed up the underlying memory access itself; it only avoids repeating it.

The most a professional programmer can achieve is to exercise due care to allow some latency masking during concurrent code execution: issue prefetch instructions well before the data is actually consumed, so that the cache management can evict a least-recently-used (LRU) line and pre-fetch the requested data from slow DRAM in the background.
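As an illustrative sketch only, a loop might mask DRAM latency like this (GCC/Clang's __builtin_prefetch is used here; the prefetch distance of 64 elements is an arbitrary assumption, not a prescription):

```c
#include <stddef.h>

/* Illustrative prefetch distance: how many elements ahead of the
 * consuming instruction to request data.  Too small hides no latency;
 * too large evicts lines before they are used.  Tune by measurement. */
#define PREFETCH_DISTANCE 64

double sum(const double *a, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; ++i) {
        if (i + PREFETCH_DISTANCE < n)
            /* rw = 0 (read), locality = 0 (use once, little reuse) */
            __builtin_prefetch(&a[i + PREFETCH_DISTANCE], 0, 0);
        s += a[i];
    }
    return s;
}
```

On x86 the same hint can be issued with _mm_prefetch from <xmmintrin.h>. Either way it is only a hint, and the hardware prefetchers often make it redundant for a simple sequential scan like this one.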


How does it work?

Main memory is usually built with DRAM technology, which allows for big, dense and cheap storage structures. But DRAM access is much slower than the cycle time of a modern CPU (the so-called memory wall). A CPU cache is a smaller memory, usually built with SRAM technology (expensive, but fast), that reduces the number of accesses to main memory by storing the main memory contents that are likely to be referenced in the near future. Caches exploit a property of programs: the principle of locality. Adjacent memory addresses are likely to be referenced close together in time (spatial locality), and an address that has been referenced once is likely to be referenced again soon (temporal locality). See also: latency figures for memory, disk, network, etc.
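For instance, the traversal order of a matrix decides how well spatial locality is exploited. A minimal C sketch (the 1024×1024 size and the 64-byte line are illustrative assumptions, chosen so the matrix exceeds typical cache sizes):

```c
#define N 1024                    /* arbitrary size, larger than the cache */
static double m[N][N];

/* Row-major traversal: consecutive iterations touch adjacent addresses,
 * so one fetched 64-byte line serves eight doubles (spatial locality). */
double sum_rows(void)
{
    double s = 0.0;
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            s += m[i][j];
    return s;
}

/* Column-major traversal of the same data: consecutive iterations are
 * N * sizeof(double) bytes apart, so almost every access misses once
 * the matrix no longer fits in the cache. */
double sum_cols(void)
{
    double s = 0.0;
    for (int j = 0; j < N; ++j)
        for (int i = 0; i < N; ++i)
            s += m[i][j];
    return s;
}
```

Both functions do the same arithmetic; only the access pattern differs, and on typical hardware the row-major version is several times faster.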

Each cache line is tagged with an address, held in extra SRAM cells. These tag cells indicate which memory address the line currently holds; since the cache can never mirror the entire system memory, this address must be stored alongside the data. The low-order address bits index into the array, selecting a set. The index and the tag can each use either physical or virtual (MMU-translated) addresses, leading to the three types: PIPT, VIVT and VIPT.
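A minimal sketch of this address decomposition, assuming a hypothetical 32 KiB, 8-way cache with 64-byte lines (so 32768 / (8 × 64) = 64 sets; all geometry numbers and the example address are illustrative assumptions):

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical geometry: 32 KiB, 8-way, 64-byte lines -> 64 sets,
 * hence 6 index bits immediately above the 6 byte-offset bits. */
enum { OFFSET_BITS = 6, INDEX_BITS = 6 };

int main(void)
{
    uint64_t addr   = 0x7ffd1234abcdULL;                   /* example address */
    uint64_t offset = addr & ((1ULL << OFFSET_BITS) - 1);  /* byte within line */
    uint64_t index  = (addr >> OFFSET_BITS)
                      & ((1ULL << INDEX_BITS) - 1);        /* selects the set */
    uint64_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);  /* kept in tag SRAM */

    printf("tag=%#llx index=%llu offset=%llu\n",
           (unsigned long long)tag,
           (unsigned long long)index,
           (unsigned long long)offset);
    return 0;
}
```

In a PIPT cache the physical address is split this way; in a VIPT design the index bits come from the virtual address while the tag is physical.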

Modern CPUs contain multiple levels of cache. In SMP systems a cache level may be private to a single CPU, shared by a cluster of CPUs, or shared by the whole system. Because caching can result in multiple copies of data being present in an SMP system, cache coherence protocols are used to keep the data consistent. VIVT and VIPT caches can also interact with the MMU (and its own cache, commonly called a TLB).
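Coherence traffic can be provoked from ordinary code; a common demonstration is false sharing, sketched below under the assumption of 64-byte cache lines (typical for x86, but architecture-dependent). Compile as C11 with -pthread:

```c
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

#define ITERS 100000000ULL   /* arbitrary workload size */

/* Two counters in one 64-byte line: under MESI, each write by one core
 * invalidates the other core's copy, so the line ping-pongs between
 * their private caches. */
static struct {
    _Alignas(64) uint64_t shared_line[2];  /* adjacent: false sharing */
    _Alignas(64) uint64_t padded_a;        /* one line each: no ping-pong */
    _Alignas(64) uint64_t padded_b;
} c;

static void *bump(void *p)
{
    volatile uint64_t *ctr = p;
    for (uint64_t i = 0; i < ITERS; ++i)
        ++*ctr;
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    /* Time this pair, then repeat with &c.padded_a / &c.padded_b to see
     * the cost of the coherence protocol keeping one line consistent. */
    pthread_create(&t1, NULL, bump, &c.shared_line[0]);
    pthread_create(&t2, NULL, bump, &c.shared_line[1]);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("%llu %llu\n",
           (unsigned long long)c.shared_line[0],
           (unsigned long long)c.shared_line[1]);
    return 0;
}
```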

Questions regarding CPU cache inconsistencies, profiling or under-utilization are on-topic.

For more information, see Wikipedia's CPU cache article.


1011 questions
7 votes, 0 answers

Under which conditions will the L1 IP-based stride prefetcher be triggered?

Intel's website shows that there are four kinds of hardware prefetchers. The prefetcher controlled by bit 3 is the L1 stride prefetcher. I am running a test code to find the trigger condition of the stride prefetcher.…
JasperMa
7 votes, 1 answer

Reducing bus traffic for cache line invalidation

Shared-memory multiprocessing systems typically need to generate a lot of traffic for cache coherence. Core A writes to cache. Core B might later read the same memory location. Therefore, core A, even if it would otherwise have avoided writing to…
rwallace
7 votes, 1 answer

Does this prefetch256() function offer any protection against cache timing attacks on AES?

This is a borderline topic. Since I wanted to know about programming, CPU cache memory, reading CPU cache lines etc., I'm posting it here. I was implementing the AES algorithm in C/C++. Since performing GF(2^8) multiplications is computationally…
Vivekanand V
7 votes, 2 answers

How does the indexing of the Ice Lake's 48KiB L1 data cache work?

The Intel optimization manual (September 2019 revision) shows a 48 KiB, 8-way associative L1 data cache for the Ice Lake microarchitecture, with the footnote "Software-visible latency/bandwidth will vary depending on access patterns and other factors." This baffled…
Margaret Bloom
7 votes, 0 answers

Why is the L2 cache miss rate for following a randomized singly linked list not monotonic in problem size?

I'm currently reading Ulrich Drepper's "What every programmer should know about memory". The relevant chapter is available as HTML here; PDFs of the entire text are also available and easy to find. To explain the effects of the CPU cache on performance he goes…
Paul Panzer
7 votes, 1 answer

CPU affinity in virtualised environments

Is taskset for CPU affinity applicable when trying to use the L2 cache efficiently on a multi-core processor in a virtualised environment like Amazon EC2?
David Kierans
7 votes, 1 answer

VIPT Cache: Connection between TLB & Cache?

I just want to clarify the concept and couldn't find detailed enough answers that throw some light upon how everything actually works out in the hardware. Please provide any relevant details. In the case of VIPT caches, the memory request is sent in…
Uchia Itachi
7 votes, 0 answers

How to abandon (invalidate without saving) a cache line on x86_64?

As I understand it, _mm_clflush() / _mm_clflushopt() invalidates a cache line, writing it back to memory if it has been changed. Is there a way to simply abandon a cache line, without saving to memory any changes made to it? A use case is before freeing…
Serge Rogatch
7 votes, 1 answer

WC vs WB memory? Other types of memory on x86_64?

Could you describe the meanings of and differences between WC and WB memory on x86_64? For completeness, please describe the other memory types on x86_64, if any.
Serge Rogatch
7 votes, 1 answer

L2 instruction fetch misses much higher than L1 instruction fetch misses

I am generating a synthetic C benchmark aimed at causing a large number of instruction fetch misses via the following Python script: #!/usr/bin/env python import tempfile import random import sys if __name__ == '__main__': functions = list() …
Marco Guerri
7 votes, 2 answers

How to implement a cache friendly dynamic binary tree?

According to several sources, including Wikipedia, the two most common ways of implementing a binary tree are: nodes and pointers (or references), where each node explicitly holds its children; an array, where the position of child nodes is given…
Martin Drozdik
7 votes, 2 answers

Even faster inexpensive thread-safe counter?

I've read this topic: C# Thread safe fast(est) counter and have implemented this feature in my parallel code. As far as I can see it all works fine, however it has measurably increased the processing time, as in 10% or so. It's been bugging me a…
mmix
7 votes, 2 answers

What does the processor do while waiting for a main memory fetch?

Assuming L1 and L2 cache requests result in a miss, does the processor stall until main memory has been accessed? I heard about the idea of switching to another thread; if so, what is used to wake up the stalled thread?
user1223028
7 votes, 1 answer

Invalidating the CPU's cache

When my program performs a load operation with acquire semantics/store operation with release semantics or perhaps a full-fence, it invalidates the CPU's cache. My question is this: which part of the cache is actually invalidated? only the…
unknown
7 votes, 2 answers

MSI/MESI: How can we get "read miss" in shared state?

In The Cache Memory Book by Jim Handy (excerpt is below), the author gives a table describing the MESI protocol. The table looks very unclear to me, and unfortunately the text does not help. The first question (in green on the picture): Is this…
Ayrat