Questions tagged [cpu-cache]

A CPU cache is a hardware structure used by the CPU to reduce the average memory access time.

Caching is beneficial whenever data elements are re-used.


Caching is a general policy
aimed at eliminating the latency
of repeated accesses to some already-visited
but otherwise "expensive" ( read: slow ) resource ( storage )


Caching does not speed up the memory itself.

The most a programmer can achieve is to exercise due care to allow latency masking in a concurrent mode of code execution: issue instructions well before the moment the memory data is actually consumed, so that the cache management can evict a least-recently-used (LRU) line and prefetch the requested data from slow DRAM in time.
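The latency-masking idea above can be sketched with an explicit software-prefetch hint. This is a minimal sketch, not a guaranteed win: `__builtin_prefetch` is a GCC/Clang builtin (its second and third arguments mean read-access and low temporal locality), and the prefetch distance of 16 iterations is an assumed tuning value that varies per machine and access pattern:

```c
#include <stddef.h>

/* Sum an array while hinting the cache: issue a prefetch for data a few
 * iterations ahead, so the DRAM fetch overlaps with useful work instead
 * of stalling the consuming instruction. */
#define PREFETCH_DISTANCE 16  /* assumed tuning value, machine-dependent */

long sum_with_prefetch(const long *a, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; ++i) {
        if (i + PREFETCH_DISTANCE < n)
            __builtin_prefetch(&a[i + PREFETCH_DISTANCE], /*rw=*/0, /*locality=*/1);
        sum += a[i];
    }
    return sum;
}
```

On arrays small enough to fit in cache, or with hardware prefetchers already tracking the stream, the hint is redundant; it helps mainly for large working sets or irregular strides the hardware cannot predict.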


How it works?

Main memory is usually built with DRAM technology, which allows for big, dense and cheap storage structures. But DRAM access is much slower than the cycle time of a modern CPU (the so-called memory wall). A CPU cache is a smaller memory, usually built with SRAM technology (expensive, but fast), that reduces the number of accesses to main memory by storing the main-memory contents that are likely to be referenced in the near future. Caches exploit a property of programs, the principle of locality: adjacent memory addresses are likely to be referenced close together in time (spatial locality), and an address that is referenced once is likely to be referenced again soon (temporal locality).
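The principle of locality is easy to demonstrate: the two functions below compute the same sum of a matrix, but the row-major walk exploits spatial locality while the column-major walk defeats it. The 1024×1024 geometry and the 64-byte line size in the comments are illustrative assumptions, not properties of any particular machine:

```c
#include <stddef.h>

#define ROWS 1024
#define COLS 1024

/* Row-major traversal touches consecutive addresses, so each 64-byte
 * cache line fetched from DRAM supplies several consecutive elements
 * (spatial locality). */
long sum_row_major(const int m[ROWS][COLS])
{
    long sum = 0;
    for (size_t i = 0; i < ROWS; ++i)
        for (size_t j = 0; j < COLS; ++j)
            sum += m[i][j];
    return sum;
}

/* Column-major traversal jumps COLS * sizeof(int) bytes between
 * accesses; for a matrix larger than the cache, each fetched line is
 * used for a single element before eviction, so the same arithmetic
 * typically runs several times slower. */
long sum_col_major(const int m[ROWS][COLS])
{
    long sum = 0;
    for (size_t j = 0; j < COLS; ++j)
        for (size_t i = 0; i < ROWS; ++i)
            sum += m[i][j];
    return sum;
}
```

Both loops execute the same number of additions; only the memory-access order differs, which is exactly what several of the questions below measure.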

Each cache line is tagged with an address, stored in extra SRAM cells. These tag cells record which memory address the cached data belongs to; because the cache can never mirror the entire system memory, this address must be kept alongside the data. Part of the address forms an index that selects a set within the cache array. The index and the tag can each use either physical or virtual (MMU-translated) addresses, leading to the three cache types: PIPT, VIVT and VIPT.
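The tag/index/offset split can be sketched for a hypothetical geometry; a 32 KiB, 8-way cache with 64-byte lines (hence 64 sets) is assumed here, and real caches differ:

```c
#include <stdint.h>

/* Decompose an address for an assumed 32 KiB, 8-way cache with 64-byte
 * lines: 32 KiB / 64 B / 8 ways = 64 sets, so 6 offset bits and 6 index
 * bits; the remaining high bits are the tag stored in the SRAM tag cells. */
#define LINE_SIZE   64u   /* bytes per cache line          */
#define NUM_SETS    64u   /* sets in the cache array       */
#define OFFSET_BITS 6u    /* log2(LINE_SIZE)               */
#define INDEX_BITS  6u    /* log2(NUM_SETS)                */

static inline uint32_t cache_offset(uint32_t addr)  /* byte within the line */
{
    return addr & (LINE_SIZE - 1);
}

static inline uint32_t cache_index(uint32_t addr)   /* which set to search */
{
    return (addr >> OFFSET_BITS) & (NUM_SETS - 1);
}

static inline uint32_t cache_tag(uint32_t addr)     /* compared against tag cells */
{
    return addr >> (OFFSET_BITS + INDEX_BITS);
}
```

Whether `addr` here is a physical or a virtual address is exactly the PIPT/VIVT/VIPT distinction: a VIPT cache, for example, indexes with the virtual address (so set selection can start before the TLB answers) but compares tags against the physical one.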

Modern CPUs contain multiple levels of cache. In SMP systems a cache level may be private to a single CPU, shared by a cluster of CPUs, or shared by the whole system. Because caching can leave multiple copies of the same data in an SMP system, cache-coherence protocols are used to keep the copies consistent. VIVT and VIPT caches can also interact with the MMU (and its own cache, commonly called the TLB).

Questions regarding CPU cache inconsistencies, profiling, or under-utilization are on-topic.

For more information see Wikipedia's CPU cache article.


1011 questions
14
votes
2 answers

Concept of "block size" in a cache

I am just beginning to learn the concept of Direct mapped and Set Associative Caches. I have some very elementary doubts . Here goes. Supposing addresses are 32 bits long, and i have a 32KB cache with 64Byte block size and 512 frames, how much…
hektor
  • 1,017
  • 3
  • 14
  • 28
14
votes
1 answer

How does the CPU cache affect the performance of a C program

I am trying to understand more about how CPU cache affects performance. As a simple test I am summing the values of the first column of a matrix with varying numbers of total columns. // compiled with: gcc -Wall -Wextra -Ofast -march=native…
koipond
  • 306
  • 2
  • 8
14
votes
2 answers

How to explain poor performance on Xeon processors for a loop with both sequential copy and a scattered store?

I stumbled upon a peculiar performance issue when running the following c++ code on some Intel Xeon processors: // array_a contains permutation of [0, n - 1] // array_b and inverse are initialized arrays for (int i = 0; i < n; ++i) { array_b[i] =…
14
votes
2 answers

Why isn't there a data bus which is as wide as the cache line size?

When a cache miss occurs, the CPU fetches a whole cache line from main memory into the cache hierarchy. (typically 64 bytes on x86_64) This is done via a data bus, which is only 8 byte wide on modern 64 bit systems. (since the word size is 8…
Mike76
  • 899
  • 1
  • 9
  • 31
14
votes
1 answer

Cache-as-Ram (no fill mode) Executable Code

I have read about cache-as-ram mode (no-fill mode) numerous times and am wondering whether number one, can executable code be written and jumped to and if so is the executable code restricted to half of the level one cache (since the cache is really…
n00ax
  • 307
  • 3
  • 7
14
votes
2 answers

Optimising Java objects for CPU cache line efficiency

I'm writing a library where: It will need to run on a wide range of different platforms / Java implementations (the common case is likely to be OpenJDK or Oracle Java on Intel 64 bit machines with Windows or Linux) Achieving high performance is a…
mikera
  • 105,238
  • 25
  • 256
  • 415
13
votes
1 answer

Performance when Generating CPU Cache Misses

I am trying to learn about CPU cache performance in the world of .NET. Specifically I am working through Igor Ostovsky's article about Processor Cache Effects. I have gone through the first three examples in his article and have recorded results…
Jason Moore
  • 3,294
  • 15
  • 18
13
votes
2 answers

Definition/meaning of Aliasing? (CPU cache architectures)

I'm a little confused by the meaning of "Aliasing" between CPU-cache and Physical address. First I found It's definition on Wikipedia : However, VIVT suffers from aliasing problems, where several different virtual addresses may refer to the same…
wuxb
  • 2,572
  • 1
  • 21
  • 30
13
votes
3 answers

What specifically marks an x86 cache line as dirty - any write, or is an explicit change required?

This question is specifically aimed at modern x86-64 cache coherent architectures - I appreciate the answer can be different on other CPUs. If I write to memory, the MESI protocol requires that the cache line is first read into cache, then modified…
Tim
  • 916
  • 7
  • 21
13
votes
1 answer

Why does CLFLUSH exist in x86?

I recently learned about the row hammer attack. In order to perform this attack the programmer needs to flush the complete cache hierarchy of a CPU for a specific number of addresses. My question is: why is CLFLUSH necessary in x86? What are the…
13
votes
2 answers

Will a modern processor (like the i7) follow pointers and prefetch their data while iterating over a list of them?

I want to learn how to write better code that takes advantage of the CPU's cache. Working with contiguous memory seems to be the ideal situation. That being said, I'm curious if there are similar improvements that can be made with non-contiguous…
Jonathan
  • 752
  • 1
  • 9
  • 19
13
votes
3 answers

CUDA disable L1 cache only for one variable

Is there any way on CUDA 2.0 devices to disable L1 cache only for one specific variable? I know that one can disable L1 cache at compile time adding the flag -Xptxas -dlcm=cg to nvcc for all memory operations. However, I want to disable cache only…
zeus2
  • 309
  • 2
  • 11
12
votes
1 answer

Should the cache padding size of x86-64 be 128 bytes?

I found a comment from crossbeam. Starting from Intel's Sandy Bridge, spatial prefetcher is now pulling pairs of 64-byte cache lines at a time, so we have to align to 128 bytes rather than…
QuarticCat
  • 1,314
  • 6
  • 20
12
votes
2 answers

In which condition DCU prefetcher start prefetching?

I am reading about different prefetcher available in Intel Core i7 system. I have performed experiments to understand when these prefetchers are invoked. These are my findings L1 IP prefetchers starts prefetching after 3 cache misses. It…
bholanath
  • 1,699
  • 1
  • 22
  • 40
12
votes
2 answers

Is there a way to flush the entire CPU cache related to a program?

On x86-64 platforms, the CLFLUSH assembly instruction allows to flush the cache line corresponding to a given address. Instead of flushing the cache related to a specific address, would there be a way to flush the entire cache (either the cache…
Vincent
  • 57,703
  • 61
  • 205
  • 388