Questions tagged [cpu-cache]

A CPU cache is a hardware structure used by the CPU to reduce the average memory access time.

Caching is beneficial once some data elements get re-used.


Caching is a general policy aimed at eliminating the latency of repeatedly re-accessing an already visited, but otherwise "expensive" (read: slow), resource such as storage: the full cost is paid only on the first access.


Caching does not speed up the memory access itself.

The most a professional programmer can achieve is to exercise due care and allow some latency masking in a concurrent mode of code execution: issue instructions well before the forthcoming memory data is actually consumed, so that the cache management can evict a least-recently-used (LRU) part and pre-fetch the requested data from slow DRAM in the meantime.
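As a minimal sketch of such latency masking, assuming GCC or Clang (the function name and the prefetch distance of 16 elements are illustrative choices, not fixed rules):

```cpp
#include <cstddef>

// Sum an array while hinting the cache to fetch data needed soon.
// __builtin_prefetch is a GCC/Clang built-in; the distance of 16
// elements (two 64-byte lines of 8-byte doubles) is only a tuning guess.
double sum_with_prefetch(const double* data, std::size_t n) {
    double sum = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        if (i + 16 < n)
            __builtin_prefetch(&data[i + 16]);  // request the line early
        sum += data[i];
    }
    return sum;
}
```

Whether such a manual hint helps at all depends on the hardware prefetchers, which often detect a sequential scan like this on their own; the pattern pays off mainly for irregular access patterns they cannot predict.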


How does it work?

Main memory is usually built with DRAM technology, which allows for big, dense, and cheap storage structures. But DRAM access is much slower than the cycle time of a modern CPU (the so-called memory wall). A CPU cache is a smaller memory, usually built with SRAM technology (expensive but fast), that reduces the number of accesses to main memory by storing the main-memory contents that are likely to be referenced in the near future. Caches exploit a property of programs, the principle of locality: adjacent memory addresses are likely to be referenced close together in time (spatial locality), and an address that has been referenced once is likely to be referenced again soon (temporal locality).
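To make spatial locality concrete, here is a small illustrative sketch (the matrix size and function names are invented for the example): both loops below touch the same elements of a row-major matrix, but the first walks consecutive addresses, so one fetched cache line serves several iterations, while the second jumps a full row stride on every step and misses far more often.

```cpp
#include <vector>
#include <cstddef>

constexpr std::size_t N = 4096;  // arbitrary example size

// Row-major traversal: consecutive addresses, good spatial locality.
long long sum_rows(const std::vector<int>& m) {
    long long s = 0;
    for (std::size_t i = 0; i < N; ++i)
        for (std::size_t j = 0; j < N; ++j)
            s += m[i * N + j];   // stride of one element
    return s;
}

// Column-major traversal of the same row-major data: each access
// lands on a different cache line, so most accesses miss.
long long sum_cols(const std::vector<int>& m) {
    long long s = 0;
    for (std::size_t j = 0; j < N; ++j)
        for (std::size_t i = 0; i < N; ++i)
            s += m[i * N + j];   // stride of N elements
    return s;
}
```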

Each CPU cache entry is tagged with an address, held in extra SRAM cells. These tag cells record which specific address the entry holds data for; since the cache can never mirror the entire system memory, this address must be stored. The index into the array selects a set. The index and the tag can each use either physical or virtual (MMU-translated) addresses, leading to the three types PIPT, VIVT, and VIPT.
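To make the tag/index split concrete, the sketch below decomposes an address for an example geometry; the 64-byte line, 64-set, 8-way numbers are assumptions chosen because they match many L1d caches, not something the description above fixes. In a PIPT cache the input would be a physical address, in a VIVT cache a virtual one.

```cpp
#include <cstdint>
#include <cstdio>

// Assumed example geometry: 32 KiB, 64-byte lines, 8-way set associative
// => 32768 / 64 / 8 = 64 sets, hence 6 offset bits and 6 index bits.
constexpr uint64_t kLineBytes  = 64;
constexpr uint64_t kSets       = 64;
constexpr unsigned kOffsetBits = 6;   // log2(kLineBytes)
constexpr unsigned kIndexBits  = 6;   // log2(kSets)

void decompose(uint64_t addr) {
    uint64_t offset = addr & (kLineBytes - 1);              // byte within the line
    uint64_t index  = (addr >> kOffsetBits) & (kSets - 1);  // which set to search
    uint64_t tag    = addr >> (kOffsetBits + kIndexBits);   // what the tag SRAM stores
    std::printf("addr %#llx -> tag %#llx, set %llu, offset %llu\n",
                (unsigned long long)addr, (unsigned long long)tag,
                (unsigned long long)index, (unsigned long long)offset);
}
```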

Modern CPUs contain multiple levels of cache. In SMP systems a cache level may be private to a single CPU, shared by a cluster of CPUs, or shared by the whole system. Because caching can result in multiple copies of the same data being present in an SMP system, cache coherence protocols are used to keep the copies consistent. VIVT and VIPT caches can also interact with the MMU (and its own cache, commonly called a TLB).
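One place where coherence becomes visible to software is false sharing: when two threads write to different variables that happen to share a cache line, the coherence protocol bounces the line between cores. Below is a minimal sketch of the usual mitigation, assuming a 64-byte line (C++17's std::hardware_destructive_interference_size can replace the hard-coded 64 where available); the struct and function names are invented for the example.

```cpp
#include <atomic>
#include <thread>

// Without padding, both counters could share one cache line and
// ping-pong between cores under the coherence protocol.
struct alignas(64) PaddedCounter {      // each instance gets its own line
    std::atomic<long> value{0};
};

PaddedCounter counters[2];

void bump(int id, long iters) {
    for (long i = 0; i < iters; ++i)
        counters[id].value.fetch_add(1, std::memory_order_relaxed);
}

int main() {
    std::thread a(bump, 0, 10000000);
    std::thread b(bump, 1, 10000000);
    a.join();
    b.join();
}
```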

Questions regarding CPU cache inconsistencies, profiling or under-utilization are on-topic.

For more information see Wikipedia's CPU cache article.


1011 questions
10
votes
0 answers

More cache-friendly linked list or alternative with optimal append, delete, and ordered traversal for a limit order book?

I am trying to implement a stock matching engine/order book in C++, and am searching for a more cache friendly architecture. Currently, my data structures are as follows: An intrusive rb-tree for the limit prices. An intrusive doubly linked list…
Ronny Rildil
  • 101
  • 6
10
votes
2 answers

Is it possible to use the Linux Perf profiler inside C++ code?

I would like to measure the L1, L2 and L3 cache hit/miss ratios for some parts of my C++ code. I am not interested in using Perf for my entire application. Can Perf be used as a library inside C++? int main() { ... ... start_profiling() //…
narengi
  • 1,345
  • 3
  • 17
  • 38
10
votes
5 answers

Are Lisp lists always implemented as linked lists under the hood?

Are Lisp lists always implemented as linked lists under the hood? Is this a problem as far as processor caching goes? If so, are there solutions that use more contiguous structures which help caching?
Sam Washburn
  • 1,817
  • 3
  • 25
  • 43
10
votes
1 answer

Why is a cache read miss faster than a write miss?

I need to calculate an array (writeArray) using another array (readArray), but the problem is that the index mapping is not the same between the arrays (the value at index x of writeArray must be calculated from the value at index y of readArray), so it's not very…
Johnmph
  • 3,391
  • 24
  • 32
10
votes
1 answer

Do bank conflicts occur on non-GPU hardware?

This blog post explains how memory bank conflicts kill the transpose function's performance. Now I can't help but wonder: does the same happen on a "normal" CPU (in a multithreaded context)? Or is this specific to CUDA/OpenCL? Or does it not even appear…
rubenvb
  • 74,642
  • 33
  • 187
  • 332
10
votes
2 answers

Cache size estimation on your system?

I got this program from this link (https://gist.github.com/jiewmeng/3787223). I have been searching the web with the idea of gaining a better understanding of processor caches (L1 and L2). I want to be able to write a program that would enable me to…
liv2hak
  • 14,472
  • 53
  • 157
  • 270
10
votes
4 answers

Write a program to get CPU cache sizes and levels

I want to write a program to get my cache sizes (L1, L2, L3). I know the general idea of it: allocate a big array, then access parts of it of different sizes each time. So I wrote a little program. Here's my code: #include #include…
Kan Liu
  • 175
  • 1
  • 8
10
votes
1 answer

Lock-free check for modification of a global shared state in C using Cache-Line alignment

Edit: SO does not allow newbies to post more than two links. Sorry for the missing references. I'm trying to reduce locking overhead in a C application where detecting changes on a global state is performance-relevant. Even though I've been…
instilled
  • 123
  • 5
9
votes
1 answer

What does a 'split' cache mean, and how is it useful (if it is)?

I was doing a question on computer architecture, and it mentioned that the cache is a split cache with no hazard. What exactly does this mean?
mskanyal
  • 133
  • 1
  • 1
  • 6
9
votes
2 answers

Cache-friendly way to collect results from multiple threads

Consider N threads doing some asynchronous tasks with small result values like double or int64_t, so about 8 result values can fit in a single CPU cache line. N is equal to the number of CPU cores. On one hand, if I just allocate an array of N items,…
Serge Rogatch
  • 13,865
  • 7
  • 86
  • 158
9
votes
2 answers

Look Through vs Look aside

Suppose there are 2 caches, L1 and L2. L1: hit rate 0.8, access time 2 ns, transfer time between L1 and CPU 10 ns. L2: hit rate 0.9, access time 5 ns, transfer time between L2 and L1 100 ns. What will be the effective access…
Hemanshu Sethi
  • 139
  • 1
  • 1
  • 7
9
votes
1 answer

Loop tiling: how to choose the block size?

I am trying to learn loop optimization. I found that loop tiling helps make array loops faster. I tried the two blocks of code given below, with and without loop blocking, and measured the time taken for both. I did not find significant…
Sagar
  • 1,115
  • 2
  • 13
  • 22
9
votes
3 answers

How do non-temporal instructions work?

I'm reading the What Every Programmer Should Know About Memory PDF by Ulrich Drepper. At the beginning of part 6 there's a code fragment: #include void setbytes(char *p, int c) { __m128i i = _mm_set_epi8(c, c, c, c, c, c, c, c, …
Pawel Batko
  • 761
  • 7
  • 19
8
votes
2 answers

WBINVD instruction usage

I'm trying to use the WBINVD instruction on Linux to clear the processor's L1 cache. The following program compiles, but produces a segmentation fault when I try to run it. int main() {asm ("wbinvd"); return 1;} I'm using gcc 4.4.3 and running Linux…
roelf
  • 361
  • 2
  • 4
  • 5
8
votes
1 answer

How to explicitly load a structure into L1d cache? Weird results with INVD with CR0.CD = 1 on isolated core with/without hyperthreading

My goal is to load a static structure into the L1D cache, then perform some operations using those structure members, and once done with the operations, run invd to discard all the modified cache lines. So basically I want to create a…
user45698746
  • 305
  • 2
  • 13