Questions tagged [cpu-cache]

A CPU-cache is a hardware structure used by the CPU to reduce the average memory access time.

Caching is beneficial only when data elements get re-used.


Caching is a general policy aimed at eliminating latency on repeated accesses: the cost of fetching from an otherwise "expensive" (read: slow) resource (storage) is paid once, and subsequent re-accesses to the already-visited data are served from the faster cache.


Caching does not speed up memory access itself.

The most a professional programmer can achieve is to exercise due care and allow for latency masking in concurrent code execution: issue memory accesses (or explicit prefetches) well before the data is actually consumed, so that the cache management can evict a least-recently-used (LRU) part and pre-fetch the requested data from slow DRAM in time.
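A minimal sketch of such latency masking, assuming a GCC or Clang toolchain (the `__builtin_prefetch` intrinsic is compiler-specific, and the prefetch distance of 16 is an illustrative guess that would need per-machine tuning):

```c
#include <stddef.h>

/* Sketch: issue a prefetch a fixed distance ahead of the consuming
 * access, so the cache line can travel from DRAM while earlier
 * iterations execute. The hint does not affect correctness; a CPU
 * that ignores it still computes the same sum, just slower. */
long sum_with_prefetch(const int *a, size_t n) {
    long s = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + 16 < n)
            __builtin_prefetch(&a[i + 16], /* rw = read */ 0, /* locality */ 3);
        s += a[i];
    }
    return s;
}
```

Whether this helps at all depends on the access pattern: hardware prefetchers already handle simple sequential walks well, so software prefetching mostly pays off for irregular, pointer-chasing access.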


How does it work?

Main memory is usually built with DRAM technology, which allows for big, dense and cheap storage structures. But DRAM access is much slower than the cycle time of a modern CPU (the so-called memory wall). A CPU-cache is a smaller memory, usually built with SRAM technology (expensive, but fast), that reduces the number of accesses to main memory by storing the main-memory contents that are likely to be referenced in the near future. Caches exploit a property of programs, the principle of locality: adjacent memory addresses are likely to be referenced close together in time (spatial locality), and an address that is referenced once is likely to be referenced again soon (temporal locality).
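The effect of spatial locality can be seen in how traversal order changes miss counts. In this sketch both functions compute the same sum; the matrix size is an illustrative assumption:

```c
#define N 256

/* Row-major walk: consecutive elements share a cache line, so spatial
 * locality gives roughly one miss per line. */
long sum_row_major(int m[N][N]) {
    long s = 0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += m[i][j];
    return s;
}

/* Column-major walk: each access jumps N*sizeof(int) bytes, touching a
 * different line every time once the working set exceeds the cache.
 * Same result, many more misses. */
long sum_col_major(int m[N][N]) {
    long s = 0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += m[i][j];
    return s;
}

/* Self-check: fill a matrix deterministically and confirm both walks
 * agree on the known total (sum of i+j over 256x256 = 16711680). */
int walks_agree(void) {
    static int m[N][N];
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            m[i][j] = i + j;
    return sum_row_major(m) == 16711680L && sum_col_major(m) == 16711680L;
}
```

Timing the two walks on a matrix much larger than the last-level cache is a classic way to observe the memory wall directly.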

Each entry in the CPU cache is tagged with an address, held in extra SRAM cells. These tag cells indicate which specific memory address the stored data belongs to: since the cache can never mirror the entire system memory, this address must be stored alongside the data. The index into the array selects a set. The index and the tag can each use either physical or virtual (MMU-translated) addresses, leading to the three types PIPT, VIVT and VIPT.
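With a concrete geometry, assumed here for illustration (64-byte lines and 64 sets, e.g. a 32 KiB 8-way cache), the split of an address into offset, index and tag looks like:

```c
#include <stdint.h>

/* Assumed geometry: 64-byte lines, 64 sets. Both powers of two, so the
 * divisions and remainders below are simple bit-field extractions. */
#define LINE_SIZE 64u
#define NUM_SETS  64u

/* Byte offset within the line: the low log2(LINE_SIZE) bits. */
uint64_t cache_offset(uint64_t addr) { return addr % LINE_SIZE; }

/* Set index: the next log2(NUM_SETS) bits select which set to probe. */
uint64_t cache_index(uint64_t addr)  { return (addr / LINE_SIZE) % NUM_SETS; }

/* Tag: all remaining high bits, stored in the per-line tag SRAM and
 * compared on every lookup to decide hit or miss. */
uint64_t cache_tag(uint64_t addr)    { return addr / (LINE_SIZE * NUM_SETS); }
```

For PIPT the physical address is split this way; a VIPT cache instead takes the index from the virtual address (so the lookup can start before translation finishes) and the tag from the physical one.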

Modern CPUs contain multiple levels of cache. In SMP systems a given cache level may be private to a single CPU, shared by a cluster of CPUs, or shared by the whole system. Because caching can leave multiple copies of the same data present in an SMP system, cache-coherence protocols are used to keep the data consistent. VIVT and VIPT caches can also interact with the MMU (and its own cache, commonly called a TLB).
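As a sketch of what such a coherence protocol tracks, here is a simplified single-line MESI state machine. Real protocols add transient states and an Invalid-to-Exclusive transition when no other cache holds the line; both are omitted here for brevity:

```c
/* MESI states for one cache line in one cache. */
typedef enum { MODIFIED, EXCLUSIVE, SHARED, INVALID } mesi_t;

/* Events: what this core does locally, or what it snoops on the bus. */
typedef enum { LOCAL_READ, LOCAL_WRITE, BUS_READ, BUS_WRITE } event_t;

/* Simplified next-state function. A local read from INVALID is assumed
 * to find another sharer, so it lands in SHARED rather than EXCLUSIVE. */
mesi_t mesi_next(mesi_t s, event_t e) {
    switch (e) {
    case LOCAL_READ:
        return (s == INVALID) ? SHARED : s;
    case LOCAL_WRITE:
        /* From SHARED or INVALID this also broadcasts an invalidate. */
        return MODIFIED;
    case BUS_READ:
        /* Another core reads: a MODIFIED line is written back first,
         * then both caches hold it SHARED. */
        return (s == MODIFIED || s == EXCLUSIVE) ? SHARED : s;
    case BUS_WRITE:
        return INVALID;  /* another core took ownership of the line */
    }
    return s;
}
```

The BUS_WRITE-to-INVALID transition is what makes false sharing expensive: two cores writing different variables in the same line keep invalidating each other's copy.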

Questions regarding CPU cache inconsistencies, profiling or under-utilization are on-topic.

For more information see Wikipedia's CPU-cache article.


1011 questions
28 votes • 5 answers

why are separate icache and dcache needed

Can someone please explain what do we gain by having a separate instruction cache and data cache. Any pointers to a good link explaining this will also be appreciated.
ango • 829 • 2 • 10 • 23
27 votes • 0 answers

On Skylake (SKL) why are there L2 writebacks in a read-only workload that exceeds the L3 size?

Consider the following simple code: #include #include #include #include #include int cpu_ms() { return (int)(clock() * 1000 / CLOCKS_PER_SEC); } int main(int argc, char** argv) { if (argc <…
BeeOnRope • 60,350 • 16 • 207 • 386
27 votes • 4 answers

How to avoid "heap pointer spaghetti" in dynamic graphs?

The generic problem Suppose you are coding a system that consists of a graph, plus graph rewrite rules that can be activated depending on the configuration of neighboring nodes. That is, you have a dynamic graph that grows/shrinks unpredictably…
MaiaVictor • 51,090 • 44 • 144 • 286
27 votes • 2 answers

How do Intel Xeon CPUs write to memory?

I'm trying to decide between two algorithms. One writes 8 bytes (two aligned 4-byte words) to 2 cache lines, the other writes 3 entire cache lines. If the CPU writes only the changed 8 bytes back to memory, then the first algorithm uses much less…
Eloff • 20,828 • 17 • 83 • 112
26 votes • 3 answers

Understanding CPU cache and cache line

I am trying to understand how the CPU cache is operating. Let's say we have this configuration (as an example). Cache size 1024 bytes Cache line 32 bytes 1024/32 = 32 cache lines all together. A single cache line can store 32/4 = 8 ints. 1) According to…
kirbo • 1,707 • 5 • 26 • 32
25 votes • 3 answers

Why does my 8M L3 cache not provide any benefit for arrays larger than 1M?

I was inspired by this question to write a simple program to test my machine's memory bandwidth in each cache level: Why vectorizing the loop does not have performance improvement My code uses memset to write to a buffer (or buffers) over and over…
hewy • 275 • 3 • 8
25 votes • 5 answers

How is x86 instruction cache synchronized?

I like examples, so I wrote a bit of self-modifying code in c... #include #include // linux int main(void) { unsigned char *c = mmap(NULL, 7, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE| …
Will • 2,014 • 2 • 19 • 42
20 votes • 3 answers

Difference between cache way and cache set

I am trying to learn some stuff about caches. Lets say I have a 4 way 32KB cache and 1GB of RAM. Each cache line is 32 bytes. So, I understand that the RAM will be split up into 256 4096KB pages, each one mapped to a cache set, which contains 4…
user1876942 • 1,411 • 2 • 20 • 32
19 votes • 1 answer

Is the TLB shared between multiple cores?

I've heard that TLB is maintained by the MMU not the CPU cache. Then Does One TLB exist on the CPU and is shared between all processor or each processor has its own TLB cache? Could anyone please explain relationship between MMU and L1, L2 Cache?
ruach • 1,369 • 11 • 21
18 votes • 3 answers

How can caches be defeated?

I have this question on my assignment this week, and I don't understand how the caches can be defeated, or how I can show it with an assembly program.. Can someone point me in the right direction? Show, with assembly program examples, how the two…
John • 989 • 1 • 7 • 11
18 votes • 1 answer

Which cache mapping technique is used in intel core i7 processor?

I have learned about different cache mapping techniques like direct mapping and fully associative or set associative mapping, and the trade-offs between those. (Wikipedia) But I am curious which one is used in Intel core i7 or AMD processors…
Subhadip • 423 • 8 • 16
18 votes • 2 answers

How does CLFLUSH work for an address that is not in cache yet?

We are trying to use the Intel CLFLUSH instruction to flush the cache content of a process in Linux at the userspace. We create a very simple C program that first access a large array and then call the CLFLUSH to flush the virtual address space of…
Mike • 1,841 • 2 • 18 • 34
18 votes • 7 answers

Cache-friendly copying of an array with readjustment by known index, gather, scatter

Suppose we have an array of data and another array with indexes. data = [1, 2, 3, 4, 5, 7] index = [5, 1, 4, 0, 2, 3] We want to create a new array from elements of data at position from index. Result should be [4, 2, 5, 7, 3, 1] Naive algorithm…
sh1ng • 2,808 • 4 • 24 • 38
18 votes • 2 answers

Difference Between a Direct-Mapped Cache and Fully Associative Cache

I can't quite understand the main differences between the two caches and I was wondering if someone could help me out? I know that with a fully associative cache an address can be stored on any line in the tag array and a direct-mapped cache can…
madcrazydrumma • 1,847 • 3 • 20 • 38
18 votes • 2 answers

Cache bandwidth per tick for modern CPUs

What is a speed of cache accessing for modern CPUs? How many bytes can be read or written from memory every processor clock tick by Intel P4, Core2, Corei7, AMD? Please, answer with both theoretical (width of ld/sd unit with its throughput in…
osgx • 90,338 • 53 • 357 • 513