
I'm talking about the LRU memory page replacement algorithm implemented in C, NOT in Java or C++.

According to the OS course notes:

OK, so how do we actually implement an LRU? Idea 1): mark everything we touch with a timestamp. Whenever we need to evict a page, we select the oldest page (= least-recently used). It turns out that this simple idea is not so good. Why? Because for every memory load, we would have to read the contents of the clock and perform a memory store! So it is clear that keeping timestamps would make the computer at least twice as slow.
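To make the notes' idea concrete, here is a minimal sketch of what I understand the timestamp scheme to be, in C. The frame count, the names lru_touch/lru_victim, and the software counter standing in for a hardware clock are all my own illustration:

```c
#include <stddef.h>
#include <stdint.h>

#define NFRAMES 4096            /* hypothetical number of physical frames */

static uint64_t ticks;          /* global logical clock */
static uint64_t stamp[NFRAMES]; /* last-use time of each frame */

/* Would have to run on EVERY memory reference that hits frame f:
   one extra load of `ticks` and one extra store to `stamp[f]`. */
static inline void lru_touch(size_t f) {
    stamp[f] = ++ticks;
}

/* Runs on a page fault: linear scan over all frames to find the
   smallest (oldest) timestamp, i.e. the least-recently used frame. */
static size_t lru_victim(void) {
    size_t victim = 0;
    for (size_t f = 1; f < NFRAMES; f++)
        if (stamp[f] < stamp[victim])
            victim = f;
    return victim;
}
```

So every ordinary load or store picks up the extra work in lru_touch, which is the doubling the notes complain about.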

Memory load and store operations should be very fast. Is it really necessary to get rid of these tiny operations?

In the case of page replacement, the overhead of loading a page from disk should be far more significant than these memory operations. Why would we actually care about an extra memory store and load?

If what the notes say isn't correct, then what is the real problem with implementing LRU with timestamps?

EDIT:

As I dig deeper, the reason I can think of is the following: these memory store and load operations happen when there is a page hit. In that case, we are not loading a page from disk, so the comparison is not valid.

Since the hit rate is expected to be very high, updates to the data structure associated with LRU must be very frequent. That's why we care about the operations repeated in the update process, e.g., memory loads and stores.
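As a back-of-the-envelope check (the numbers here are my own assumptions): at 2 GHz, a ~5 ms disk read costs about 10^7 cycles, so at one fault per million references the amortized paging cost is only ~10 cycles per reference. A single extra timestamp store costs a few cycles if it hits L1, but 100-200 cycles if it goes to main memory, so under these assumptions the per-hit bookkeeping really can rival or exceed the amortized fault cost.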

But still, I'm not convinced of how significant the memory load and store overhead actually is. There should be some measurements around. Can someone point me to them? Thanks!

Junji Zhi
  • Is this a single-threaded data structure? If not then stores from multiple cores to the same cache line can be very expensive and prevent scaling. Also, how big is the data structure? Stores to L1 hits are cheap. Stores to lines cached for write on a different NUMA node are extremely expensive. – usr Apr 25 '15 at 20:40
  • We are talking about the memory page replacement algorithm here. It's between memory and disk, not L1/L2 and memory. Why would we care about L1/L2? – Junji Zhi Apr 25 '15 at 20:46
  • What's the point of asking? Are you concerned with performance? It seems so. Therefore, cache behavior is *very* relevant. – usr Apr 25 '15 at 20:47
  • You are right. I wasn't very clear to my point. Thanks! – Junji Zhi Apr 25 '15 at 20:51

1 Answer

Memory load and store operations can be quite fast, but in most real-life cases the memory subsystem is slower - sometimes much slower - than the CPU's execution engine.

Rough numbers for memory access times:

  • L1 cache hit: 2-4 CPU cycles
  • L2 cache hit: 10-20 CPU cycles
  • L3 cache hit: 50 CPU cycles
  • Main memory access: 100-200 CPU cycles

So loads and stores cost real time. With timestamp-based LRU, every regular memory access also incurs the cost of a memory store. This alone doubles the number of memory accesses the CPU performs, and in most situations it will slow program execution. In addition, on a page eviction all the timestamps need to be read and compared, which will be quite slow.

In addition, reading and storing the timestamps constantly means they will be taking up space in the L1 or L2 caches. Space in these caches is limited, so your cache miss rate for other accesses will probably be higher, which will cost more time.

In short - LRU is quite expensive.

Craig S. Anderson
  • I understand the overhead difference between CPU and memory. But we are talking about the memory page replacement algorithm here. It's between memory and disk, not L1/L2 and memory. Why would we care about L1/L2? – Junji Zhi Apr 25 '15 at 20:45
  • I edited my comment to make it clearer. Please check it out. – Craig S. Anderson Apr 25 '15 at 20:52
  • Thanks Craig. The new edit makes it clearer. So there are 2 types of overhead: 1) updating page timestamps, 2) scanning all timestamps on eviction. Another question arises: if LRU is slow, then other algorithms should be *no faster* than LRU, e.g., CLOCK. The algorithms all need to update their data structures somehow, for example by updating the reference bit of pages, and all incur memory loads and stores. My point is, other algorithms may do better on type 2) overhead, but cannot get rid of type 1) overhead. Does that make sense? – Junji Zhi Apr 25 '15 at 21:08
  • @JunjiZhi - You are right, the type 1 overhead is there for CLOCK as well. But most CPUs have hardware support for setting the reference bit when a page is accessed so that the CPU does not need to do a separate **store** instruction. Hardware support for LRU could be implemented as well, but it takes more silicon. And the huge type #2 cost would not go away. – Craig S. Anderson Apr 25 '15 at 21:23
  • Got it. It seems when it comes to performance, hardware is always our last resort. – Junji Zhi Apr 25 '15 at 21:27
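For reference, here is a similar minimal sketch of the CLOCK approach discussed in the comments above (the frame count and the software-visible reference bits are illustrative; on real hardware the MMU sets the bit in the page-table entry):

```c
#include <stdbool.h>
#include <stddef.h>

#define NFRAMES 4096             /* hypothetical number of physical frames */

static bool referenced[NFRAMES]; /* normally set by the MMU on access */
static size_t hand;              /* the clock hand */

/* Runs only on a page fault: sweep the hand, giving each recently
   referenced frame a second chance by clearing its bit, and evict
   the first frame whose bit is already clear. */
static size_t clock_victim(void) {
    for (;;) {
        if (!referenced[hand]) {
            size_t victim = hand;
            hand = (hand + 1) % NFRAMES;
            return victim;
        }
        referenced[hand] = false;     /* second chance */
        hand = (hand + 1) % NFRAMES;
    }
}
```

The per-hit bookkeeping shrinks to a single hardware-set bit, and eviction sweeps only until it finds a clear bit instead of scanning every timestamp.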