To explain a bit more:

While reading about LRU (Least Recently Used) cache implementations, I came across an O(1) solution that uses an `unordered_map` (C++) and a doubly linked list. This is very efficient: looking up an element in the map is essentially O(1) (thanks to hashing), and moving or deleting a node in the doubly linked list is also O(1).

It is well explained here.
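
For concreteness, here is a minimal sketch of the design I mean (the class and member names are my own invention):

```cpp
#include <cstddef>
#include <list>
#include <unordered_map>
#include <utility>

// Minimal sketch of the O(1) software LRU described above.
// A std::list keeps the keys in recency order (front = most recently used);
// the unordered_map maps each key to its value plus its list node, so both
// lookup and "move to front" are O(1).
class LRUCache {
    std::size_t capacity_;
    std::list<int> order_;  // keys, most recently used first
    std::unordered_map<int, std::pair<int, std::list<int>::iterator>> map_;

public:
    explicit LRUCache(std::size_t capacity) : capacity_(capacity) {}

    // Returns true and writes the value if the key is cached.
    bool get(int key, int& value) {
        auto it = map_.find(key);
        if (it == map_.end()) return false;
        // Move the accessed key to the front (most recently used).
        order_.splice(order_.begin(), order_, it->second.second);
        value = it->second.first;
        return true;
    }

    void put(int key, int value) {
        auto it = map_.find(key);
        if (it != map_.end()) {
            // Update in place and mark as most recently used.
            it->second.first = value;
            order_.splice(order_.begin(), order_, it->second.second);
            return;
        }
        if (map_.size() == capacity_) {
            // Evict the least recently used key (back of the list).
            map_.erase(order_.back());
            order_.pop_back();
        }
        order_.push_front(key);
        map_[key] = {value, order_.begin()};
    }
};
```

(`splice` relinks the node without invalidating the iterators stored in the map, which is what makes the move-to-front O(1).)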

But then, after more research, I came across How is an LRU cache implemented in a CPU?

"For larger associativity, the number of states increases dramatically: factorial of the number of ways. So a 4-way cache would have 24 states, requiring 5 bits per set and an 8-way cache would have 40,320 states, requiring 16 bits per set. In addition to the storage overhead, there is also greater overhead in updating the value"

and this
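
To sanity-check the numbers in that quote: a true-LRU policy has to distinguish every possible recency ordering of the N ways, i.e. N! states, which needs `ceil(log2(N!))` bits per set. A quick check (my own throwaway snippet):

```cpp
#include <cmath>
#include <cstdio>

// Bits per set needed by true LRU: enough to distinguish all ways! orderings.
int main() {
    const unsigned all_ways[] = {2, 4, 8};
    for (unsigned ways : all_ways) {
        unsigned long long states = 1;
        for (unsigned i = 2; i <= ways; ++i) states *= i;  // ways!
        unsigned bits = static_cast<unsigned>(std::ceil(std::log2(states)));
        std::printf("%u-way: %llu states, %u bits per set\n", ways, states, bits);
    }
}
```

This prints 24 states / 5 bits for 4-way and 40,320 states / 16 bits for 8-way, matching the quote.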

Now I am having a hard time understanding: why wouldn't the O(1) solution fix most of the problems of storing state per set and "age bits" for tracking LRU order?

My guess would be that with a hashmap there is no way to preserve the set associativity, but then each entry could be a standalone entry, and since access is O(1) it should not matter, right? The only other thing I can think of is the size of this hashmap, but I still can't figure out why that would rule it out.

Any help appreciated, thanks!

boparai
  • This isn't a real answer, but it's worth knowing that real performance often doesn't map well to big-O notation. The biggest contributors to performance are **cache-locality**, **instruction-pipelining**, **speculative-execution**, and **branch-prediction**. Things like `unordered_map` are often terrible for cache locality, since entries live in different buckets (heap allocations) scattered across memory, causing cache misses on lookup. Similarly, linked lists are _very poor_ because they are not accessed contiguously. – Human-Compiler Apr 12 '21 at 20:10
  • O(1) is notation for asymptotic runtimes, so it only describes the runtime as your input length tends to infinity. It says nothing about how fast the operation actually is; it could be 1 ns or 1 ms. – Unlikus Apr 12 '21 at 20:10
  • A linked list, while algorithmically good, is one of the worst data structures to use in practice. Every node traversal is a cache miss, so traversal runs at RAM speed. `std::unordered_map` is also implemented with a linked list under the hood, because of certain complexity guarantees the standard mandates. – NathanOliver Apr 12 '21 at 20:46
  • A hashmap + list LRU **is** used in *software*. Why do you think it isn't? – eerorika Apr 12 '21 at 21:09
  • While the complexity of an LRU hashmap + linked list may be constant in relation to the number of elements in the cache, it is unclear how you think this helps reduce complexity when the set associativity is increased. – eerorika Apr 12 '21 at 21:10
  • Hardware has different constraints than software. It is O(1) because it takes a constant number of clock cycles to access a CPU cache. A linked list doesn't make sense in circuits, while it is the most popular approach in software. – Ben Manes Apr 13 '21 at 08:33

0 Answers