TL;DR - this is an optimization that doesn't have much effect at small lru_cache sizes, but (see Raymond's reply) has a larger effect as your lru_cache size gets bigger.
So this piqued my interest and I decided to see if this was actually true.
First I went and read the source for the LRU cache. The CPython implementation is here: https://github.com/python/cpython/blob/master/Lib/functools.py#L723 and nothing jumped out at me as something that would operate better at power-of-two sizes.
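For reference, the pure-Python implementation is basically a dict keyed on the call arguments plus a doubly linked list that tracks recency. Here's a rough sketch of the same idea using an OrderedDict - this is not the actual functools code, just an illustration of why nothing in the data structure obviously cares whether maxsize is a power of two:

from collections import OrderedDict

def simple_lru(maxsize):
    # Toy single-argument LRU cache: a mapping plus recency order.
    # Nothing here treats power-of-two maxsize values specially.
    def decorator(func):
        cache = OrderedDict()
        def wrapper(arg):
            if arg in cache:
                cache.move_to_end(arg)      # mark as most recently used
                return cache[arg]
            result = func(arg)
            cache[arg] = result
            if len(cache) > maxsize:
                cache.popitem(last=False)   # evict the least recently used
            return result
        return wrapper
    return decorator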
So I wrote a short Python program that creates LRU caches of various sizes and then exercises each of them several times. Here's the code:
from functools import lru_cache
from collections import defaultdict
from statistics import mean
import time

def run_test(i):
    # We create a new decorated perform_calc
    @lru_cache(maxsize=i)
    def perform_calc(input):
        return input * 3.1415

    # let's run the test 5 times (so that we exercise the caching)
    for j in range(5):
        # Calculate the value for a range larger than our largest cache
        for k in range(2000):
            perform_calc(k)

values = defaultdict(list)
for t in range(10):
    print(t)
    for i in range(1, 1025):
        start = time.perf_counter()
        run_test(i)
        t = time.perf_counter() - start
        values[i].append(t)

for k, v in values.items():
    print(f"{k}\t{mean(v)}")
I ran this on a MacBook Pro under light load, with Python 3.7.7.
Here are the results:
https://docs.google.com/spreadsheets/d/1LqZHbpEL_l704w-PjZvjJ7nzDI1lx8k39GRdm3YGS6c/preview?usp=sharing

The random spikes are probably due to GC pauses or system interrupts.
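If you wanted to rule out garbage collection as the cause (I didn't do this here), one option is to disable the collector around the timed region, roughly like this (using run_test and i from the script above):

import gc

gc.disable()                      # suspend automatic garbage collection while timing
try:
    start = time.perf_counter()
    run_test(i)
    elapsed = time.perf_counter() - start
finally:
    gc.enable()                   # always turn collection back on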
At this point I realized that my code always generated cache misses, and never cache hits. What happens if we run the same thing, but always hit the cache?
I replaced the inner loop with:
    # let's run the test 5 times (so that we exercise the caching)
    for j in range(5):
        # After the first pass over range(i), every call is a cache hit
        for k in range(i):
            perform_calc(k)
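As a sanity check (not something I did in the original runs), lru_cache exposes cache_info(), which counts hits and misses, so you can confirm that only the first pass generates misses:

from functools import lru_cache

@lru_cache(maxsize=64)
def perform_calc(input):
    return input * 3.1415

for j in range(5):
    for k in range(64):
        perform_calc(k)

# 64 misses from the first pass, 4 * 64 = 256 hits from the remaining passes
print(perform_calc.cache_info())
# CacheInfo(hits=256, misses=64, maxsize=64, currsize=64)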
The data for this is in the same spreadsheet as above, second tab.
Let's see:

Hmm, but we don't really care about most of these numbers. Also, each test now does a different amount of work (the inner loop runs i times), so the raw timings aren't directly comparable.
What if we run it for just 2^n, 2^n + 1, and 2^n - 1? Since testing far fewer sizes speeds things up, we'll average over 100 runs instead of just 10.
We'll also generate a large shuffled list of inputs to run on, so that we can expect a mix of cache hits and cache misses.
from functools import lru_cache
from collections import defaultdict
from statistics import mean
import time
import random

# Eight copies of 0..127, shuffled, so repeated values give us cache hits
rands = list(range(128)) * 8
random.shuffle(rands)

def run_test(i):
    # We create a new decorated perform_calc
    @lru_cache(maxsize=i)
    def perform_calc(input):
        return input * 3.1415

    # let's run the test 5 times (so that we exercise the caching)
    for j in range(5):
        for k in rands:
            perform_calc(k)

values = defaultdict(list)
for t in range(100):
    print(t)
    # Interesting cache sizes: powers of two and their immediate neighbours
    for i in [15, 16, 17, 31, 32, 33, 63, 64, 65, 127, 128, 129,
              255, 256, 257, 511, 512, 513, 1023, 1024, 1025]:
        start = time.perf_counter()
        run_test(i)
        t = time.perf_counter() - start
        values[i].append(t)

for k, v in values.items():
    print(f"{k}\t{mean(v)}")
Data for this is in the third tab of the spreadsheet above.
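To turn those totals into a per-element time, divide each run's total by the number of perform_calc calls made in that run (5 passes over the 1024-element rands list) - roughly like this, using values and rands from the script above:

calls_per_run = 5 * len(rands)    # every run makes the same number of calls

for size, timings in sorted(values.items()):
    per_element = mean(timings) / calls_per_run
    print(f"{size}\t{per_element:.3e} s per call")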
Here's a graph of the average time per element against LRU cache size:

Time, of course, decreases as our cache size gets larger, since we spend less time performing calculations. The interesting thing is that there does seem to be a dip from 15 to 16 and 17, and from 31 to 32 and 33. Let's zoom in on the higher numbers:

Not only do we lose that pattern in the higher numbers, but we actually see that performance decreases for some of the powers of two (511 to 512, 513).
Edit: The note about power-of-two was added in 2012, but the algorithm for functools.lru_cache looks the same at that commit, so unfortunately that disproves my theory that the algorithm has changed and the docs are out of date.
Edit: Removed my hypotheses. The original author replied above - the problem with my code is that I was working with "small" caches - meaning that the O(n) resize on the dicts was not very expensive. It would be cool to experiment with very large lru_caches and lots of cache misses to see if we can get the effect to appear.
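For anyone who wants to try that, a rough sketch of such an experiment might look like the following - the sizes and input range here are made up, the idea is just a cache large enough that growing its internal dict is expensive, fed with inputs spread wide enough to keep generating misses:

from functools import lru_cache
import random
import time

def run_big_test(size, n_calls=2_000_000):
    @lru_cache(maxsize=size)
    def perform_calc(x):
        return x * 3.1415

    # Draw from a range much larger than the cache so misses keep happening
    # and the cache keeps filling up toward maxsize.
    inputs = [random.randrange(size * 4) for _ in range(n_calls)]
    start = time.perf_counter()
    for x in inputs:
        perform_calc(x)
    return time.perf_counter() - start

for size in (2**20 - 1, 2**20, 2**20 + 1):
    print(size, run_big_test(size))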