
I want to cache some DataFrames in memory to speed up my program (calculate_df is slow). My code looks like this:

class Foo:
    cache = {}

    @classmethod
    def get_df(cls, bar):
        if bar not in cls.cache:
            cls.cache[bar] = cls.calculate_df(bar)
        return cls.cache[bar]

    @classmethod
    def calculate_df(cls, bar):
        ...  # slow computation
        return df

Almost all of the time, the number of possible values of bar times the size of each df fits into memory. However, I need to plan for the case where there are too many different bars, or the DataFrames are too big, and the cache causes memory issues. I want to check memory usage before I run cls.cache[bar] = cls.calculate_df(bar).

What is the right/best way to do such memory checks?

abisko
  • What is the rule that tells you how much cache space you can use? What "memory issues" are occurring, and why are they a problem? – Karl Knechtel Sep 13 '21 at 16:25
  • I haven't come across a "memory issue" yet; I'm just thinking about bad scenarios. Basically, if memory usage is too high, I want to stop adding entries to `cls.cache` and return the result directly. I don't know the rule for how much cache space I can use; I guess that's part of my question too. It seems like I can use `psutil.virtual_memory().available` (see the sketch below)? – abisko Sep 13 '21 at 16:37
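For concreteness, the check mentioned in that comment might look roughly like this. This is a sketch only: the MIN_AVAILABLE threshold of 1 GiB is an arbitrary placeholder, and calculate_df stands in for the slow method from the question.

import psutil

class Foo:
    cache = {}

    # Placeholder threshold: skip caching when less than 1 GiB is free.
    MIN_AVAILABLE = 1 * 1024 ** 3

    @classmethod
    def get_df(cls, bar):
        if bar in cls.cache:
            return cls.cache[bar]
        df = cls.calculate_df(bar)
        # available = memory that can be handed out without swapping
        if psutil.virtual_memory().available >= cls.MIN_AVAILABLE:
            cls.cache[bar] = df
        return df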

1 Answer


Instead of manually tracking memory at that level in Python, you might want to consider the decorator functools.lru_cache(), which lets you limit the number of items stored at any given time to maxsize. Once maxsize is reached, the least recently used items are evicted.

@functools.lru_cache(maxsize=128, typed=False)

Decorator to wrap a function with a memoizing callable that saves up to the maxsize most recent calls

Sample usage

from functools import lru_cache

class MyClass:
    @lru_cache(maxsize=3)           # keep at most 3 results; evict the LRU one
    def duplicate(self, num):
        print("called for", num)    # printed only on a cache miss
        return num * 2

obj = MyClass()
for num in [12, 7, 12, 12, 7, 5, 15, 5, 7, 12]:
    print(num, "=", obj.duplicate(num))

Output

called for 12
12 = 24
called for 7
7 = 14
12 = 24
12 = 24
7 = 14
called for 5
5 = 10
called for 15
15 = 30
5 = 10
7 = 14
called for 12
12 = 24
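
The wrapper that lru_cache creates also exposes cache_info() and cache_clear(), which is useful if you want to inspect hit rates or empty the cache yourself:

print(MyClass.duplicate.cache_info())
# CacheInfo(hits=5, misses=5, maxsize=3, currsize=3) after the loop above
MyClass.duplicate.cache_clear()   # drop all cached results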
  • Thank you for your suggestion. My situation is actually more complicated: 1. I want to clean up the cache myself, because I know under what circumstances entries are no longer used; my cleanup strategy can reduce the cache size by hundreds of times. 2. lru_cache only considers the number of inputs (maxsize). In my case that is only a small factor; the more important factor is the size of the calculated DataFrame. Therefore lru_cache doesn't seem like a good solution to me. What do you think? – abisko Sep 13 '21 at 16:32
  • It seems like I can hack `lru_cache` to use `use_memory_up_to`: https://stackoverflow.com/questions/23477284/memory-aware-lru-caching-in-python, but that'll be almost the same as checking cache space myself (sketched below). – abisko Sep 13 '21 at 16:38
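
For reference, a minimal sketch of the memory-aware approach these comments converge on: measure each DataFrame with df.memory_usage(deep=True).sum(), evict least-recently-used entries against a byte budget, and skip caching entirely when psutil reports low available memory. SizedCache, max_bytes, and min_available are illustrative names and thresholds, not a tested recipe.

import psutil
from collections import OrderedDict

class SizedCache:
    # Hypothetical size-aware LRU: tracks DataFrame bytes instead of
    # entry counts, and skips caching when system memory is low.
    def __init__(self, max_bytes, min_available=512 * 1024 ** 2):
        self.max_bytes = max_bytes          # byte budget for cached DataFrames
        self.min_available = min_available  # free-memory floor before caching
        self.entries = OrderedDict()        # bar -> (df, nbytes)
        self.total = 0

    def get(self, bar, calculate_df):
        if bar in self.entries:
            self.entries.move_to_end(bar)   # mark as most recently used
            return self.entries[bar][0]
        df = calculate_df(bar)
        # deep=True also counts object columns (e.g. strings)
        nbytes = int(df.memory_usage(deep=True).sum())
        if psutil.virtual_memory().available < self.min_available:
            return df                       # system already tight: don't cache
        while self.entries and self.total + nbytes > self.max_bytes:
            _, (_, old) = self.entries.popitem(last=False)  # evict LRU entry
            self.total -= old
        if nbytes <= self.max_bytes:
            self.entries[bar] = (df, nbytes)
            self.total += nbytes
        return df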