
Assume we have 500 TB of key-value pairs and 2.5 TB of memory to cache them for future requests. The requests are somehow random.

The probability of a cache hit would be 2.5/500 = 0.5%.

I know the hit rate may increase over time if we use LFU eviction, since the more frequently requested keys will remain in the cache, raising the hit rate.

So, if the throughput of the system reading from storage is 10K QPS, then using the cache would improve the rate by 0.5% (neglecting the memory seek time).

Then the throughput would be 10,050 QPS.
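
Spelled out as a quick Python sketch (figures as above; the variable names are mine):

```python
# Naive estimate: treat each cache hit as one extra "free" request on top
# of the storage throughput. All figures are taken from the text above.
CACHE_SIZE_TB = 2.5
DATA_SIZE_TB = 500.0
STORAGE_QPS = 10_000

hit_rate = CACHE_SIZE_TB / DATA_SIZE_TB    # 0.005, i.e. 0.5% under uniform access
naive_qps = STORAGE_QPS * (1 + hit_rate)   # 10,050 QPS
print(f"hit rate: {hit_rate:.1%}, naive throughput: {naive_qps:,.0f} QPS")
```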

How efficient is using a cache in this case?

Should we go without a cache?

UPDATE

I think I made a mistake here. If we have a 100% hit rate, the throughput will be 1M QPS. If we have a 0% hit rate, the throughput will be 10K QPS.

Having a 0.5% hit ratio (assuming a linear relation) yields:

(0.5 * (1M - 10K) / 100) + 10K = 14,950 QPS

That is roughly a 50% increase in throughput.
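
The same interpolation as a quick check (a sketch; the 1M QPS ceiling is the assumption above):

```python
# Linear interpolation between the two throughput endpoints, as in the update.
# (The answer below argues this averages speeds when it should average times.)
FULL_MISS_QPS = 10_000      # 0% hit rate: every request goes to storage
FULL_HIT_QPS = 1_000_000    # 100% hit rate: every request served from memory
hit_rate = 0.005

interpolated_qps = FULL_MISS_QPS + hit_rate * (FULL_HIT_QPS - FULL_MISS_QPS)
print(f"{interpolated_qps:,.0f} QPS")  # 14,950
```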


1 Answer


"Somehow random" is the key.

If the requests are truly random, the cache is unlikely to help; your logic is correct. But in real systems, it turns out that many data stores have non-uniform, highly correlated access patterns.

This still holds for huge amounts of data. It doesn't matter how much data there is in total. It just matters how little is needed frequently.
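
To make that concrete, here is a small sketch under an assumed Zipf(s=1) popularity distribution (the distribution, the key count, and the idea of caching exactly the hottest keys, which is what LFU converges toward, are illustrative assumptions, not something the answer specifies):

```python
# How much traffic the hottest 0.5% of keys absorb under Zipf popularity.
# Key space is scaled down for the sketch; the 0.5% capacity ratio matches
# the question's 2.5 TB / 500 TB.
N_KEYS = 1_000_000
CACHE_KEYS = N_KEYS // 200     # 0.5% of the key space

weights = [1 / rank for rank in range(1, N_KEYS + 1)]   # Zipf(s=1) popularity
total_mass = sum(weights)
hot_mass = sum(weights[:CACHE_KEYS])
print(f"ideal hit rate with hottest 0.5% cached: {hot_mass / total_mass:.0%}")  # ~63%
```

So with even mildly skewed access, a cache covering 0.5% of the data can serve well over half the requests, not 0.5% of them.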

[edit] The update does not make sense. You're averaging speeds there, but you need to average the time of operations: a hit only saves the latency difference on that one request, so throughput is the reciprocal of the average per-request time, not a linear blend of the two QPS figures.
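
Concretely, if the two QPS endpoints are read as per-request latencies for a single sequential worker (an assumption: 10K QPS -> 100 µs per miss, 1M QPS -> 1 µs per hit), the correct blend looks like this:

```python
# Blend per-request *times* by hit rate, then invert to get throughput.
STORAGE_QPS = 10_000
CACHE_QPS = 1_000_000
hit_rate = 0.005

t_miss = 1 / STORAGE_QPS    # 100 µs per storage read
t_hit = 1 / CACHE_QPS       # 1 µs per cache read
avg_time = hit_rate * t_hit + (1 - hit_rate) * t_miss
print(f"{1 / avg_time:,.0f} QPS")  # ~10,050, far from the update's 14,950
```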

MSalters