We have the following requirements for a cache (Java).

  • Cache entries have different priorities regarding eviction - the following properties of an entry factor into this priority:
    • When it was inserted or last used (LRU)
    • The resources needed to recalculate the entry (when evicted). This factor will change, because from time to time we get an entry from the cache and "add information to it", making it require more resources to recalculate after eviction. On this parameter the eviction priority of entries is very discrete - we do not have to support e.g. any possible long/double value. Let's just say that any entry has a priority in the range of the natural numbers 1-10 (only 10 possible eviction priority values).

I guess it could be done using a cache implementation that supports eviction-policy plugins. EHCache seems to support that. Unfortunately, Guava Cache does not.

But I am worried about performance and flexibility if the implementation actually uses only one inner cache and has to search for the entry with the lowest stay-priority whenever it evicts. If it is implemented so that an entry registers its new priority with the cache when the priority changes, and the cache maintains a priority queue, I would not be that worried. Does anyone know of a cache implementation that works this way? Does anyone know what EHCache actually does?
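To make it concrete, here is a minimal sketch of the structure I have in mind - a hash map for lookup plus an ordered set acting as the priority queue, so a priority change is just an O(log n) re-insertion. All names are made up and the synchronization is naive; it illustrates the idea, it is not a finished implementation:

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeSet;

// Minimal sketch: entries re-register with the cache when their priority
// changes, and the cache keeps them ordered so eviction never has to scan.
public class PriorityEvictionCache<K, V> {

    private static final class Entry<K, V> {
        final K key;
        V value;
        int priority;    // 1-10, higher = more expensive to recalculate
        long lastAccess; // logical clock used as the LRU tie-breaker

        Entry(K key, V value, int priority, long lastAccess) {
            this.key = key;
            this.value = value;
            this.priority = priority;
            this.lastAccess = lastAccess;
        }
    }

    private final int maxSize;
    private long clock = 0;
    private final Map<K, Entry<K, V>> index = new HashMap<>();
    // Eviction order: lowest priority first, least recently used first.
    private final TreeSet<Entry<K, V>> evictionOrder = new TreeSet<>(
            Comparator.<Entry<K, V>>comparingInt(e -> e.priority)
                      .thenComparingLong(e -> e.lastAccess));

    public PriorityEvictionCache(int maxSize) {
        this.maxSize = maxSize;
    }

    public synchronized void put(K key, V value, int priority) {
        removeIfPresent(key);
        Entry<K, V> e = new Entry<>(key, value, priority, ++clock);
        index.put(key, e);
        evictionOrder.add(e);
        while (index.size() > maxSize) {
            Entry<K, V> victim = evictionOrder.pollFirst(); // cheapest to lose
            index.remove(victim.key);
        }
    }

    public synchronized V get(K key) {
        Entry<K, V> e = index.get(key);
        if (e == null) {
            return null;
        }
        evictionOrder.remove(e); // must remove before mutating the sort key
        e.lastAccess = ++clock;
        evictionOrder.add(e);
        return e.value;
    }

    // Called when "information is added" to an entry and it becomes
    // more expensive to recalculate after eviction.
    public synchronized void updatePriority(K key, int newPriority) {
        Entry<K, V> e = index.get(key);
        if (e == null) {
            return;
        }
        evictionOrder.remove(e);
        e.priority = newPriority;
        evictionOrder.add(e);
    }

    private void removeIfPresent(K key) {
        Entry<K, V> e = index.remove(key);
        if (e != null) {
            evictionOrder.remove(e);
        }
    }
}
```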

It is also hard for us to actually calculate a combined priority from the two factors mentioned above. It is hard to strike a fair balance between "how recently it was used" and "how resource-consuming it is to recalculate".

Currently we have made our own cache implementation with a list of inner caches (Guava caches). Each inner cache just uses the LRU eviction strategy. When an entry changes with respect to "how resource-consuming it is to recalculate", it moves to "the next" inner cache. This way we do not have to calculate a combined eviction priority value, and we can have different max sizes etc. on each inner cache. Actually we like the flexibility this gives, but we would prefer not to develop and maintain the cache implementation ourselves. We would rather use a cache implementation from some open-source project. Does anyone know of an open-source cache implementation that supports this multi-level inner-cache functionality? Or maybe an open-source project that would like to adopt our implementation?
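For illustration, here is a rough sketch of the multi-level idea (this is not our actual implementation; the Guava calls are real, but the class name, the promotion logic and the level sizes are simplified placeholders):

```java
import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;

import java.util.ArrayList;
import java.util.List;

// Rough sketch only: each level is a plain size-bounded Guava cache
// (LRU-like eviction). An entry is promoted to the next level when it
// becomes more expensive to recalculate, so no combined priority value
// is ever computed.
public class MultiLevelCache<K, V> {

    private final List<Cache<K, V>> levels = new ArrayList<>();

    public MultiLevelCache(long... maxSizes) { // e.g. one size per priority 1-10
        for (long size : maxSizes) {
            levels.add(CacheBuilder.newBuilder().maximumSize(size).build());
        }
    }

    public void put(K key, V value, int level) {
        levels.get(level).put(key, value);
    }

    public V get(K key) {
        for (Cache<K, V> level : levels) { // first hit wins
            V v = level.getIfPresent(key);
            if (v != null) {
                return v;
            }
        }
        return null;
    }

    // Called when "information is added" to an entry, making it costlier
    // to recalculate: move it into the next (longer-lived) inner cache.
    public void promote(K key, int fromLevel) {
        if (fromLevel + 1 >= levels.size()) {
            return; // already in the highest level
        }
        Cache<K, V> from = levels.get(fromLevel);
        V v = from.getIfPresent(key);
        if (v != null) {
            from.invalidate(key);
            levels.get(fromLevel + 1).put(key, v);
        }
    }
}
```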

Per Steffensen
  • If the "recalculation" is just burning CPU cycles (no i/o) I would not care about this too much. – laune Mar 03 '15 at 09:49
  • Long story short: You seem to want a cache with mutable "weight" of entries. Guava doesn't have this, except maybe `LoadingCache#refresh(key)`. – Markus Kull Mar 03 '15 at 09:54
  • Recalculation also requires I/O. We know that the cache makes a significant difference for us, so that is not the question. It is just that we would rather use a third-party cache fulfilling our needs than implement and maintain one ourselves – Per Steffensen Mar 03 '15 at 19:14

1 Answer

Your question is very specialized and might fall into the XY problem category. The technical approach you are asking for may not be the right solution to the problem, or it may add more complexity to the system than the benefit it brings.

The solution idea of building segmented caches and assigning the segment based on the reproduction cost seems reasonable for your problem statement. However, it may not yield a good overall optimization.

What you are asking for is that the cache should account for recency and priority in the eviction decision. This may have adverse effects, since it fails to take frequency into account. E.g. an entry with a reproduction cost of 99 may be evicted while the entry with cost 100 is kept. But what if the entry with cost 99 is accessed three times more often? Then the aggregated reproduction cost gets worse: in the worst case, recalculating it three times costs 3 × 99 = 297, more than the 100 saved by keeping the other entry.

The inherent question is: you want to keep an entry because it is important and expensive to reproduce. "Important" means it is used by the application very often, which means it is being accessed. Why should the cache evict it then?

Here are some thoughts that come to mind on how to tackle the problem differently:

Simply increase the cache size. With Guava you are constrained to the heap. Maybe use a cache with persistence to overcome this, like Infinispan or Hazelcast.

Why does the cache evict the important entry anyway? Check that the application really accesses the cache and does not hold references or impose additional caching on top. Maybe the access pattern of your application is not LRU-friendly. There are better eviction algorithms than LRU, e.g. ARC, LIRS or Clock-Pro. Maybe a modern eviction algorithm keeps your costly entries anyway.
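To give a feeling for what these algorithms build on, here is a minimal sketch of the classic Clock (second-chance) scheme. ARC, LIRS and Clock-Pro are considerably more elaborate; this is purely illustrative and not code from any of the mentioned libraries:

```java
import java.util.HashMap;
import java.util.Map;

// Clock approximates LRU: a hit only sets a reference bit, and the "hand"
// sweeps the slots, giving each recently used entry one second chance
// before it is evicted.
public class ClockCache<K, V> {

    private static final class Slot<K, V> {
        K key;
        V value;
        boolean referenced;
    }

    private final Slot<K, V>[] slots;
    private final Map<K, Integer> index = new HashMap<>();
    private int hand = 0;

    @SuppressWarnings("unchecked")
    public ClockCache(int capacity) {
        slots = (Slot<K, V>[]) new Slot[capacity];
        for (int i = 0; i < capacity; i++) {
            slots[i] = new Slot<>();
        }
    }

    public synchronized V get(K key) {
        Integer i = index.get(key);
        if (i == null) {
            return null;
        }
        slots[i].referenced = true; // a hit earns a second chance
        return slots[i].value;
    }

    public synchronized void put(K key, V value) {
        Integer existing = index.get(key);
        if (existing != null) {
            slots[existing].value = value;
            slots[existing].referenced = true;
            return;
        }
        // Sweep until a slot is empty or its reference bit is clear,
        // clearing bits as we pass; the sweep always terminates because
        // a full circle clears every bit.
        while (slots[hand].key != null && slots[hand].referenced) {
            slots[hand].referenced = false;
            hand = (hand + 1) % slots.length;
        }
        if (slots[hand].key != null) {
            index.remove(slots[hand].key); // evict the old entry
        }
        slots[hand].key = key;
        slots[hand].value = value;
        slots[hand].referenced = false;
        index.put(key, hand);
        hand = (hand + 1) % slots.length;
    }
}
```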

Isn't it possible to derive a reasonable cost estimate from the cache key? Maybe it is possible, and "good enough", to segment the caches based on the key. This way it is also more transparent which entries go into which segment.
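A hypothetical sketch of that idea - the estimator interface and all names are assumptions, and whether a useful estimate can be derived from the key is entirely domain-specific:

```java
import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;

// Hypothetical: the segment is a pure function of the key, so an entry
// never has to move between segments, and lookups need no scanning.
public class KeySegmentedCache<K, V> {

    public interface CostEstimator<T> {
        int estimateSegment(T key); // 0 .. number of segments - 1
    }

    private final Cache<K, V>[] segments;
    private final CostEstimator<K> estimator;

    @SuppressWarnings("unchecked")
    public KeySegmentedCache(CostEstimator<K> estimator, long... segmentSizes) {
        this.estimator = estimator;
        this.segments = new Cache[segmentSizes.length];
        for (int i = 0; i < segmentSizes.length; i++) {
            segments[i] = CacheBuilder.newBuilder()
                    .maximumSize(segmentSizes[i])
                    .build();
        }
    }

    public void put(K key, V value) {
        segments[estimator.estimateSegment(key)].put(key, value);
    }

    public V get(K key) {
        return segments[estimator.estimateSegment(key)].getIfPresent(key);
    }
}
```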

One final remark:

I like your thoughts on this, because I think the cache is a good place to tune the typical time-vs-space design decision. But if you start to "improve" and add prioritized eviction, then go the whole way: evaluate the different approaches and make sure that the resources used really do go down.

cruftex
  • > But what if the entry with cost 99 is accessed three times more often? The aggregated reproduction cost gets worse. It is a matter of setting the sizes of the inner caches properly. > "Important" means it is used by the application very often, and that means, it is being accessed. Why should the cache evict it then? No, importance depends on both of the mentioned factors. In practice we put millions of entries into the cache every day. It is relatively likely that a cache entry is used one or two times... – Per Steffensen Mar 11 '15 at 11:47
  • ... shortly after it was inserted (and never used again) - let's say that 10% of the entries are used within the first 10 mins. We need to insert “all data” that we come across into the cache, because we have no clue which 10% is fetched within those 10 mins. Let's say that the size of our cache and the insert frequency make entries evict after 30 mins. Then there are those “extra important” entries that are used like hundreds of times, but across an entire week or month. Every time they are used we add information to the entry, making it more expensive to “recalculate” in case of eviction. – Per Steffensen Mar 11 '15 at 11:48
  • ...We really need to keep those in cache, but it is likely that they are not used often enough (every 30 min) to automatically stay in cache. We cannot recognise those entries when first coming across them - to begin with they are just like any other cache entry. > use a cache with persistence Not worth much. What makes “recalculation” expensive is reading the info from disk anyway. > Isn't it possible to derive a reasonable cost estimate from the cache key No. We have no clue about those “important entries” when we see them for the first time – Per Steffensen Mar 11 '15 at 11:48
  • > Evaluate the different approaches and make sure that the used resources really get lower We have lots of measurements on this. We started out with one single cache, but we logged how many of the “expensive” entries we had to recalculate, and how expensive it was. Implementing the multi-level cache improved things in this area, and we have measurements showing that we are almost “perfect” now. – Per Steffensen Mar 11 '15 at 11:49
  • Hi Per, thanks for the deeper elaboration! Can you provide an access trace? In the cache2k benchmark (on github) I have a collection of access traces and compare eviction algorithms. An access trace is just a file with an ID of an accessed resource line by line. – cruftex Mar 11 '15 at 17:21
  • Actually you are, at least partly, reinventing in the field of eviction algorithms. Modern eviction algorithms work by segmenting the cache and keeping entries that had more accesses than others separately. I think one of the first was the 2Q algorithm, which came out in 1994. The new algorithms I mentioned in my answer all work with two "segments". – cruftex Mar 11 '15 at 17:35