
I have an application that involves a collection of arrays which can be very large (indices up to the maximum value of an int), but which are lazy - their contents are calculated on the fly and are not actually known until requested. The arrays are also immutable - the value of each element of each array is constant throughout the life of the program. The arrays are sparse in the sense that often only a small subset of the elements is ever requested (the arrays do not contain large blocks of zeros and are not "sparse" in that sense).

Looking up (and possibly calculating in the process) an array element can be expensive, so I want to add a caching layer. The cache should implement the following interface:

void point_cache_store (gpointer data, gsize idx, gdouble value);
gdouble point_cache_fetch (gpointer data, gsize idx);

where data serves as a unique handle for each array (there can be many of these). point_cache_fetch() should return the value argument passed to point_cache_store() with the same data and idx arguments, or indicate a cache miss by returning the special value DATUM_UNKNOWN_VALUE (the caller will never call point_cache_store with DATUM_UNKNOWN_VALUE).
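
For concreteness, the calling pattern I have in mind looks roughly like this (`array_compute_value()` is a placeholder name for the expensive lookup, not a function that actually exists in the code base):

/* Illustrative caller-side pattern only; array_compute_value() is a
 * made-up stand-in for the expensive on-the-fly calculation. */
gdouble
get_point (gpointer data, gsize idx)
{
  gdouble value = point_cache_fetch (data, idx);
  if (value == DATUM_UNKNOWN_VALUE)             /* cache miss */
    {
      value = array_compute_value (data, idx);  /* expensive path */
      point_cache_store (data, idx, value);
    }
  return value;
}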

The question is: how can I implement point_cache_fetch() and point_cache_store()? (They are currently no-op stubs.)

Points to consider:

  • The cache implementation must be thread-safe. Several threads are running simultaneously and any of these can call point_cache_store() or point_cache_fetch() with any data or idx arguments.
  • The cache truly is a cache; it's always OK for point_cache_fetch() to return DATUM_UNKNOWN_VALUE, even if it once knew that value. The caller will just perform an ordinary lookup in that case.
  • Remember, the arrays are immutable - for given data and idx arguments, the caller will always provide the same value argument.

I realize that there are many ways to do this and that there are tradeoffs involved. For this question, though, I am going to evaluate answers by one very specific criterion: whether they improve performance in one particular benchmark in the application that inspired the question. If you want to go the extra mile and run the benchmark yourself, here is how to do it:

git clone git://github.com/gbenison/starparse
git clone git://github.com/gbenison/burrow-owl.git -b point-cache-base

The functions point_cache_fetch() and point_cache_store() are found in "burrow/spectrum/point_cache.c". The relevant benchmark is "benchmarks/b_cache".

gcbenison
  • If the cache can forget items, then your cache interface needs to add a way to free items returned from the cache. – sbridges May 11 '12 at 03:08
  • @sbridges What's there to free? `point_cache_fetch` just returns a `double`. – gcbenison May 11 '12 at 11:06
  • What is the width of gsize? I assume that gpointer is a pointer and gdouble is a double? – Jonathan Leonard May 15 '12 at 17:42
  • @jonathanLeonard On most platforms, I think that yes, a gdouble is just a double and a gpointer is just a pointer. The names come from [glib](http://developer.gnome.org/glib/2.30/glib-Basic-Types.html) and are supposed to be part of a platform-independent set of types. `gsize` is `unsigned long` – gcbenison May 15 '12 at 18:00
  • What is the distribution of idxs? I know you said the arrays are sparse but are the used elements roughly evenly distributed? – Jonathan Leonard May 16 '12 at 01:21
  • @jonathan Unfortunately that varies - for some arrays, the access pattern will tend to cover short, contiguous blocks, but for others, the pattern will be strided (e.g. access every 1000th element). It's on the to-do list to instrument the code better to get some measurements of actual access patterns. – gcbenison May 16 '12 at 03:08
  • Are you sure you need a cache? You state "contents are calculated on the fly and are not actually known until requested"; if the data is predictably reused then a cache may be useful, but that does not seem to be the case from your description. – Dtyree May 16 '12 at 14:03
  • @dtyree No, I am not sure that in all use cases, the cache will be a net win vs. always recalculating the values; I do think that sometimes it will. It probably varies quite a bit between use cases. I do need to collect more data on how often the same values are fetched. – gcbenison May 16 '12 at 17:28
  • I would recommend either a b-tree or red-black tree. I'm sure you can find implementations of these in C. – Jonathan Leonard May 17 '12 at 06:39

1 Answer


A "very large sparse lazy array"? Sounds like you need a hash table.

From your point_cache_fetch function prototype and all through your question, I am confused about whether your cached values are doubles or immutable arrays.

I'm not going to provide an implementation, as this is not a 'coding challenge' website. You should try to find and reuse existing thread-safe hash table libraries and compare their performance for your specific needs.
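
That said, here is the rough shape a GHashTable-based solution could take - a minimal sketch only, assuming GLib >= 2.32 (for the statically allocated GMutex), no eviction policy, and that DATUM_UNKNOWN_VALUE is defined elsewhere in your code base:

/*
 * Sketch: a two-level hash table (array handle -> (index -> value))
 * protected by one global mutex.  Not tuned for your benchmark.
 */
#include <glib.h>

static GMutex      cache_lock;               /* zero-initialized is valid in GLib >= 2.32 */
static GHashTable *cache_by_array = NULL;    /* gpointer data -> inner GHashTable* */

/* Must be called with cache_lock held. */
static GHashTable *
get_array_table (gpointer data)
{
  GHashTable *inner;

  if (cache_by_array == NULL)
    cache_by_array = g_hash_table_new (g_direct_hash, g_direct_equal);

  inner = g_hash_table_lookup (cache_by_array, data);
  if (inner == NULL)
    {
      /* inner table: idx (stored as a pointer) -> heap-allocated gdouble */
      inner = g_hash_table_new_full (g_direct_hash, g_direct_equal, NULL, g_free);
      g_hash_table_insert (cache_by_array, data, inner);
    }
  return inner;
}

void
point_cache_store (gpointer data, gsize idx, gdouble value)
{
  gdouble *boxed = g_new (gdouble, 1);
  *boxed = value;

  g_mutex_lock (&cache_lock);
  g_hash_table_insert (get_array_table (data), GSIZE_TO_POINTER (idx), boxed);
  g_mutex_unlock (&cache_lock);
}

gdouble
point_cache_fetch (gpointer data, gsize idx)
{
  gdouble *boxed;
  gdouble  result = DATUM_UNKNOWN_VALUE;

  g_mutex_lock (&cache_lock);
  boxed = g_hash_table_lookup (get_array_table (data), GSIZE_TO_POINTER (idx));
  if (boxed != NULL)
    result = *boxed;
  g_mutex_unlock (&cache_lock);

  return result;
}

Note that the single global mutex serializes every fetch and store; if the benchmark turns out to be contention-bound, per-array locks or a GRWLock (reads will dominate once the cache is warm) would be the first refinement to try.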

Eldritch Conundrum
  • I agree with Eldritch Conundrum on the hashmap. A storage entity that held the data while offering callback functions for specific calculations could suit the problem. – Dtyree May 16 '12 at 14:21
  • @eldritchconundrum First of all, cool name. Second - what's being cached are immutable arrays of doubles. – gcbenison May 16 '12 at 17:23
  • Know of any good threadsafe hashtable libraries in C? Do they take advantage of immutability somehow? – gcbenison May 16 '12 at 23:09
  • I see... the large sparse lazy arrays are what's cached. Then you should consider replacing them with hash tables. So, you'd have a hash table of hash tables. Maybe you can somehow combine `data` and `idx` so that you end up with only one hash table. – Eldritch Conundrum May 20 '12 at 16:24
  • I think you shouldn't insist on taking advantage of immutability here. Your arrays are lazily initialized, which makes them... actually mutable. – Eldritch Conundrum May 20 '12 at 16:25
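
For reference, the single-table variant suggested in the last comments - combining data and idx into one key - could be sketched like this (the PointKey struct and its hash/equal functions are illustrative names, not from the actual code base):

/* Sketch of a combined (data, idx) key for a single GHashTable. */
#include <glib.h>

typedef struct {
  gpointer data;
  gsize    idx;
} PointKey;

static guint
point_key_hash (gconstpointer key)
{
  const PointKey *k = key;
  /* mix the array handle and the index; any reasonable mixer will do */
  return g_direct_hash (k->data) ^ (guint) (k->idx * 2654435761u);
}

static gboolean
point_key_equal (gconstpointer a, gconstpointer b)
{
  const PointKey *ka = a, *kb = b;
  return ka->data == kb->data && ka->idx == kb->idx;
}

/* Keys and values would be heap-allocated and owned by the table, e.g.:
 *   GHashTable *cache = g_hash_table_new_full (point_key_hash, point_key_equal,
 *                                              g_free, g_free);
 * Lookups can use a stack-allocated PointKey; the same global mutex as in the
 * two-level sketch would still be needed for thread safety.
 */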