2

I was trying to optimize the memory usage of a particular service and stumbled upon a huge dictionary cache which is queried for random entries very frequently. The problem is that this dictionary takes up more than 1 GB, and the service is almost at the 2 GB limit of a 32-bit process. The dictionary, once constructed, is never altered.

The dictionary keys and values are strings. Is there a way to compress the entire dictionary while keeping it indexable? I wrote a small POC which uses Huffman encoding, sharing codes between all entries, and is indexed on the compressed keys, but I want to know if there are any better alternatives.
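For reference, the shared-Huffman-code approach described above could look roughly like this. This is a hypothetical sketch (in Java, for illustration, though the question is about .NET): one code table is built from the character frequencies of all keys and values, and the map is indexed on the compressed keys so the raw strings are never stored. For brevity the codes are kept as bit-strings; a real implementation would pack them into byte arrays to get the actual memory savings.

```java
import java.util.*;

// Hypothetical sketch: a dictionary compressed with one Huffman code shared
// by all entries, indexed on the compressed keys. Assumes a non-empty source.
class HuffmanDict {
    private static final class Node implements Comparable<Node> {
        final int freq; final char ch; final Node left, right;
        Node(int freq, char ch, Node left, Node right) {
            this.freq = freq; this.ch = ch; this.left = left; this.right = right;
        }
        public int compareTo(Node o) { return Integer.compare(freq, o.freq); }
    }

    private final Map<Character, String> codes = new HashMap<>();
    private final Map<String, String> table = new HashMap<>();
    private final Node root;

    HuffmanDict(Map<String, String> source) {
        // Count character frequencies across every key and value.
        int[] freq = new int[Character.MAX_VALUE + 1];
        for (Map.Entry<String, String> e : source.entrySet()) {
            for (char c : e.getKey().toCharArray()) freq[c]++;
            for (char c : e.getValue().toCharArray()) freq[c]++;
        }
        PriorityQueue<Node> pq = new PriorityQueue<>();
        for (int c = 0; c < freq.length; c++)
            if (freq[c] > 0) pq.add(new Node(freq[c], (char) c, null, null));
        if (pq.size() == 1) pq.add(new Node(0, '\0', null, null)); // degenerate alphabet
        while (pq.size() > 1) {
            Node a = pq.poll(), b = pq.poll();
            pq.add(new Node(a.freq + b.freq, '\0', a, b));
        }
        root = pq.poll();
        assign(root, "");
        // Store every entry compressed; lookups re-encode the query key.
        for (Map.Entry<String, String> e : source.entrySet())
            table.put(encode(e.getKey()), encode(e.getValue()));
    }

    private void assign(Node n, String prefix) {
        if (n.left == null) { codes.put(n.ch, prefix); return; }
        assign(n.left, prefix + "0");
        assign(n.right, prefix + "1");
    }

    private String encode(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) sb.append(codes.get(c));
        return sb.toString();
    }

    private String decode(String bits) {
        // Huffman codes are prefix-free, so a single tree walk decodes them.
        StringBuilder out = new StringBuilder();
        Node n = root;
        for (int i = 0; i < bits.length(); i++) {
            n = bits.charAt(i) == '0' ? n.left : n.right;
            if (n.left == null) { out.append(n.ch); n = root; }
        }
        return out.toString();
    }

    String get(String key) {
        String compressed = table.get(encode(key));
        return compressed == null ? null : decode(compressed);
    }
}
```

Since the lookup encodes the query key with the shared code, equal keys always produce equal bit-strings, so a plain hash map over the compressed keys still gives O(1) access.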

The options I've had to rule out for various reasons:

- Using a database or external storage, as it becomes extremely slow.
- Lazy loading, since all the entries get used at least once within a few minutes.
- Using a distributed cache.

v1p3r
  • 1
    What kind of keys? Are the keys similar to a certain extent? Like "ABC", "ABD", "ABE", etc.? Also, from where are the values obtained, and can they be duplicates? – Lasse V. Karlsen Apr 28 '14 at 09:38
  • 2
    Do the strings duplicate much? Interning them might be a quick win. – Adam Houldsworth Apr 28 '14 at 09:39
  • Interning is a permanent solution, but may be acceptable if the dictionary stays in memory as long as the process is running, and doesn't change, but once added to the interning table, you can't remove it. You could implement "soft interning" though if the strings are generated/retrieved from external media, and have lots of duplicates. – Lasse V. Karlsen Apr 28 '14 at 09:40
  • Lasse, Adam: it would have the same frequency distribution as normal English-language text, with a few high-frequency words in the values. – v1p3r Apr 28 '14 at 09:41
  • So not much duplication? – Lasse V. Karlsen Apr 28 '14 at 09:42
  • 1
    Note that this question is bordering on too broad for Stack Overflow, as likely only a discussion will give you the answer. – Lasse V. Karlsen Apr 28 '14 at 09:42
  • What are you storing in the values? Words? Larger text fragments? Whole documents? – Lasse V. Karlsen Apr 28 '14 at 09:44
  • I haven't given interning a try because the values are not single words but a few words which I don't think would repeat often. The keys obviously won't repeat. – v1p3r Apr 28 '14 at 09:45
  • Shouldn't you be caching only the high-frequency words? – bit Apr 28 '14 at 09:47
  • 1
    @Rahul More conceptually, move to a Least Frequently Used algorithm and move unused keys into local file storage, perhaps splitting files by initial letter. If you happen to load them from file, reserve a small section of memory and use Least Recently Used to keep that maintained, just to combat the instances where something is referenced a few times quickly, but then never again. It should go without saying that `Dictionary` alone will not give you any of this. – Adam Houldsworth Apr 28 '14 at 09:48
  • 1
  • Usually in caches it is the values, not the keys, that consume the most space, especially if you store arrays, so compressing or reusing keys won't help much. Memory isn't usually a high-cost resource nowadays; isn't it an option to move to an x64 process? Also, if you pack your cache more tightly, that will decrease read performance. – Sasha Apr 28 '14 at 09:49
  • @Rahul Depending on the access telemetry, your frequency might be measured over a small time-scale for LFU. For LRU I would do a simple time-degrade from the point of access. You will also need to be able to promote items into the LFU cache and demote items from it to the file store / LRU cache set up. If you abstract the cache implementation, the fact that it is in-memory dictionary or local file store is moot as it could be that or distributed or in another local process - the important concepts would be two tiers of cache and cache management routines to handle scaling. – Adam Houldsworth Apr 28 '14 at 09:51
  • @Rahul The added benefit of this approach is you then do not have hard requirements on memory footprint. You can provide a hungry cache for machines with ample memory, and a lean cache (with the caveat of slower performance) for low-memory scenarios. – Adam Houldsworth Apr 28 '14 at 09:56
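The two-tier LRU idea suggested in the comments above could be sketched as follows (a hypothetical Java illustration, though the question is about .NET; `coldLoader` stands in for the slower file-backed tier, and `LinkedHashMap` in access-order mode provides the LRU eviction):

```java
import java.util.*;
import java.util.function.Function;

// Hypothetical two-tier cache: a bounded in-memory LRU tier backed by a
// slower lookup (e.g. file storage). Items evicted from the hot tier are
// simply reloaded on the next miss, promoting them back into memory.
class TieredCache {
    private final Map<String, String> hot;
    private final Function<String, String> coldLoader; // e.g. read from disk

    TieredCache(int capacity, Function<String, String> coldLoader) {
        this.coldLoader = coldLoader;
        // accessOrder=true makes iteration order least-recently-used first.
        this.hot = new LinkedHashMap<String, String>(capacity, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, String> e) {
                return size() > capacity; // evict LRU entry past capacity
            }
        };
    }

    String get(String key) {
        String v = hot.get(key);
        if (v == null) {                 // miss: fall back to the cold tier
            v = coldLoader.apply(key);
            if (v != null) hot.put(key, v); // promote into the hot tier
        }
        return v;
    }
}
```

The benefit, as noted above, is that the memory footprint becomes a tunable capacity rather than a hard requirement: the same abstraction works whether the cold tier is local files, another process, or a distributed store.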

1 Answer

0

I would move the cache to another process. Even better, I would use an IIS service with MemoryCache (http://msdn.microsoft.com/en-us/library/system.runtime.caching.memorycache(v=vs.110).aspx) and query that service. I am aware that there will be some overhead, but the throughput should be good enough.

Nick