
I have an array where the index doubles as an identifier for a collection of items, and the content of the array is a group number. The group numbers fall into a finite range 0..N, where N << length_of_the_array, so every value is duplicated a large number of times. Currently I have to use 2 bytes to represent each group number (it can be > 1000 but is < 6500), which, given all the duplication, ends up consuming a lot of memory.

Are there ways to space-optimize this array? The complete array can grow to multiple MB in size. I'd appreciate any pointers toward relevant optimization algorithms/techniques. FYI: the programming language I'm using is C++.
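
For concreteness, a minimal sketch of the layout described above (the names and the element count are illustrative assumptions, not from the question):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical baseline: index = collection id, value = group number.
// Group numbers are < 6500, so they fit in 16 bits (uint16_t = 2 bytes).
// With e.g. 2 million entries this is ~4 MB.
std::vector<std::uint16_t> group_of(2'000'000);

std::uint16_t group_for(std::size_t id) {
    return group_of[id]; // O(1) random access, 2 bytes per entry
}
```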

broun
    The presence of duplicate values doesn't automatically mean the data can be compressed. After all, there are a couple billion ones and zeros in your computer's memory, but we can't save any space by keeping a canonical 1 and 0 and having each bit point to the right value or anything like that. While we could save space here by using 13 bits per value instead of 16, going to 12 bits/datum or lower would require exploiting some other sort of structure to the data. – user2357112 Oct 20 '15 at 00:29
  • Hmm... any pointers to data structures that exploit the pattern here? – broun Oct 20 '15 at 04:28
  • The most effective techniques would require knowledge of exactly what kind of patterns are in the data, but generic strategies like Huffman coding, run-length encoding, and other compression techniques could save space at the expense of access time. – user2357112 Oct 20 '15 at 05:21

1 Answer


Do you still want efficient random-access to arbitrary elements? Or are you thinking about space-efficient serialization of the index->group map?

If you still want efficient random access, a single array lookup is not bad: at worst a single cache miss. (Well, really at worst a page fault, or more likely a TLB miss, but those are unlikely if the array is only a couple of MB.)

A sorted and run-length encoded list could be binary-searched (by searching an array of prefix-sums of the repeat-counts), but that only works if you can occasionally sort the list to keep duplicates together.
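As a rough sketch, the binary search over prefix-sums might look like this (assuming equal groups are stored contiguously; the struct and member names are made up for illustration):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Run-length encoded map: run i covers indices [starts[i], starts[i+1])
// and every index in that range shares group[i]. Requires starts[0] == 0
// and starts to be sorted, i.e. duplicates kept together.
struct RleGroups {
    std::vector<std::size_t>   starts; // prefix sums of the repeat-counts
    std::vector<std::uint16_t> group;  // group number of each run

    std::uint16_t lookup(std::size_t index) const {
        // upper_bound finds the first run that starts *after* `index`;
        // the run containing `index` is the one just before it. O(log runs).
        auto it = std::upper_bound(starts.begin(), starts.end(), index);
        return group[(it - starts.begin()) - 1];
    }
};
```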

If the duplicates can't be at least somewhat grouped together, there's not much you can do that allows random access.

Packed 12-bit entries are probably not worth the trouble, unless the size reduction is enough to significantly reduce cache misses. A couple of multiply instructions to generate the right address, plus a shift and mask on the 16-bit load containing the desired value, are not much overhead compared to a cache miss. Write access to packed bitfields is slower, and isn't atomic, so that's a serious downside. Getting a compiler to pack bitfields using structs can be compiler-specific. Maybe just using a char array would be best.
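
For illustration, a 12-bit packed read/write along those lines might look like the following (a sketch, not a drop-in implementation; it assumes a little-endian machine and a byte of slack at the end of the buffer so the final 16-bit load stays in bounds):

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Two 12-bit entries share every 3 bytes. Entry i starts at absolute
// bit offset i*12, so its containing byte is (i*12)/8 = i*3/2 and its
// bit offset within that byte is (i & 1) * 4.
std::uint16_t get12(const std::vector<std::uint8_t>& buf, std::size_t i) {
    std::size_t byte  = (i * 12) / 8;  // multiply + shift for the address
    unsigned    shift = (i & 1) * 4;   // odd entries start 4 bits in
    std::uint16_t word;
    std::memcpy(&word, &buf[byte], 2); // unaligned 16-bit load (little-endian)
    return (word >> shift) & 0x0FFF;   // shift and mask out the 12-bit value
}

void set12(std::vector<std::uint8_t>& buf, std::size_t i, std::uint16_t v) {
    std::size_t byte  = (i * 12) / 8;
    unsigned    shift = (i & 1) * 4;
    std::uint16_t word;
    std::memcpy(&word, &buf[byte], 2);
    // Read-modify-write of the shared bytes: this is why packed writes
    // are slower and not atomic.
    word = (word & ~(std::uint16_t(0x0FFF) << shift)) | ((v & 0x0FFF) << shift);
    std::memcpy(&buf[byte], &word, 2);
}
```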

Peter Cordes