
I am trying to efficiently compress sets of numbers which look like this (one set per line):

19 20 23 24 27 29 32 35 69 97 99 119 122 129 132 134 136 137 139 141 147 148 152 157 158 160 170 173 174 175 176 178 179 182 183 185 186 188 189 190 192 194 195 197 198 199 200 201 202 203 204 205
19 20 23 24 27 29 32 35 69 97 99 119 122 129 132 134 136 137 139 141 147 148 152 157 158 160 170 173 174 175 176 178 179 182 183 185 186 188 189 190 192 194 195 197 198 199 200 201 202 203 204 205 206
19 20 23 24 27 29 32 35 69 97 99 119 122 129 132 134 136 137 139 141 147 148 152 157 158 160 170 173 174 175 176 178 182 183 185 186 188 189 190 192 194 195 197 198 199 200 201 202 203 204 205 206 45387
19 20 23 24 27 29 32 35 69 97 99 119 122 129 132 134 136 137 139 141 147 148 152 157 158 160 170 173 174 175 176 178 182 183 185 186 188 189 190 192 194 195 197 198 199 200 201 202 203 204 205 206 45387 45392
19 20 23 24 27 29 32 35 69 97 99 119 122 129 132 134 136 137 139 141 147 148 152 157 158 160 170 173 174 175 176 178 182 183 185 186 188 189 190 192 194 195 197 198 199 200 201 202 203 204 205 206 45387
19 20 23 24 27 29 32 35 69 97 99 119 122 129 132 134 136 137 139 141 147 148 152 157 158 160 170 173 174 175 176 178 182 183 185 186 188 189 190 192 194 195 197 198 199 200 201 202 203 204 205 206 45392
19 20 23 24 27 29 32 35 69 97 99 119 122 129 132 134 136 137 139 141 147 148 152 157 158 160 170 173 174 175 176 178 182 183 185 186 188 189 190 192 194 195 197 198 199 200 201 202 203 204 205 206 45392 144554
19 20 23 24 27 29 32 35 69 97 99 119 122 129 130 134 136 137 141 147 148 152 157 158 160 170 173 174 175 176 178 182 183 185 186 188 189 190 192 194 195 197 198 199 200 201 202 203 204 205
19 20 23 24 27 29 32 35 69 97 99 119 122 129 134 136 137 139 141 147 148 152 157 158 160 170 173 174 175 176 178 182 183 185 186 188 189 190 192 194 195 197 198 199 200 201 202 203 204 205 45392 144554
19 20 23 24 27 29 32 35 69 97 99 119 122 129 132 134 136 137 139 141 147 148 152 157 158 160 170 173 174 175 176 178 182 183 185 186 188 189 190 192 194 195 197 198 199 200 201 202 203 204 205
19 20 23 24 27 29 32 35 69 97 99 119 122 129 132 134 136 137 139 141 147 148 152 157 158 160 170 173 174 175 176 178 182 183 185 186 188 189 190 192 194 195 197 198 199 200 201 202 203 204 205 45392 144554

You could easily have ~10K sets, each with ~10K entries. However, as you can see from the sample data, most of the data is redundant: each new set differs from the previous one by a few additions and removals. (Occasionally there's a big change, but this is rare.)

I'd like to compress this to:

  • take up a small amount of storage
  • use minimal CPU when decompressing (with random access)
  • ideally compress incrementally (but compacting it in retrospect could work too).

To achieve minimal CPU when expanding, I'm trying to build each set out of a collection of common subsets, i.e. factoring out the most frequently recurring subsets of data, one level deep (no recursion).

To determine the common subsets to factor out, I've tried considering the sets line by line and looking at which items are added and which are removed. The additions are treated as new subsets, and as these build up over time, equal-sized subsets are coalesced pairwise into larger subsets. For instance, for the simple case of the Nth set being the integers 0 to N, you get:

({0}),
({0, 1}),
({0, 1}),({2}),
({0, 1, 2, 3}),
({0, 1, 2, 3}),({4}),
({0, 1, 2, 3}),({4, 5}),
({0, 1, 2, 3}),({4, 5}),({6}),
({0, 1, 2, 3, 4, 5, 6, 7}),
({0, 1, 2, 3, 4, 5, 6, 7}),({8}),
({0, 1, 2, 3, 4, 5, 6, 7}),({8, 9}),
({0, 1, 2, 3, 4, 5, 6, 7}),({8, 9}),({10}),
({0, 1, 2, 3, 4, 5, 6, 7}),({8, 9, 10, 11}),
({0, 1, 2, 3, 4, 5, 6, 7}),({8, 9, 10, 11}),({12}),
({0, 1, 2, 3, 4, 5, 6, 7}),({8, 9, 10, 11}),({12, 13}),
({0, 1, 2, 3, 4, 5, 6, 7}),({8, 9, 10, 11}),({12, 13}),({14}),
({0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15}),

Then, if you keep track of the 'parent' components of each subset, when an item is removed, you can explode the given subset apart into its components (which subsequently will get coalesced again as time moves on). For instance, removing item 4 would produce something like:

({0, 1, 2, 3}),({5, 6, 7}),({8, 9, 10, 11}),({12, 13}),({14}),

...which would then coalesce to...

({0, 1, 2, 3, 8, 9, 10, 11}),({5, 6, 7}),({12, 13}),({14}),
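
For reference, here's a rough sketch of that bookkeeping in Python. The names (`coalesce`, `explode`, `parents`) are mine, components are frozensets so they can key the parents map, and the choice of which equal-sized pair to merge first is arbitrary:

from collections import defaultdict

parents = {}  # merged subset -> the pair of subsets it was coalesced from


def coalesce(components):
    # Repeatedly merge pairs of equal-sized components, recording parentage.
    merged_any = True
    while merged_any:
        merged_any = False
        by_size = defaultdict(list)
        for c in components:
            by_size[len(c)].append(c)
        for group in by_size.values():
            if len(group) >= 2:
                a, b = group[0], group[1]
                merged = a | b
                parents[merged] = (a, b)
                components = [c for c in components if c is not a and c is not b]
                components.append(merged)
                merged_any = True
                break
    return components


def explode(components, item):
    # Split the component containing `item` into its recorded parents (one
    # level deep), then drop `item` from whichever piece holds it.
    out = []
    for c in components:
        if item not in c:
            out.append(c)
            continue
        for piece in parents.get(c, (c,)):
            if item in piece:
                piece = piece - {item}
            if piece:
                out.append(piece)
    return out


# Per line: additions become a new component, removals explode and re-coalesce:
#   components = coalesce(components + [frozenset(added_items)])
#   for item in removed_items:
#       components = coalesce(explode(components, item))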

Empirically this works fairly well (roughly 5x improvement in disk space), but I'm worried that I'm missing a more obvious way to spot which subsets can be factored out most efficiently across the overall dataset.

I've also tried building a prefix trie to track which prefixes recur the most, and then factoring these out - except this uses quite a lot of storage, and doesn't help compress subsets which aren't prefixes. It also doesn't exploit the fact that the sets are unordered.

I've also tried looking at Signature Trees (https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.6.7315&rep=rep1&type=pdf), but these seem to use a huge amount of disk storage when your datasets are large and not that sparse.

I could also do an O(N^2) search to compare the intersection of each set with every other, and track a histogram of which subsets recur the most, but O(N^2) will be painful for large datasets, and it's not obvious how to tune out the noise when comparing the intersections to spot the underlying common subsets.

TL;DR: what's the best way to spot structural similarity across a large number of sets in order to factor out recurring subsets?

Edit: have clarified that random access is needed when decompressing. Also, I've published a real dataset to http://matrix.org/~matthew/expanded.out.xz. Warning: this 2MB .xz expands to 4.9GB of actual data... which illustrates the problem quite well, and why it's frustrating that I haven't found an approach which does better than 5x compression so far :/

Arathorn
  • Is the order of the sets (i.e., the order of the lines, not the order within the set because that's a given with sets) relevant? – David Eisenstat Apr 25 '21 at 19:32
  • The lines are ordered (they're effectively showing how the set evolves over time), and so there's some correlation between adjacent lines. (This is why my current best approach tries to model the additions and removals on a line by line basis, tracking the additions as reusable subsets which can serve as components in future lines). – Arathorn Apr 25 '21 at 20:51
  • btw, LZMA manages a ~2000x improvement in disk space on the actual dataset - hence looking for something simpler (especially for decompression speed) but better than the current 5x solution. – Arathorn Apr 25 '21 at 20:54
  • Do you need random access decompression? – David Eisenstat Apr 25 '21 at 20:57
  • yes. the thing needs to optimise for rapid random access decompression (sorry for not making that clear in the original post). – Arathorn Apr 25 '21 at 21:01
  • What is this first number and what does the bar notation mean: `1207976 | {229,230,...`? – Mark Adler Apr 25 '21 at 22:24
  • the first number is just a row ID (not referenced elsewhere); the pipe symbol is just a column separator. – Arathorn Apr 26 '21 at 12:37

1 Answer


We can combine three simple ideas:

  1. Encode the symmetric differences between successive sets (I think this is what Mark is suggesting).

  2. This is good for encoding but hard to decode randomly. To fix it, emit the whole set periodically. One heuristic is to do this whenever we've emitted approximately as much in deltas as the whole set -- in theory, this costs only a constant factor more in storage while limiting the total size of the deltas we scan to a constant factor more than the size of the set.

  3. Use a delta encoding with varints. This is a common encoding for posting lists, so there should be optimized implementations floating around.
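
To make point 3 concrete, here's what the gap step does to the start of the first sample row from the question (a toy illustration only; the encoder below does the same thing and then packs each gap as a varint):

row = [19, 20, 23, 24, 27, 29, 32, 35, 69, 97, 99, 119]
gaps = [b - a for a, b in zip([0] + row, row)]
print(gaps)  # [19, 1, 3, 1, 3, 2, 3, 3, 34, 28, 2, 20] - every gap fits in a single varint byte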

Python 3 encoder that compresses the given input to less than 5 MB. We also need an index, but this won't add much.

import fileinput
import re
import sys

output = open("output", "wb")  # varint-encoded records, one per input row


def emit_varint(n):
    # Little-endian base-128 varint: 7 bits per byte, high bit set on every
    # byte except the last.
    buffer = []
    mask = 127
    while n > mask:
        buffer.append(128 | (n & mask))
        n >>= 7
    buffer.append(n)
    output.write(bytes(buffer))


def emit_indices(delta):
    # A length prefix, then gap (delta) encoding of the sorted members.
    emit_varint(len(delta))
    prev = 0
    for x in sorted(delta):
        emit_varint(x - prev)
        prev = x


delta_counter = 0  # entries emitted in deltas since the last full-set checkpoint
delta_from = 0  # row number of the most recent checkpoint (for building the index)
previous_indices = set()
for i, line in enumerate(fileinput.input()):
    if i % 1000 == 0:
        print(i, file=sys.stderr)
    # The real dataset has rows like "1207976 | {229,230,...}".
    m = re.match(r"[^{}]*\{(\d+(,\d+)*)\}", line)
    if not m:
        continue
    indices = set(map(int, re.findall(r"\d+", m.group(1))))
    delta = indices ^ previous_indices  # symmetric difference with the previous row
    delta_counter += len(delta)
    if delta_counter + len(delta) > 2 * len(indices):
        # The deltas have grown to roughly the size of the set: emit a checkpoint.
        emit_indices(indices)
        delta_counter = 0
        delta_from = i
    else:
        emit_indices(delta)
    previous_indices = indices
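
And a matching decoder sketch (not part of the original code). It assumes a hypothetical index: `offsets[k]` is the byte offset of record k in the output buffer, and `checkpoints` is the sorted list of record numbers at which a full set rather than a delta was emitted, starting with record 0 (the first record is a delta against the empty set, i.e. already the full set). The encoder above would need to write both out alongside the data.

import bisect


def read_varint(buf, pos):
    # Inverse of emit_varint: little-endian 7-bit groups, high bit means "more".
    n, shift = 0, 0
    while True:
        b = buf[pos]
        pos += 1
        n |= (b & 127) << shift
        if b < 128:
            return n, pos
        shift += 7


def read_record(buf, pos):
    # Inverse of emit_indices: a length prefix, then gap-encoded sorted members.
    count, pos = read_varint(buf, pos)
    members, prev = set(), 0
    for _ in range(count):
        gap, pos = read_varint(buf, pos)
        prev += gap
        members.add(prev)
    return members


def decode_record(buf, offsets, checkpoints, i):
    # Random access: start from the latest full-set checkpoint at or before
    # record i, then fold in each subsequent symmetric-difference delta.
    j = checkpoints[bisect.bisect_right(checkpoints, i) - 1]
    current = read_record(buf, offsets[j])
    for k in range(j + 1, i + 1):
        current ^= read_record(buf, offsets[k])
    return current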
David Eisenstat
  • ooh, thank you for this; I had been avoiding approaches which apply lots of deltas (even with occasional snapshots) to minimise the amount of processing on decompression, but given how well this compresses, it sounds like the processing burden may be tiny because so little data needs to be processed. – Arathorn Apr 26 '21 at 12:47
  • I realised another possible approach: using a prefix trie which gives paths by which the set evolves, tracking frequencies per node for recurring substrings, and splitting nodes when items are removed from the set. Will try comparing with this... – Arathorn Apr 26 '21 at 12:54
  • @Arathorn I realized later that it would be simpler to just have a buffer of varint-encoded deltas and use the index to track the start and end byte of each set (= the numbers that appear an odd number of times after delta decoding). – David Eisenstat Apr 26 '21 at 12:59
  • No I wasn't suggesting symmetric differences. This is a much better idea for this data. – Mark Adler Apr 27 '21 at 08:16
  • I haven't had a chance to try my signature-trie idea yet, plus am sure David's solution gets the best spatial compression and may well be faster for decompression too given the small data footprint, even with more processing steps. So am accepting this one - thanks! – Arathorn Apr 30 '21 at 12:20