
Suppose we have an algorithm that receives a very long stream of keys. For each key, as we process it, the algorithm generates a value between 0 and 1 that we want to retrieve later. The input set is large enough that we cannot afford to store one value per key, and the value-generating rule is independent across keys.

Now, assume that we can tolerate error in later lookups, but we still want to minimize the difference between the retrieved and original values (i.e., on average over many random retrievals).

For example, if the original value for a given key was 0.008, retrieving 0.06 is much better than retrieving 0.6.

What data structures or algorithms can we use to address this problem?

Bloom filters are the closest data structure I can think of. One could quantize the output range, use a Bloom filter for each bucket, and somehow combine their outputs at retrieval time to estimate the most likely value (a rough sketch of this idea follows below). Before I proceed down this path and reinvent the wheel, are there any known data structures, algorithms, or theoretical or practical approaches that address this problem?

I am ideally looking for a solution that can parameterize the tradeoff between space and error rates.
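
For concreteness, here is a rough Python sketch of the bucketed idea (the bucket count, filter size, hash choice, and the naive "first matching bucket" rule are all arbitrary illustrative choices on my part, not a worked-out design):

import hashlib

NUM_BUCKETS = 16        # quantize [0, 1] into 16 buckets
FILTER_BITS = 1 << 16   # bits per per-bucket Bloom filter
NUM_HASHES = 3

# one Bloom filter (here just a bit list) per value bucket
filters = [[False] * FILTER_BITS for _ in range(NUM_BUCKETS)]

def _bit_positions(key):
    # derive NUM_HASHES bit positions for a key from a single digest
    digest = hashlib.sha256(str(key).encode()).digest()
    return [int.from_bytes(digest[4 * i:4 * i + 4], "big") % FILTER_BITS
            for i in range(NUM_HASHES)]

def put(key, value):
    bucket = min(int(value * NUM_BUCKETS), NUM_BUCKETS - 1)
    for pos in _bit_positions(key):
        filters[bucket][pos] = True

def get(key):
    # naive combination: return the midpoint of the first bucket whose
    # filter claims to contain the key; false positives blur the answer
    matches = [b for b in range(NUM_BUCKETS)
               if all(filters[b][pos] for pos in _bit_positions(key))]
    return (matches[0] + 0.5) / NUM_BUCKETS if matches else None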

Amelio Vazquez-Reina
  • Can we do range partitioning and write a hash function that maps every number to a specific range? The values within a range could be controlled based on the error factor. – Ankur Shanbhag Nov 12 '15 at 21:10

1 Answer


Perhaps a variant of the Bloom filter called a compact approximator: like a Bloom filter, but generalized so that the entries are values from a lattice. Here the lattice is just floats between 0 and 1 (it has more structure than just being a lattice, but it satisfies the requirements), or however you're storing those numbers.

An update replaces each relevant entry with the max of that entry and the value being remembered; a query computes the minimum of all its relevant entries (examples below). The result can only overestimate the true value. By reversing the ordering (swapping min and max, and initializing to 1 instead of 0) you can get an underestimate; together they give an interval that contains the true value.


So, for example, using the first approximator (the overestimating one), putting in a number looks like this:

index1 = hash1(key)
data[index1] = max(data[index1], value)
index2 = hash2(key)
data[index2] = max(data[index2], value)
... etc

And getting the overestimation looks like:

result = 1
index1 = hash1(key)
result = min(data[index1], result)
index2 = hash2(key)
result = min(data[index2], result)
... etc
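
For completeness, here is a self-contained Python sketch that keeps both an overestimating and an underestimating table and returns an interval, following the scheme above (the class name, SHA-256-based hashing, and default parameters are my own illustrative choices):

import hashlib

class CompactApproximator:
    """m slots per table, d hash functions, values assumed to lie in [0, 1]."""

    def __init__(self, m=4096, d=2):
        self.m, self.d = m, d
        self.over = [0.0] * m   # overestimator: initialized to the lattice bottom
        self.under = [1.0] * m  # underestimator: initialized to the lattice top

    def _indexes(self, key):
        # derive d slot indexes from one digest; any d independent hashes would do
        digest = hashlib.sha256(str(key).encode()).digest()
        return [int.from_bytes(digest[4 * i:4 * i + 4], "big") % self.m
                for i in range(self.d)]

    def put(self, key, value):
        for i in self._indexes(key):
            self.over[i] = max(self.over[i], value)    # update with max
            self.under[i] = min(self.under[i], value)  # reversed ordering: min

    def get(self, key):
        idx = self._indexes(key)
        hi = min(self.over[i] for i in idx)   # can only overestimate
        lo = max(self.under[i] for i in idx)  # can only underestimate
        return lo, hi                         # interval containing the true value

Usage would look something like:

ca = CompactApproximator(m=4096, d=2)
ca.put("some-key", 0.008)
lo, hi = ca.get("some-key")   # e.g. estimate the value as (lo + hi) / 2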
harold
  • Beat me to it. Well played. – Louis Wasserman Nov 12 '15 at 21:33
  • Thanks @harold. Very helpful. I think an example for number retrieval would just make this perfect. Would you mind perhaps adding one? – Amelio Vazquez-Reina Nov 13 '15 at 00:17
  • Thanks! Reading the original paper it looks like one can use d-independent hash functions. (i.e. one uses "a d-dimensional, m-bucket compact approximator") Does `d` have to be = 2 in our case? What is the relationship? – Amelio Vazquez-Reina Nov 13 '15 at 01:49
  • @AmelioVazquez-Reina it doesn't have to be 2, the optimal number depends on the size of the table and the number and distribution of items placed into it. That paper doesn't address error in the sense that we have here so it may turn out a bit different, I'll investigate a bit – harold Nov 13 '15 at 10:05
  • @AmelioVazquez-Reina it turns out this is really terrible if we want to "overstuff" the table with more values than entries, no matter the d it very quickly reaches an average error worse than if it just guessed 0.5 for everything, at least if the values are uniform in [0..1]. Actually higher d was even worse, with d=1 being the best. So it turns out this is not a proper use case for compact approximators after all.. – harold Nov 13 '15 at 11:07
  • Thanks harold that makes sense. I'd be curious to know what you think @LouisWasserman – Amelio Vazquez-Reina Nov 13 '15 at 15:21