
I need to solve a count-distinct problem in Redis without using HyperLogLog (because of its known standard error of 0.81%).

I receive requests, each carrying a list of objects [O1, O2, ... On] for a specific key A. For each list received, Redis should store the objects that have not been saved yet and return the number of newly saved objects.

For Example:

  • Request 1: Key: A - Objects: [O1, O2, O3] -> Response 1: Number of new objects: 3
  • Request 2: Key: A - Objects: [O1, O2, O4] -> Response 2: Number of new objects: 1
  • Request 3: Key: A - Objects: [O1, O2, O4] -> Response 3: Number of new objects: 0

I tried solving this with HyperLogLog and it works, but as the dataset of objects grows, the reported number of new objects becomes less accurate. With sets and hashes, memory usage grows too much.
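
For reference, this is roughly the set-based approach I tried (a minimal sketch with redis-py; SADD returns how many of the given members were actually new, which is exactly the count I need):

import redis

r = redis.Redis()

def count_new(key, objects):
    # SADD returns the number of members that were newly added to the set.
    return r.sadd(key, *objects)

print(count_new("A", ["O1", "O2", "O3"]))  # -> 3
print(count_new("A", ["O1", "O2", "O4"]))  # -> 1
print(count_new("A", ["O1", "O2", "O4"]))  # -> 0

This gives exact counts, but the set has to store every object, which is where the memory problem comes from.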

I have read a bit about bitmaps, but it's not clear to me how to apply them here. Do you have any references to projects that have already tackled this problem?

Thanks in advance

lordav
  • Which Redis data command are you currently using? Is there a problem with SADD? – Frank Yellin Jan 28 '22 at 07:22
  • @FrankYellin a set is not recommended when the dataset is too big. I tried it, but memory grows too much once the number of objects is around 100k. – lordav Jan 28 '22 at 07:50
  • Any suggestions? – lordav Jan 28 '22 at 14:39
  • I don't think you have a choice. Either you keep track of every element you've seen (SADD), or you use a probabilistic scheme (PFADD) that doesn't guarantee the right answer, which you've also rejected. – Frank Yellin Jan 29 '22 at 01:27

1 Answer


You might want to consider using a Bloom filter. This is available as a Redis module: https://redis.com/redis-best-practices/bloom-filter-pattern/

Bloom filters allow quick membership tests with no false negatives and a very low false-positive rate, provided you know in advance the maximum number of elements. You would need to write code along these lines:

import redis

r = redis.Redis()

# BF.ADD returns 1 when the item was not already present in the filter,
# so a separate BF.EXISTS check (and its race window) is unnecessary.
if r.execute_command("BF.ADD", key, item) == 1:
    r.incr(key_count)  # plain Redis counter of distinct items seen
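
If you need to tune the false-positive rate, you can reserve the filter up front with an explicit error rate and capacity. A minimal sketch, assuming the RedisBloom module and redis-py (the key name A:filter is just an example):

import redis

r = redis.Redis()

# BF.RESERVE takes the error rate first, then the capacity: here a filter
# sized for 100,000 items with a one-in-a-million false-positive rate.
r.execute_command("BF.RESERVE", "A:filter", 0.000001, 100000)

# BF.MADD returns 1 for each item that was newly added and 0 otherwise,
# so summing the reply gives the number of new objects in one round trip.
new_count = sum(r.execute_command("BF.MADD", "A:filter", "O1", "O2", "O3"))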
Frank Yellin
  • Hi @FrankYellin, thanks for your answer. As I said before, I need a way to count distinct items without errors. Bloom filters still have a chance of error, however low. Maybe I've found a solution with bitmaps – lordav Jan 30 '22 at 08:29
  • I'm not sure how bitmaps are going to help you. You still have the problem of two strings hashing to the same value (or hashing to the same value modulo the size of the bitmap). You either have to remember every string or accept the occasional collision. Bloom filters do the best job of making collisions extremely unlikely. – Frank Yellin Jan 30 '22 at 19:08
  • Just FYI. I used a Bloom Filter to add 1,000,000 items, and requested an error rate of 0% (which is really 1 in a trillion). All million items were added without it ever losing count. – Frank Yellin Jan 31 '22 at 00:33
  • Thanks @FrankYellin. I'm trying to add 22,000, 180,000, and 36,000 objects for different keys but am still seeing errors (even though the error rate is lower). – lordav Jan 31 '22 at 16:09
  • This is the syntax I'm using with the Bloom filter: "BF.INSERT key CAPACITY 10000 ERROR 0.0001 ITEMS x x1 ..." How are you tuning the optional parameters? – lordav Jan 31 '22 at 16:23
  • And note that you can set the error rate *very* *very* low, like 1e-10. – Frank Yellin Jan 31 '22 at 17:40
  • Thanks! This is a great solution – lordav Feb 01 '22 at 14:21