Redis HLL: too many false positives

Question

HyperLogLog is a probabilistic algorithm. According to the Redis HLL documentation, the error should be about 0.81%, but I am seeing errors of 17-20%.

I think something is wrong. This is my simple Perl test script. Is there an error in it?

#!/usr/bin/perl -w
use Redis;
my $redis = Redis->new(server=>'192.168.50.166:6379') or die;
my $fp=0;
my $HLL="HLL";

$redis->del($HLL);
foreach my $i (1..10000) {
  my $s1 = $redis->pfadd($HLL,$i);
  if($s1 == 0){ 
    print "False positive on $i\n";
    $fp++;
  }
}
print "count of false positives $fp\n";

Isn't hyperloglog about counting unique things, and you're counting the same thing over and over? — Sobrique, Mar 21 '17 at 10:30

score 5 · Accepted Answer · answered Mar 21 '17 at 12:05

HyperLogLog is used for counting unique items. It can count a large number of items using very little memory. However, the returned cardinality is NOT exact; it is an approximation with a known standard error.

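0.81% is the standard error of the estimate, NOT a false-positive rate. In your case, you can call PFCOUNT HLL to get the approximate number of unique items you put into the HyperLogLog. The returned number should typically fall within about [10000 * (1 - 0.81%), 10000 * (1 + 0.81%)], that is, roughly 9919 to 10081. Because 0.81% is a standard error rather than a hard bound, an estimate can occasionally land somewhat outside that range.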

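PFADD returns 1 if at least one internal HyperLogLog register was altered by the command (that is, if the approximated cardinality changed), and 0 otherwise. A 0 for a new item only means that this item did not alter any register, which is expected for many items. It has nothing to do with false positives.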
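A corrected version of the test from the question could look roughly like the sketch below. It ignores the return value of PFADD and instead compares PFCOUNT with the known number of distinct items. The sub name `hll_relative_error` is only illustrative (it is not part of the Redis module), and `$redis` is assumed to be a connected `Redis` client like the one created in the question.

```perl
use strict;
use warnings;

# Hypothetical helper: add the integers 1..$n to a HyperLogLog and return
# (estimate, relative error). $redis is assumed to be a connected Redis
# client such as the one created in the question.
sub hll_relative_error {
    my ($redis, $key, $n) = @_;

    $redis->del($key);

    # The return value of PFADD is deliberately ignored: 0 only means that
    # no internal register (and hence the estimate) changed.
    $redis->pfadd($key, $_) for 1 .. $n;

    my $estimate = $redis->pfcount($key);
    my $error    = abs($estimate - $n) / $n;
    return ($estimate, $error);
}
```

Called as `my ($est, $err) = hll_relative_error($redis, 'HLL', 10000);`, one would expect `$err` to be on the order of 0.01 or less, not the 17-20% obtained by counting the zeros returned by PFADD.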

It seems that what you need is a Bloom filter, which can tell you whether an item already exists in a data set, with some probability of false positives. You can, of course, implement a Bloom filter on top of Redis, and there should be some open source projects that do so.
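As a rough illustration of the idea, a Bloom filter can be kept in a single Redis string using the `SETBIT` and `GETBIT` commands. The sketch below is a minimal, unoptimized version: the sub names, the bit-array size and the number of hashes are all assumptions chosen for the example, and `$redis` is again assumed to be a connected `Redis` client. Items are assumed to be plain byte strings.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5);

# Illustrative parameters, not tuned for any particular workload.
my $BITS   = 1 << 20;   # size of the bit array
my $HASHES = 4;         # number of bit positions per item

# Derive $HASHES bit offsets from one MD5 digest (four 32-bit words).
sub bloom_offsets {
    my ($item) = @_;
    my @words = unpack('N4', md5($item));
    return map { $words[$_] % $BITS } 0 .. $HASHES - 1;
}

# Set every bit that belongs to $item.
sub bloom_add {
    my ($redis, $key, $item) = @_;
    $redis->setbit($key, $_, 1) for bloom_offsets($item);
    return;
}

# Returns 1 if $item may be present (false positives are possible),
# 0 if it is definitely absent.
sub bloom_might_contain {
    my ($redis, $key, $item) = @_;
    for my $offset (bloom_offsets($item)) {
        return 0 unless $redis->getbit($key, $offset);
    }
    return 1;
}
```

Each call here costs one round trip per bit and is not atomic. Batching the bit operations, for instance in a server-side script, would be the natural next step.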

answered Mar 21 '17 at 12:05

for_stack

Yes, I need a Bloom Filter, but as a server. Because I have multiple – Ram Mar 22 '17 at 06:05
Yes, I need a Bloom filter, but as a server, because I have multiple applications on different servers. All of them need to check whether a particular element exists in a set. This has to be very efficient, and the sets will be persisted for 2-3 months. Is there a ready-made Bloom filter server I can use? – Ram Mar 22 '17 at 06:12
@Ram As I mentioned, you can use Redis as the backend server to implement the Bloom filter. You can search for open source projects. Also, it's NOT hard to implement one yourself with Lua scripting and the `GETBIT` and `SETBIT` commands. – for_stack Mar 22 '17 at 08:35
I have been looking at the implementations from Redis Labs; cuckoo filters seem to be better than Bloom filters for existence checks. My question: is using in-memory filters a good idea for persistent data storage? The application requires that if a particular user has been sent a communication, they must not be sent the same category again for 90 days. – Ram Mar 22 '17 at 15:52
@Ram Redis supports persistence. You can use `RDB`, `AOF` or both to write data to disk, although you might lose a little data if Redis shuts down unexpectedly. I think you can tolerate some error, since you're using a Bloom filter or cuckoo filter, so that shouldn't be a problem. Check the documentation for details on Redis persistence. – for_stack Mar 22 '17 at 16:20
What is the meaning of 0.81%? I don't think it works like this: [10000 * (1 - 0.81%), 10000 * (1 + 0.81%)]. This is my question: https://stackoverflow.com/questions/71523837/how-to-understand-that-the-standard-error-of-redis-hyperloglog-is-0-81. I'm confused now. I would appreciate it if you could give me an answer. – ming Mar 18 '22 at 07:58
