0

I'm trying to use bloom/cuckoo filter to check if i already stored a tuple of 2 numbers f.e. (7,25), (47,1576), ... etc

In [12]: from cuckoo.filter import ScalableCuckooFilter                                                                                                                      

In [13]: c = ScalableCuckooFilter(initial_capacity=1000000, error_rate=0.0001)                                                                                               

In [14]: c.filters[0].buckets.buffer_info()                                                                                                                                  
Out[14]: (140042688086032, 8500000, 'big', 0, 8500000, 0, 0, 0)

In [15]: 8500000/(1024*1024)                                                                                                                                                 
Out[15]: 8.106 MB

to store the numbers i would need 2bytes per number, so 1mln * 4bytes = 3.8 Mb

So the filter is bigger than the data !

Is there a way or structure that is suitable for numbers with lower memory footprint ?

sten
  • 7,028
  • 9
  • 41
  • 63

1 Answers1

2

There is a lower bound on the size of approximate membership queries like cuckoo filters and Bloom filters. It is -log2(epsilon), which for you is 13.3 bits. You can get close to this by using bumped ribbon filters.

jbapple
  • 3,297
  • 1
  • 24
  • 38