0

There's the HyperLogLog algorithm, but it is quite complex.

Is there any simpler space efficient approach that could be expressed in couple of lines of code?

Juan Lopes
  • 10,143
  • 2
  • 25
  • 44
Alex Craft
  • 13,598
  • 11
  • 69
  • 133

1 Answers1

2

HyperLogLog itself is quite simple, once you understand it (and already have a hash function).

I have implemented a didactic version in Python with 5 lines for element insertion and another 6 lines for estimation:

from mmhash import mmhash
from math import log
from base64 import b64encode

class HyperLogLog:
    def __init__(self, log2m):
        self.log2m = log2m
        self.m = 1 << log2m
        self.data = [0]*self.m
        self.alphaMM = (0.7213 / (1 + 1.079 / self.m)) * self.m * self.m

    def offer(self, o):
        x = mmhash(str(o), 0)
        a, b = 32-self.log2m, self.log2m
        i = x >> a
        v = self._bitscan(x << b, a)
        self.data[i] = max(self.data[i], v)

    def count(self):
        estimate = self.alphaMM / sum([2**-v for v in self.data])
        if estimate <= 2.5 * self.m:
            zeros = float(self.data.count(0))
            return round(-self.m * log(zeros / self.m))
        else:
            return round(estimate)

    def _bitscan(self, x, m):
        v = 1
        while v<=m and not x&0x80000000:
            v+=1
            x<<=1
        return v

The full implementation can be found here:

https://github.com/juanplopes/hyperloglog.py/blob/master/hyperloglog.py

Juan Lopes
  • 10,143
  • 2
  • 25
  • 44