Why is 1 added to the leading zero count in the HyperLogLog algorithm?

Question

If the bit pattern of a hash has k leading zeros, why is the estimated size taken to be 2^(k+1)? Shouldn't it be 2^k? The probability of seeing k leading zeros should be 1/(2^k), and hence the size should be 2^k.

In my code I always get a correct size estimate when I use k+1 instead of k, but I fail to understand the logic behind this.

score 2 · Answer 1 · answered Feb 14 '17 at 10:13

The intuition you're looking for is that the algorithm relies on the probability of seeing the entire bit pattern at the beginning of the hash (k zeros, followed by a 1), not just the zeros.

The more difficult part is getting from there to estimating the cardinality at 2^(k+1). Unfortunately the formal proof of this isn't straightforward. In fact, most of the original paper which introduced the method (Flajolet and Martin, Probabilistic Counting Algorithms for Data Base Applications, http://algo.inria.fr/flajolet/Publications/FlMa85.pdf) is devoted to proving that the estimate computed with it is a good one. Subsequent papers (the LogLog and HyperLogLog papers) have similar proofs for their improved estimates.

Hope that helps!

score 1 · Answer 2 · answered Jul 29 '17 at 15:16

k leading zeros mean that the first k bits are zeros that are followed by a one bit. (Otherwise, we would have more than k leading zero bits.) Therefore, k leading zeros are actually characterized by a bit sequence of length (k+1), for which the probability is 1/2^(k+1).
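As a rough empirical check of that 1/2^(k+1) figure, here is a minimal Python sketch. It assumes uniformly random 32-bit integers stand in for hash values; the helper names `leading_zeros` and `exact_k_frequency` are made up for this illustration and do not come from any HyperLogLog library.

```python
import random

def leading_zeros(value: int, bits: int = 32) -> int:
    """Count the zero bits before the first 1 bit in a `bits`-wide word."""
    return bits - value.bit_length()

def exact_k_frequency(k: int, samples: int = 200_000,
                      bits: int = 32, seed: int = 0) -> float:
    """Fraction of random `bits`-wide values with exactly k leading zeros."""
    rng = random.Random(seed)  # seeded so repeated runs give the same result
    hits = sum(
        1 for _ in range(samples)
        if leading_zeros(rng.getrandbits(bits), bits) == k
    )
    return hits / samples

if __name__ == "__main__":
    for k in range(5):
        print(k, exact_k_frequency(k), 1 / 2 ** (k + 1))
```

The observed frequency for each k should land close to 1/2^(k+1), not 1/2^k, because the pattern being matched is k zeros plus the terminating 1 bit.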

Snives · Answer 3 · 2017-07-28T14:36:42.463

According to probability theory you are correct! You would expect to have made 2^k observations (on average) before having observed a value with k leading zeros.

The reason your estimate is double what it should be might be that your random function (or hashing function) is returning a signed int that is always positive, so a leading zero is always present. This should approximately double your chances of seeing a value with k leading zeros. That is why you would get the correct answer when you use 2^(k+1) instead of 2^k.
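The signed-integer effect described here is easy to check in isolation. The sketch below assumes a hypothetical hash that only ever produces non-negative 31-bit values stored in a 32-bit word; `leading_zeros` and `min_leading_zeros` are illustrative names, not part of any real library. Whether this is what happens in the asker's code is not known from the question.

```python
import random

def leading_zeros(value: int, bits: int = 32) -> int:
    """Count the zero bits before the first 1 bit in a `bits`-wide word."""
    return bits - value.bit_length()

def min_leading_zeros(value_bits: int, word_bits: int = 32,
                      samples: int = 10_000, seed: int = 0) -> int:
    """Smallest leading-zero count seen over random `value_bits`-wide values
    that are interpreted as `word_bits`-wide words."""
    rng = random.Random(seed)
    return min(
        leading_zeros(rng.getrandbits(value_bits), word_bits)
        for _ in range(samples)
    )

if __name__ == "__main__":
    # Full 32-bit values can start with a 1 bit, so the minimum is 0.
    print(min_leading_zeros(32))
    # Non-negative signed values use only 31 bits: the top bit is always 0,
    # so every leading-zero count is shifted up by one.
    print(min_leading_zeros(31))
```

If the minimum leading-zero count over many hashes never drops to 0, the hash is probably not filling the full word, and the counts are inflated by the unused high bits.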
