
There are many HyperLogLog implementations out there, but how do you verify or test one? How do you check its accuracy and its error-bound behavior? Just throwing a few static test cases at it seems very ineffective.

More concretely: if someone changes the random number routine, how do I know that it is not a disastrous choice, and how can I show that with automated, repeatable tests?

Can anyone point me to known good tests on GitHub or elsewhere, ideally with some explanation?

ETOMG

1 Answer


Good question. First, note that while HyperLogLog's theoretical foundation gives some indication of the expected accuracy, it is still critical to test the implementation you are actually using.

Tests should use random datasets (static datasets can be added as well) and should cover a range of set cardinalities. If you already have a test automation framework in place, it is the natural place to guard against regressions, as you suggest. Note, however, that measuring accuracy at large cardinalities can make test runtime prohibitive.
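To make the idea concrete, here is a minimal sketch of such a test in Python. Everything in it is an assumption for illustration: the `HyperLogLog` class is a toy stand-in for whatever implementation you actually want to test (it is not the code from any particular library), and the names `rms_relative_error` and `check_accuracy`, the cardinalities, the trial count and the tolerance factor of 3 are arbitrary choices. The only borrowed fact is the standard error of roughly 1.04/sqrt(m) for m registers from the HyperLogLog paper. Seeding the generator keeps the runs repeatable.

```python
import hashlib
import math
import random


class HyperLogLog:
    """Toy HyperLogLog with 2**p registers and a 64-bit hash (illustration only)."""

    def __init__(self, p=10):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, value):
        digest = hashlib.blake2b(str(value).encode(), digest_size=8).digest()
        h = int.from_bytes(digest, "big")
        index = h >> (64 - self.p)                # top p bits select a register
        rest = h & ((1 << (64 - self.p)) - 1)     # remaining 64 - p bits
        # Rank = position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[index]:
            self.registers[index] = rank

    def estimate(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)     # constant valid for m >= 128
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            return self.m * math.log(self.m / zeros)
        return raw


def rms_relative_error(cardinality, trials=5, p=10, seed=0):
    """RMS relative error over several seeded trials of distinct random values."""
    rng = random.Random(seed)
    squared = 0.0
    for _ in range(trials):
        hll = HyperLogLog(p)
        seen = set()
        while len(seen) < cardinality:
            x = rng.getrandbits(64)
            seen.add(x)
            hll.add(x)
        err = (hll.estimate() - cardinality) / cardinality
        squared += err * err
    return math.sqrt(squared / trials)


def check_accuracy(cardinalities=(100, 1000, 5000, 20000), p=10, tolerance=3.0):
    """Return per-cardinality RMS errors and the bound they are compared against."""
    bound = tolerance * 1.04 / math.sqrt(1 << p)  # tolerance factor is an assumption
    results = {n: rms_relative_error(n, p=p, seed=n) for n in cardinalities}
    return results, bound


if __name__ == "__main__":
    results, bound = check_accuracy()
    for n, err in results.items():
        status = "ok" if err <= bound else "FAIL"
        print(f"n={n}: rms relative error {err:.4f} (bound {bound:.4f}) {status}")
```

Swapping the hash or random routine and rerunning the same check shows whether the error still stays within the expected bound. The cardinalities here are kept small so the test finishes quickly; larger ones are better suited to a separate, slower test suite.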

You can use the implementation below for reference. It includes unit tests that draw large numbers of random values and check the accuracy at fixed intervals.

https://github.com/Microsoft/CardinalityEstimation

OronNavon