
The precision of BigQuery's HLL_COUNT.INIT(...) and HLL_COUNT.MERGE(...) is documented here:

https://cloud.google.com/bigquery/docs/reference/standard-sql/hll_functions

However, I'm wondering if there is a cardinality size below which HLL is guaranteed to be accurate. For example, if I have 10 distinct values and I'm "merging" them together from various partitions, will the result be 100% accurate every time, or will it be subject to the HLL error range? In other words, is there a guaranteed upper bound for each HLL precision level, under which accuracy is 100% guaranteed? If so, how can that bound be determined for each precision level?
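Concretely, the scenario I have in mind looks something like this (a sketch; `my_dataset.t` and the column names are hypothetical):

```sql
-- Build one sketch per partition, then merge them. With only 10
-- distinct values of `val`, is the merged estimate guaranteed
-- to equal exactly 10?
WITH sketches AS (
  SELECT HLL_COUNT.INIT(val) AS sketch
  FROM my_dataset.t
  GROUP BY partition_id
)
SELECT HLL_COUNT.MERGE(sketch) AS estimated_distinct
FROM sketches;
```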

David542
  • check this out: [HyperLogLog in Practice: Algorithmic Engineering of a State of The Art Cardinality Estimation Algorithm](https://ai.google/research/pubs/pub40671) - you will need to download the PDF from there – Mikhail Berlyant May 24 '19 at 18:22
  • @MikhailBerlyant Yes, I started reading that, but I still wasn't sure "where it's exact", if ever – David542 May 24 '19 at 18:56
  • I'm afraid to speculate - but here we go - I doubt there is a positive answer to your question. I provided the link to that document in case you wanted to read more and gain more insight into Google's HLL++. Obviously a great question (+1)! Hope someone from the Google team can shed light on this – Mikhail Berlyant May 24 '19 at 19:22
  • I'm wondering why it would be important to know the exact behavior for this case. – Felipe Hoffa May 24 '19 at 22:40
  • @FelipeHoffa well if it's never actually correct I wouldn't use it (unless there's a memory constraint elsewhere that I cannot get around) – David542 May 24 '19 at 23:27
  • Why would you never use it? You're trading performance for precision. – Felipe Hoffa May 25 '19 at 01:39

1 Answer


> if I have 10 distinct values, and I'm "merging" them together in various areas, will this be 100% accurate every time, or will it be subject to the HLL error range?

For the same input, the output will be the same: the estimate is deterministic, and in practice BigQuery may also serve cached results for an identical query. So if the result was 100% accurate once for a given input, it will be every time. But that is repeatability, not a guarantee of accuracy.

> is there a guaranteed upper bound for each level of HLL Precision, under which there is 100% guaranteed accuracy? If so, how can that be determined for each Precision level?

No. HLL++ provides estimates; there is no 100% accuracy guarantee for any cardinality size or precision level. If you need exact results, use COUNT(DISTINCT ...) instead.
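As an illustration, the two approaches look like this (a sketch against a hypothetical `my_dataset.events` table; the `user_id` and `country` columns are assumptions):

```sql
-- Exact, but must track every distinct value:
SELECT COUNT(DISTINCT user_id) AS exact_count
FROM my_dataset.events;

-- Approximate: build per-group sketches, then merge them.
-- The optional second argument to HLL_COUNT.INIT is the precision
-- (10 to 24, default 15); higher precision lowers the expected
-- error but never makes the estimate exact.
WITH sketches AS (
  SELECT HLL_COUNT.INIT(user_id, 15) AS sketch
  FROM my_dataset.events
  GROUP BY country
)
SELECT HLL_COUNT.MERGE(sketch) AS approx_count
FROM sketches;
```

The trade-off is memory and speed for a bounded expected error, not a bounded worst case.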

Tlaquetzal
    There is only one case where HLL++ is accurate: it can accurately say that there are _no_ entries (that the result is zero). Any other value (1 or higher) is an estimate. For example, if it reports one, in reality it could be more than one. (Actually at such low cardinalities linear counting is used, but this is only an implementation detail.) – Thomas Mueller Oct 05 '19 at 19:28