What algorithm is used to achieve APPROX_TOP_COUNT in BigQuery？

Question

BigQuery says that Approximate aggregate functions are scalable in terms of memory usage and time, but produce approximate results instead of exact results.

I can't find any function like this in drill or hive. With the cluster computing,we can easily get the exact results,why and when we should use this APPROX_FUNC? I also hope someone can tell me what algorithm is used in APPROX_TOP_COUNT?

score 2 · Answer 1 · answered Mar 06 '17 at 13:03

One example of where approximate functions can be useful is analysis of Firebase event logs (there are a number of questions about BigQuery/Firebase on StackOverflow). If you just want to know the top 10 most visited pages, for instance, you could use APPROX_TOP_COUNT to perform the analysis, which would usually be faster than COUNT(*) with GROUP BY and ORDER BY ... LIMIT ....

From an implementation standpoint, you can imagine that if you are only interested in the top 10 most visited pages, it probably isn't necessary to keep state in memory for the long tail of infrequently-visited pages since it's just going to be discarded later.

You can read about approximate algorithms in papers such as:

Thx for your answer，I have used bigquery test the APPROX_TOP_COUNT on bigquery's publicdata(1,108,779,463 row ).The result is APPROX_TOP_COUNT and COUNT(*) with GROUP BY and ORDER BY ... LIMIT both spent 2s. Can you tell me some examples about the performance of these two queries? — cxco, Mar 07 '17 at 04:25

What algorithm is used to achieve APPROX_TOP_COUNT in BigQuery？

1 Answers1