1

I am wondering if it is possible to do an approximate distinct count in the following way:

  1. I have an aggregation like this:
  +---------+----------------------+-------------------------------+
  | country | unique products sold | helper_data -- limit 1MB size |
  +---------+----------------------+-------------------------------+
  | US      | 100,000,005          | ??                            |
  | CA      | 192,394,293          | ??                            |
  +---------+----------------------+-------------------------------+
  1. And I'm wondering if I can get the following:
  +---------+--------------------------------------+
  | country |         unique products sold         |
  +---------+--------------------------------------+
  | [ALL]   | 205,493,599 # possible to get this?? |
  | US      | 100,000,005                          |
  | CA      | 192,394,293                          |
  +---------+--------------------------------------+

In other words, without passing all the values (there are too many and I don't have enough memory to process it), could some sort of hash (or something else) be passed for each territory-specific line-item, to approximate what the approximate distinct count would be when added together between multiple items? Or is this not possible to do.

Note that I'm not looking for a sql approach, I'm only curious to see if its possible to pass some sort of object/hash/etc. back for each line-item and then build an approximate unique count across multiple line-items.

Nimantha
  • 6,405
  • 6
  • 28
  • 69
David542
  • 104,438
  • 178
  • 489
  • 842
  • you are looking for [HyperLogLog++ Functions in Standard SQL](https://cloud.google.com/bigquery/docs/reference/standard-sql/hll_functions) - it does exactly what you ask – Mikhail Berlyant May 24 '19 at 01:06
  • @MikhailBerlyant could you please show how this would be done -- doing the HLL within BigQuery, and then the count_merge outside of it (such as in python?) – David542 May 24 '19 at 01:08
  • i am not python person but can show you with sql (BigQuery SQL) – Mikhail Berlyant May 24 '19 at 01:09

1 Answers1

3

Below is simplified example for BigQuery Standard SQL that [I think] reproduces exactly your use case

#standardSQL
WITH `project.dataset.table` AS (
  SELECT 'us' country, 1 product_id UNION ALL
  SELECT 'us', 2 UNION ALL
  SELECT 'us', 3 UNION ALL
  SELECT 'us', 4 UNION ALL
  SELECT 'us', 5 UNION ALL
  SELECT 'ca', 3 UNION ALL
  SELECT 'ca', 4 UNION ALL
  SELECT 'ca', 5 UNION ALL
  SELECT 'ca', 6 UNION ALL
  SELECT 'ca', 7 UNION ALL
  SELECT 'ca', 8 UNION ALL
  SELECT 'ca', 9
), aggregation AS (
  SELECT country, 
    COUNT(DISTINCT product_id) unique_products_sold,
    HLL_COUNT.INIT(product_id) AS helper_data
  FROM `project.dataset.table`
  GROUP BY country
)
SELECT country, unique_products_sold FROM aggregation UNION ALL
SELECT 'all', HLL_COUNT.MERGE(helper_data) FROM aggregation 

with result

Row country unique_products_sold     
1   ca      7    
2   us      5    
3   all     9    

As you can see, this is quite simple query that you can use in whatever your preferred client - like python for example

Mikhail Berlyant
  • 165,386
  • 8
  • 154
  • 230