I am wondering if it is possible to do an approximate distinct count in the following way:
- I have an aggregation like this:
+---------+----------------------+-------------------------------+
| country | unique products sold | helper_data -- limit 1MB size |
+---------+----------------------+-------------------------------+
| US      | 100,000,005          | ??                            |
| CA      | 192,394,293          | ??                            |
+---------+----------------------+-------------------------------+
- And I'm wondering if I can get the following:
+---------+--------------------------------------+
| country | unique products sold                 |
+---------+--------------------------------------+
| [ALL]   | 205,493,599 # possible to get this?? |
| US      | 100,000,005                          |
| CA      | 192,394,293                          |
+---------+--------------------------------------+
In other words, without passing all the values (there are too many, and I don't have enough memory to process them), could some sort of hash (or something else) be stored for each territory-specific line item, so that the helper data from several line items can be combined to approximate the distinct count across all of them? Or is this not possible?
Note that I'm not looking for a SQL approach. I'm only curious whether it's possible to pass back some sort of object/hash/etc. for each line item and then build an approximate unique count across multiple line items.
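To make the idea concrete, here is a minimal sketch of the kind of mergeable helper object I have in mind. It assumes a HyperLogLog-style register array, which is one well-known structure that seems to fit this description. The names (`new_sketch`, `add`, `merge`, `estimate`) and the choice of 2^14 registers are my own illustrative assumptions, not taken from any particular library.

```python
import hashlib
import math

P = 14        # assumed precision: 2**14 = 16384 one-byte registers (~16 KB, far below 1 MB)
M = 1 << P


def new_sketch():
    """Empty helper_data: one zeroed register per bucket."""
    return bytearray(M)


def add(sketch, value):
    """Record one product id (a string) in the sketch."""
    # 64-bit hash of the value.
    digest = hashlib.blake2b(value.encode("utf-8"), digest_size=8).digest()
    h = int.from_bytes(digest, "big")
    idx = h >> (64 - P)                       # top P bits choose the register
    rest = h & ((1 << (64 - P)) - 1)          # remaining 64 - P bits
    rank = (64 - P) - rest.bit_length() + 1   # position of the leftmost 1-bit
    if rank > sketch[idx]:
        sketch[idx] = rank


def merge(a, b):
    """Combine two sketches: the register-wise maximum represents the union."""
    return bytearray(max(x, y) for x, y in zip(a, b))


def estimate(sketch):
    """Approximate number of distinct values recorded in the sketch."""
    alpha = 0.7213 / (1 + 1.079 / M)
    raw = alpha * M * M / sum(2.0 ** -r for r in sketch)
    zeros = sketch.count(0)
    if raw <= 2.5 * M and zeros:
        # Small-range correction (linear counting).
        return M * math.log(M / zeros)
    return raw
```

With something like this, the [ALL] row would be `estimate(merge(us_sketch, ca_sketch))`. Each sketch stays at 16 KB no matter how many products were added, and the typical relative error at this size is roughly 1%.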