0

We use 'hits.product.v2ProductCategory' grouping by 'date' and 'fullVisitorId' to get the categories visited for each date*fullVisitorId combination. This includes duplicated categories.

Our goal is getting an aggregated variable over time range (grouping by 'fullVisitorId', it means just one row for each), with the number of unique categories (in one step, if possible).

Felipe Hoffa's solution of creating as many variables as categories (BigQuery: SPLIT() returns only one value) is not viable because of the high amount and dynamic nature of them.

The other option in how to do nested SQL select count, forces us to group by 'hits.product.v2ProductCategory' besides 'fullVisitorId' and 'date' in a subquery, giving an intermediate table with as many rows as categories for each date*fullVisitorId.

We've tried it using GROUP_CONCAT(UNIQUE) to get all the unique categories in a string and then LENGTH(REGEXP_REPLACE) to obtain the number of elements except the separator character.

Is there another more efficient approach?

Thanks in advance,
Lourdes Hernández

Community
  • 1
  • 1

1 Answers1

0

I understand you are trying to address the fact that in BigQuery Legacy SQL the COUNT(DISTINCT) returns approximate count (statistical approximation) and is not guaranteed to be exact

To count exact distinct values in BigQuery Legacy SQL you can use EXACT_COUNT_DISTINCT

Also note, in BigQuery Standard SQL COUNT(DISTINCT) returns exact count

Mikhail Berlyant
  • 165,386
  • 8
  • 154
  • 230
  • Thanks for your response, Mikhail. The issue it's not exactly about the approximation, but what kind of treatment apply to this variable to optimize the process in few steps and avoid to group by additional variables besides 'fullVisitorId'. Regards, Lourdes – Lourdes Hernández Jul 01 '16 at 06:59