BigQuery to Data Studio : Show reliable COUNT DISTINCT regardless of the selected period

Question

in my BigQuery project I store event data integrated from Firebase. The granularity and dimension is such that trying to present raw data in Data Studio quickly makes the report become VERY slow (1-2 min per page/interaction).

I then started to think how I could create pre-aggregated tables in BigQuery to speed everything up, but quickly realised COUNT DISTINCT metrics would be a problem with this approach. Let me explain:

SELECT user, date
FROM UNNEST([
  STRUCT("Adam" AS user, "20190923" AS date),
  ("Bob", "20190923"),
  ("Carl", "20190923"),
  ("Adam", "20190924"),
  ("Bob", "20190924"),
  ("Adam", "20190925"),
  ("Carl", "20190925"),
  ("Bob", "20190926")
]) AS website_visits;

+------+----------+
| User |   Date   |
+------+----------+
| Adam | 20190923 |
| Bob  | 20190923 |
| Carl | 20190923 |
| Adam | 20190924 |
| Bob  | 20190924 |
| Adam | 20190925 |
| Carl | 20190925 |
| Bob  | 20190926 |
+------+----------+

The above is a table of website visits.

Clearly, creating a pre-aggregated table like

SELECT date, COUNT(DISTINCT user) FROM website_visits GROUP BY date

has the limitation that the count cannot be aggregated further (or even less, dinamically) to get a total, as doing a SUM would return 8 unique users which is not correct, there are only 3 unique users.

In BigQuery, this is fixed by using HLL_COUNT, which despite the approximation works ok for me.

Now to the big question:

How to do the same so that the result is displayable in Data Studio????

HLL_COUNT.EXTRACT is not available as function in there, and in the reporting I always have to keep in mind that the date range is set by the user however (s)he likes so it's not possible to store a pre-aggregated result for ALL cases...

EDIT 1: APPROX_COUNT_DISTINCT

As per answer from Bobbylank, I tried to use APPROX_COUNT_DISTINCT. However I found that this just seems to move the issue down the line. My fault for not explaining what's over there. Despite being performances acceptable it does not seem possible to me to blend a data source with this calculated metric.

Example: After displaying the amount of unique users in the selected period (which now works), I'm also trying to display Average Revenue Per User (ARPU) in Data Studio like Firebase does.

To do this, I have to SUM(REVENUE) / APPROX_COUNT_DISTINCT(USER)

Clearly, REVENUE works ok with pre-aggregation and is available in the raw data. I tried then to blend the raw data with a table containing just user visits. However APPROX_COUNT_DISTINCT can't be used in the blended data definition as calculated metrics are not allowed.

Even trying to use the USER field as a metric with Count Distinct aggregation, despite returning the correct figures when showing revenue and user count separately, when I try to divide them the problem becomes aggregation (apply SUM or AVG to the field and basically the result will be AVG(REVENUE/USERS) for each day).

I also then tried to store REVENUE directly in the visits table, but was reminded by Data Studio that I can't create calculated metrics that I can't mix dimensions and metrics in a calculated field.

score 0 · Answer 1 · answered Sep 26 '19 at 15:24

0

APPROX_COUNT_DISTINCT might be more performance friendly for you?

https://support.google.com/datastudio/answer/9189108?hl=en

Otherwise the only way I can think would be to pre-calculate several metrics (e.g. unique users on that day, 7-day cumulative, 14-day, etc.) as your customer require for each single day.

Or you could provide a 2 page report with both of these methods with the caveat that the first can be used over a time period but will be much slower?

answered Sep 26 '19 at 15:24

Bobbylank

1,906
7
15

1

agree with pre-calculating common timeframes. I would also consider requiring partitions to limit the data set for ad-hoc queries. – rtenha Sep 26 '19 at 23:21
hey! thanks for your reply. I tried this approach but still find myself at a dead end. Pre-calculating 28-day, 7-day etc is something I'm already doing but I feel like it takes freedom from the user to select the period that interests them – Giorgio Terreni Sep 27 '19 at 08:58
Fair enough. Have you looked into BI Engine to speed up your results? – Bobbylank Sep 27 '19 at 09:24
(see EDIT1 for my results on using APPROX_COUNT_DISTINCT) Have you worked with Firebase integration to BigQuery? The raw data is hard to use in DataStudio if you want to make heavy use (like me) of the nested fields (user properties and event parameters). To fix this I had to rely on a view that unnests the fields I need, but I can't use BI engine on a view... :-( – Giorgio Terreni Sep 27 '19 at 13:49
Not firebase but I've worked with similar data. Sounds like you'd benefit from creating a date-partitioned tables rather than views? – Bobbylank Sep 30 '19 at 07:08
1

in this specific case I would rather have partitioned tables on other fields, but I see that's not possible either if the fields is not a date/timestamp. I'm seeing positive results by creating semi-aggregated tables (i.e. keeping the user_id despite increasing a lot the cardinality) in combination with APPROX_COUNT_DISTINCT, so I believe I will pursue this solution – Giorgio Terreni Sep 30 '19 at 07:53

BigQuery to Data Studio : Show reliable COUNT DISTINCT regardless of the selected period

1 Answers1