I'm attempting to perform a cohort analysis on a very large table. I have a test table with ~30M rows (more than double that in production). The query fails in BigQuery with a "resources exceeded" error, and it runs at billing tier 18 (tier 1 is $5, so it's a $90 query!).

The query:

with cohort_active_user_count as (
  select 
    DATE(`BQ_TABLE`.created_at, '-05:00') as created_at,
    count(distinct `BQ_TABLE`.bot_user_id) as count,
    `BQ_TABLE`.bot_id as bot_id
  from `BQ_TABLE`
  group by created_at, bot_id
)

select created_at, period as period,
  active_users, retained_users, retention, bot_id
from (
  select 
    DATE(`BQ_TABLE`.created_at, '-05:00') as created_at,
    DATE_DIFF(DATE(future_message.created_at, '-05:00'), DATE(`BQ_TABLE`.created_at, '-05:00'), DAY) as period,
    max(cohort_size.count) as active_users, -- all equal in group
    count(distinct future_message.bot_user_id) as retained_users,
    count(distinct future_message.bot_user_id) / max(cohort_size.count) as retention,
    `BQ_TABLE`.bot_id as bot_id
  from `BQ_TABLE`
  left join `BQ_TABLE` as future_message on
    `BQ_TABLE`.bot_user_id = future_message.bot_user_id
    and `BQ_TABLE`.created_at < future_message.created_at
    and TIMESTAMP_ADD(`BQ_TABLE`.created_at, interval 720 HOUR) >= future_message.created_at
    and `BQ_TABLE`.bot_id = future_message.bot_id 
  left join cohort_active_user_count as cohort_size on 
    DATE(`BQ_TABLE`.created_at, '-05:00') = cohort_size.created_at
    and `BQ_TABLE`.bot_id = cohort_size.bot_id 
  group by 1, 2, bot_id) t
where period is not null
and bot_id = 80
order by created_at, period, bot_id

Here is the desired output:

(screenshot of the desired output table omitted)

From my understanding of BigQuery, the joins are causing the major performance hit, because every node has to process them. The table is partitioned by day, which I'm not yet taking advantage of in this query, but I know the query will still need to be optimized beyond that.
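
For example, I believe filtering on the _PARTITIONTIME pseudo column along these lines (the date range is only an illustration) would restrict the scan to the relevant daily partitions before any join work happens:

select bot_user_id, bot_id, created_at
from `BQ_TABLE`
where _PARTITIONTIME >= TIMESTAMP('2016-09-01')
  and _PARTITIONTIME < TIMESTAMP('2016-10-01')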

How can this query be optimized, or rewritten to avoid the joins, so that BigQuery can process it more efficiently in parallel?

mnort9

3 Answers


Step #1

Try the version below.
I moved the join on cohort_active_user_count outside the inner SELECT, as I think it is one of the main reasons the query is so expensive. And as you can see, it now uses a JOIN instead of a LEFT JOIN for that step, since a LEFT JOIN is not needed there.

Please test it and let us know the result.

WITH cohort_active_user_count AS (
  SELECT 
    DATE(BQ_TABLE.created_at, '-05:00') AS created_at,
    COUNT(DISTINCT BQ_TABLE.bot_user_id) AS COUNT,
    BQ_TABLE.bot_id AS bot_id
  FROM BQ_TABLE
  GROUP BY created_at, bot_id
)
SELECT t.created_at, period AS period,
  cohort_size.count AS active_users, retained_users, 
  retained_users / cohort_size.count AS retention, t.bot_id
FROM (
  SELECT 
    DATE(BQ_TABLE.created_at, '-05:00') AS created_at,
    DATE_DIFF(DATE(future_message.created_at, '-05:00'), DATE(BQ_TABLE.created_at, '-05:00'), DAY) AS period,
    COUNT(DISTINCT future_message.bot_user_id) AS retained_users,
    BQ_TABLE.bot_id AS bot_id
  FROM BQ_TABLE
  LEFT JOIN BQ_TABLE AS future_message 
    ON BQ_TABLE.bot_user_id = future_message.bot_user_id
    AND BQ_TABLE.created_at < future_message.created_at
    AND TIMESTAMP_ADD(BQ_TABLE.created_at, interval 720 HOUR) >= future_message.created_at
    AND BQ_TABLE.bot_id = future_message.bot_id 
  GROUP BY 1, 2, bot_id
  HAVING period IS NOT NULL
) t
JOIN cohort_active_user_count AS cohort_size 
  ON t.created_at = cohort_size.created_at
  AND t.bot_id = cohort_size.bot_id 
WHERE t.bot_id = 80
ORDER BY created_at, period, bot_id  

Step #2

Below "further optimization" is based on assumption that your BQ_TABLE is a raw data with multiple entries for the same user_id/bit_id for the same day - thus increasing a lot expenses of LEFT JOIN in inner SELECT.
I propose first to aggregate this, as it is done below. In addition to drastically reducing size of JOIN - it also eliminates all those conversion from TIMESTAMP to DATE in each join'ed row

WITH BQ_TABLE_AGG AS (
  SELECT bot_id, bot_user_id, DATE(BQ_TABLE.created_at, '-05:00') AS created_at
  FROM BQ_TABLE
  GROUP BY 1, 2, 3
),
cohort_active_user_count AS (
  SELECT 
    created_at,
    COUNT(DISTINCT bot_user_id) AS COUNT,
    bot_id AS bot_id
  FROM BQ_TABLE_AGG
  GROUP BY created_at, bot_id
)
SELECT t.created_at, period AS period,
  cohort_size.count AS active_users, retained_users, 
  retained_users / cohort_size.count AS retention, t.bot_id
FROM (
  SELECT 
    BQ_TABLE_AGG.created_at AS created_at,
    DATE_DIFF(future_message.created_at, BQ_TABLE_AGG.created_at, DAY) AS period,
    COUNT(DISTINCT future_message.bot_user_id) AS retained_users,
    BQ_TABLE_AGG.bot_id AS bot_id
  FROM BQ_TABLE_AGG
  LEFT JOIN BQ_TABLE_AGG AS future_message 
    ON BQ_TABLE_AGG.bot_user_id = future_message.bot_user_id
    AND BQ_TABLE_AGG.created_at < future_message.created_at
    AND DATE_ADD(BQ_TABLE_AGG.created_at, INTERVAL 30 DAY) >= future_message.created_at
    AND BQ_TABLE_AGG.bot_id = future_message.bot_id 
  GROUP BY 1, 2, bot_id
  HAVING period IS NOT NULL
) t
JOIN cohort_active_user_count AS cohort_size 
  ON t.created_at = cohort_size.created_at
  AND t.bot_id = cohort_size.bot_id 
WHERE t.bot_id = 80
ORDER BY created_at, period, bot_id
Mikhail Berlyant
  • @mnort9 - did you have chance to try? – Mikhail Berlyant Oct 25 '16 at 15:00
  • This works excellent, thanks! I had figured out using an aggregate was the way to go, but additionally removing the timestamp conversion reduced the query from ~60 secs to ~11 secs. Once I utilize the partition time, it should be pretty speedy. – mnort9 Oct 26 '16 at 14:32

If you don't want to enable a higher billing tier given the costs, here are a couple of suggestions that might help to reduce the CPU requirements:

  • Use INNER JOINs rather than LEFT JOINs if you can. INNER JOINs should generally be less CPU-intensive, but then again you won't get unmatched rows like you would with LEFT JOINs.
  • Use APPROX_COUNT_DISTINCT(expr) instead of COUNT(DISTINCT expr). You won't get an exact count, but it's less CPU-intensive and may be "good enough" depending on your needs; a sketch of the substitution is below.
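
For example, only the cohort CTE from the question would change (this is a sketch of the substitution, not an exact rewrite; the rest of the query stays as-is):

WITH cohort_active_user_count AS (
  SELECT
    DATE(created_at, '-05:00') AS created_at,
    APPROX_COUNT_DISTINCT(bot_user_id) AS count,  -- approximate, but much cheaper than COUNT(DISTINCT ...)
    bot_id
  FROM BQ_TABLE
  GROUP BY created_at, bot_id
)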

You could also consider manually breaking the query into stages of computation, e.g. writing the result of the WITH clause to a table and then using that table in the subsequent query. I don't know what the specific cost tradeoffs would be, though.
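
A minimal sketch of that staging idea, assuming DDL statements are available in your project and using a hypothetical dataset name (alternatively, write the query result to a destination table via the job settings):

-- Stage 1: materialize the cohort sizes once ("mydataset" is a placeholder)
CREATE TABLE mydataset.cohort_active_user_count AS
SELECT
  DATE(created_at, '-05:00') AS created_at,
  COUNT(DISTINCT bot_user_id) AS count,
  bot_id
FROM BQ_TABLE
GROUP BY created_at, bot_id;

-- Stage 2: run the main query, joining to mydataset.cohort_active_user_count
-- instead of recomputing the WITH clause.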

Elliott Brossard

Why is it tagged MySQL?

In MySQL, I would change

max(cohort_size.count) as active_users, -- all equal in group

to

( SELECT max(count) FROM cohort_active_user_count WHERE ... ) as active_users,

and remove the JOIN to that table. Without doing this, you risk inflating the COUNT(...) values!
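
The WHERE there would correlate on the same columns the removed JOIN matched on; roughly (my reading of the original query, with the timezone shift omitted and the outer table aliased as msg for illustration):

( SELECT MAX(c.count)
    FROM cohort_active_user_count AS c
    WHERE c.bot_id = msg.bot_id                 -- same bot as the outer group
      AND c.created_at = DATE(msg.created_at)   -- same cohort day as the outer group
) as active_users,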

Also move the division to get retention into the outside query.

Once you have done that, you can also move the other JOIN into a subquery:

( SELECT count(distinct future_message.bot_user_id)
    FROM ... WHERE ... ) as retained_users,

I would have these indexes. Note that created_at needs to be last.

cohort_active_user_count:  INDEX(bot_id, created_at)
future_message:  INDEX(bot_id, bot_user_id, created_at)
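
In MySQL terms, that would be something like the following (cohort_active_user_count would need to be a real, materialized table to carry an index, and "messages" is a stand-in for whatever table is behind BQ_TABLE/future_message):

ALTER TABLE cohort_active_user_count
  ADD INDEX (bot_id, created_at);

ALTER TABLE messages
  ADD INDEX (bot_id, bot_user_id, created_at);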
Rick James