0

I have raw data as below. With each line is the record of an transaction of user, and the month when they made the transaction
enter image description here


What I want is to calculate the number of user who made order in a month and the number of repeated user (RETENTION) from last month, then I can know how many % of user is repeated user.

The desired result should look like this enter image description here

How can I do it in bigquery?

Anh Hoang
  • 79
  • 8

2 Answers2

1

One way to do it is to do it is through a self-join with the same table and a 1-month delay. That way, we match user&month combinations with user&previous-month to see if it's a returning user. For example, using the 2M row public table bigquery-public-data.hacker_news.stories and a particular user:

enter image description here

Note that prev_month is null (we used LEFT OUTER JOIN) for 2014-02-01 as the user was not active during 2014-01-01. We are joining on author and lagged months with:

FROM authors AS a 
LEFT OUTER JOIN authors AS b
ON a.author = b.author
AND a.month = DATE_ADD(b.month, INTERVAL 1 MONTH)

Then we count a user as repeating if the previous month was not null:

COUNT(a.author) AS num_users,
COUNTIF(b.month IS NOT NULL) AS num_returning_users

Note that we do not use DISTINCT here as we already grouped by author and month combinations when defining orders as CTE. You might need to take this into account for other examples.

Full query:

WITH
  authors AS (
  SELECT
    author,
    DATE_TRUNC(DATE(time_ts), MONTH) AS month
  FROM
    `bigquery-public-data.hacker_news.stories`
  WHERE
    author IS NOT NULL
  GROUP BY 1,2)

SELECT
  *,
  ROUND(100*SAFE_DIVIDE(num_returning_users,
      num_users),2) AS retention
FROM (
  SELECT
    a.month,
    COUNT(a.author) AS num_users,
    COUNTIF(b.month IS NOT NULL) AS num_returning_users
  FROM
    authors AS a
  LEFT OUTER JOIN
    authors AS b
  ON
    a.author = b.author
    AND a.month = DATE_ADD(b.month, INTERVAL 1 MONTH)
  GROUP BY 1
  ORDER BY 1
  LIMIT 100)

and results snippet:

enter image description here

which are correct results, i.e. for 2007-03-01:

enter image description here

Performance is not too fancy but in this case we are selecting only the fields needed for the aggregated data so scanned data is low and execution time not too high (~5s).

An alternative is to use EXISTS() inside COUNTIF() instead of the join but it takes longer for me (~7s). Query

Guillem Xercavins
  • 6,938
  • 1
  • 16
  • 35
  • Super, it's exactly what I need. Thank you very much, cannot recognize that the LEFT JOIN and date_add can be used to solve my problem like this. – Anh Hoang Jan 13 '20 at 08:43
0

If you are just looking at the previous month, then do the following:

  • Convert the months to numbers.
  • Aggregate the data at the user/month level.

Then you can just use lag():

select month,
       count(*) as num_users,
       countif(prev_month_int = month_int - 1) as prev_num_users,
       countif(prev_month_int = month_int - 1) / count(*) as repeat_rate
from (select mu.*,
             lag(month_int) over (partition by userid order by month_int) as prev_month_int
      from (select month, userid, count(*) as num_orders,
                   cast(split(month, '-')[ordinal(1)] as int64) * 12 + cast(split(month, '-')[ordinal(2)] as int64) as month_int
            from t
            group by month, userid
           ) mu
     ) mu
group by month;
Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
  • Thank you for your help, though I have some problem using this method when the year change, ie Jan 2020 vs Dec 2019, just months in number form cause error. – Anh Hoang Jan 13 '20 at 08:45
  • @AnhHoang . . . The `month_int` column is designed to handle that situation. Isn't it consecutive with no gaps, regardless of year? – Gordon Linoff Jan 13 '20 at 11:21