2

I need help formulating a cohort/retention query

I am trying to build a query to look at visitors who performed ActionX on their first visit (in the time frame) and then how many days later they returned to perform Action X again

The output I (eventually) need looks like this...

screen

The table I am dealing with is an export of Google Analytics to BigQuery

If anyone could help me with this or anyone who has written a query similar that I can manipulate?

Thanks

Shaz
  • 39
  • 1
  • 3
  • Hello and welcome to Stack Overflow, please take a time to go through the [welcome tour](https://stackoverflow.com/tour) to know your way around here (and also to earn your first badge), read how to create a [mcve] example and also check [ask] so you increase your chances to get feedback and useful answers. – garfbradaz Aug 01 '17 at 15:24

4 Answers4

2

Just to give you simple idea / direction

Below is for BigQuery Standard SQL

#standardSQL
SELECT 
  Date_of_action_first_taken,
  ROUND(100 * later_1_day / Visits) AS later_1_day,
  ROUND(100 * later_2_days / Visits) AS later_2_days,
  ROUND(100 * later_3_days / Visits) AS later_3_days
FROM `OutputFromQuery`  

You can test it with below dummy data from your question

#standardSQL
WITH `OutputFromQuery` AS (
  SELECT '01.07.17' AS Date_of_action_first_taken, 1000 AS Visits, 800 AS later_1_day, 400 AS later_2_days, 300 AS later_3_days UNION ALL
  SELECT '02.07.17', 1000, 860, 780, 860 UNION ALL
  SELECT '29.07.17', 1000, 780, 120, 0 UNION ALL
  SELECT '30.07.17', 1000, 710, 0, 0
)
SELECT 
  Date_of_action_first_taken,
  ROUND(100 * later_1_day / Visits) AS later_1_day,
  ROUND(100 * later_2_days / Visits) AS later_2_days,
  ROUND(100 * later_3_days / Visits) AS later_3_days
FROM `OutputFromQuery`  

The OutputFromQuery data is as below:

Date_of_action_first_taken  Visits  later_1_day later_2_days    later_3_days  
01.07.17                    1000    800         400             300  
02.07.17                    1000    860         780             860  
29.07.17                    1000    780         120             0    
30.07.17                    1000    710         0               0    

and the final output is:

Date_of_action_first_taken  later_1_day later_2_days    later_3_days     
01.07.17                    80.0        40.0            30.0     
02.07.17                    90.0        78.0            86.0     
29.07.17                    80.0        12.0            0.0  
30.07.17                    70.0        0.0             0.0  
Mikhail Berlyant
  • 165,386
  • 8
  • 154
  • 230
  • Thankyou Mikhail! This helped give me a flavour. I have put my query (or where i've got to above) if you are able to check that out and let me know what you think? Thankyou for your response! – Shaz Aug 02 '17 at 14:10
2

I found this query on Turn Your App Data into Answers with Firebase and BigQuery (Google I/O'19)

It should work :)

#standardSQL

###################################################
# Part 1: Cohort of New Users Starting on DEC 24
###################################################
WITH 
new_user_cohort AS (
  SELECT DISTINCT
    user_pseudo_id as new_user_id
  FROM
    `[your_project].[your_firebase_table].events_*`
  WHERE
    event_name =  `[chosen_event] ` AND
    #set the date from when starting cohort analysis
    FORMAT_TIMESTAMP("%Y%m%d", TIMESTAMP_TRUNC(TIMESTAMP_MICROS(event_timestamp), DAY, "Etc/GMT+1")) = '20191224' AND
    _TABLE_SUFFIX BETWEEN '20191224' AND '20191230'
),

num_new_users AS (
  SELECT count(*) as num_users_in_cohort FROM new_user_cohort
),

#############################################
# Part 2: Engaged users from Dec 24 cohort
#############################################
engaged_users_by_day AS (
  SELECT
    FORMAT_TIMESTAMP("%Y%m%d", TIMESTAMP_TRUNC(TIMESTAMP_MICROS(event_timestamp), DAY, "Etc/GMT+1")) as event_day,
    COUNT(DISTINCT user_pseudo_id) as num_engaged_users
  FROM
    `[your_project].[your_firebase_table].events_*`
  INNER JOIN
    new_user_cohort ON new_user_id = user_pseudo_id
  WHERE
    event_name = 'user_engagement' AND
    _TABLE_SUFFIX BETWEEN '20191224' AND '20191230'
  GROUP BY
    event_day
)


####################################################################
# Part 3: Daily Retention = [Engaged Users / Total Users]
####################################################################
SELECT
  event_day,
  num_engaged_users,
  num_users_in_cohort,
  ROUND((num_engaged_users / num_users_in_cohort), 3) as retention_rate
FROM
  engaged_users_by_day
CROSS JOIN
  num_new_users
ORDER BY
  event_day
totooooo
  • 1,050
  • 1
  • 12
  • 32
  • 1
    This query is right but it will only give you an analysis of the cohort that did an action on the 24th. In his question, the user asked for something that would give him the analysis for several cohorts at the same time. You can maybe improve this query by adding the date in subqueries of step 1, and when joining in engaged_users_by_day, including the dates of the cohorts of engaged users (from new_user_cohort). – totooooo Jan 14 '20 at 21:12
1

So I think I may have cracked it... from this output I then would need to manipulate it (pivot table it) to make it look like the desired output.

Can anyone review this for me and let me know what you think?

`WITH
cohort_items AS (
SELECT 
MIN( TIMESTAMP_TRUNC(TIMESTAMP_MICROS((visitStartTime*1000000 + 
h.time*1000)), DAY) ) AS cohort_day, fullVisitorID
FROM
TABLE123 AS U,
UNNEST(hits) AS h
WHERE _TABLE_SUFFIX BETWEEN "20170701" AND "20170731"
AND 'ACTION TAKEN'
GROUP BY 2
),


user_activites AS (
SELECT
A.fullVisitorID,
DATE_DIFF(DATE(TIMESTAMP_TRUNC(TIMESTAMP_MICROS((visitStartTime*1000000 + h.time*1000)), DAY)), DATE(C.cohort_day), DAY) AS day_number 
FROM `Table123` A

LEFT JOIN cohort_items C ON A.fullVisitorID = C.fullVisitorID,
UNNEST(hits) AS h

WHERE
A._TABLE_SUFFIX BETWEEN "20170701 AND "20170731"

AND 'ACTION TAKEN'
GROUP BY 1,2),

cohort_size AS (
SELECT 
cohort_day,
count(1) as number_of_users
FROM 
cohort_items
GROUP BY 1
ORDER BY 1
),

retention_table AS (
SELECT
C.cohort_day,
A.day_number,
COUNT(1) AS number_of_users
FROM
user_activites A

LEFT JOIN cohort_items C ON A.fullVisitorID = C.fullVisitorID
GROUP BY 1,2
)


SELECT
B.cohort_day,
S.number_of_users as total_users,
B.day_number,
B.number_of_users  /  S.number_of_users as percentage
FROM retention_table B

LEFT JOIN cohort_size S ON B.cohort_day = S.cohort_day

WHERE B.cohort_day IS NOT NULL
ORDER BY 1, 3
`

Thank you in advance!

Shaz
  • 39
  • 1
  • 3
1

If you use some techniques available in BigQuery, you can potentially solve this type of problem with very cost and performance effective solutions. As an example:

SELECT
  init_date,
  ARRAY((SELECT AS STRUCT days, freq, ROUND(freq * 100 / MAX(freq) OVER(), 2) FROM UNNEST(data) ORDER BY days)) data
FROM(
  SELECT
  init_date,
  ARRAY_AGG(STRUCT(days, freq)) data
FROM(
  SELECT
    init_date,
    data AS days,
    COUNT(data) freq
FROM(
  SELECT
    init_date,
    ARRAY(SELECT DATE_DIFF(PARSE_DATE("%Y%m%d", dts), PARSE_DATE("%Y%m%d", init_date), DAY) AS dt FROM UNNEST(dts) dts) data
  FROM(
    SELECT 
      MIN(date) init_date,
      ARRAY_AGG(DISTINCT date) dts
    FROM `Table123`
    WHERE TRUE
    AND EXISTS(SELECT 1 FROM UNNEST(hits) where eventinfo.eventCategory = 'recommendation') -- This is your 'ACTION TAKEN' filter
    AND _TABLE_SUFFIX BETWEEN "20170724" AND "20170731"
    GROUP BY fullvisitorid
    )
    ),
    UNNEST(data) data
    GROUP BY init_date, days
   )
  GROUP BY init_date
)

I tested this query against our G.A data and selected customers who interacted with our recommendation system (as you can see in the filter selection WHERE EXISTS...). Example of result (omitted absolute values of freq for privacy reasons):

enter image description here

As you can see, at day 28th for instance, 8% of customers came back 1 day later and interacted with the system again.

I recommend you to play around with this query and see if it works well for you. It's simpler, cheaper, faster and hopefully easier to maintain.

Willian Fuks
  • 11,259
  • 10
  • 50
  • 74