1

I'm querying a large data set to figure out if a bunch of campaign events (i.e. event 1,2,..) during different timepoints gives a result in user activity (active, inactive) during the following 3 days after each event (but not in the same day as the campaign event itself).

I'm merging two tables to do this, and they look like this merged:

| date       | user | events | day_activity  | 
| 2020-01-01 | 1    | event1 | active        | 
| 2020-01-01 | 2    | event1 | inactive      | 
| 2020-01-02 | 1    | null   | inactive      |  
| 2020-01-02 | 2    | null   | active        | 
| 2020-01-03 | 1    | null   | inactive      | 
| 2020-01-03 | 2    | null   | active        | 
| 2020-01-04 | 1    | null   | active        | 
| 2020-01-04 | 2    | null   | active        | 

What I am trying to achieve is, for each user/date/event gang (= row) where an event occured, to add another column called 3_day_activity, containing the activity not on the event (= current row) day but the following 3 days only (giving a score of 1 per active day). An example for how the 1st day of this table would look after (I add * in the activity days counted in the added column for user 1, and # for the events counted in the column for user 2)):

| date       | user | events | day_activity  | 3_day_activity
| 2020-01-01 | 1    | event1 | active        | 1
| 2020-01-01 | 2    | event1 | inactive      | 3
| 2020-01-02 | 1    | null   | inactive * (0)| null (bco no event)
| 2020-01-02 | 2    | null   | active # (1)  | null (bco no event)
| 2020-01-03 | 1    | null   | inactive * (0)| null (bco no event)
| 2020-01-03 | 2    | null   | active # (1)  | null (bco no event)
| 2020-01-04 | 1    | null   | active * (1)  | null (bco no event)
| 2020-01-04 | 2    | null   | active # (1)  | null (bco no event)

I tried solving this with a window function. It runs, but I think I misunderstood some important idea on how to design it, because the result contains a ton of repetitions...

  cm.date, 
  cm.user,
  event,
  day_activity,
  COUNTIF(active_today = 'active') OVER 3d_later AS 3_day_activity
FROM `customer_message` cm
INNER JOIN `customer_day` ud
  ON cm.user = ud.user
  AND cm.date = ud.date
WHERE 
  cm.date > '2019-12-25'
WINDOW 3d_later AS (PARTITION BY user ORDER BY UNIX_DATE(cm.date) RANGE BETWEEN 1 FOLLOWING AND 3 FOLLOWING)

EDIT:

I was asked to supply an example of how this repetition might look. Here's what I see if I add an "ORDER BY 3_day_activity" clause at the end of the query:

Row    date         user  day_activity  3_day_activity  
1      2020-01-01   2     active        243
2      2020-01-01   2     active        243
3      2020-01-01   2     active        243
4      2020-01-01   2     active        243
5      2020-01-01   2     active        243
6      2020-01-01   2     active        243
7      2020-01-02   2     active        243
8      2020-01-02   2     active        243

EDIT2 :

This remains unsolved.. I have tried asking another question, as per the suggestion of one commenter, but I am locked from doing so even if the problem is not identical (I suppose due to the similarities to this one). I have tested grouping based on user and date, but I then it instead throws an error due to not aggregating in the 'COUNTIF' clause.

This is the attempt mentioned; SQL: Error demanding aggregation when counting, grouping and windowing

GMB
  • 216,147
  • 25
  • 84
  • 135
Galaffer
  • 171
  • 2
  • 13
  • 1
    What is column `day_index`? It is in the query, but it's not showing in your sample data. – GMB Sep 11 '20 at 21:00
  • My mistake: I was trying to create a smaller example by removing day_index, which is something I use along with the date. It is an integer that denotes number of days since first contact from customer, so if they contact us on 2020-01-01, then that's day_index 1, and 2020-01-03 would be day_index 3. I will correct it for the date so the example makes sense. – Galaffer Sep 11 '20 at 21:03
  • What do you mean by *the result contains a ton of repetitions*? Please show us the result you are getting. – GMB Sep 11 '20 at 21:03
  • I will add an answer with some output at the end of the question. – Galaffer Sep 11 '20 at 21:09

2 Answers2

1

You seem to be quite there. A range partition is the right way to go. BigQuery only supports integers in such frame, so we need to convert the date to a number; since you have dates with no time component, UNIX_DATE() comes to mind:

WINDOW 3d_later AS (
    PARTITION BY user 
    ORDER BY UNIX_DATE(cm.date) 
    RANGE BETWEEN 1 FOLLOWING AND 3 FOLLOWING
)
GMB
  • 216,147
  • 25
  • 84
  • 135
  • Thank you, good to know I am somewhat on the right path. Hmm. As I wrote above, I was actually using an integer to count with. Perhaps the underlying problem could be in just that, what happens if a user gets a bunch of what I referred to as campaign events in one day (this happens if they trigger them with actions on rare occasions, and I have millions of rows so there's going to be some of those occasions). Perhaps I need to include further constraints in the `PARTITION BY` clause to avoid registering the same user as active multiple times in one day (and tallying those up to a huge sum)? – Galaffer Sep 11 '20 at 21:25
1

Below example is for BigQuery Standard SQL

#standardSQL
SELECT *, IF(events IS NULL, 0, COUNTIF(day_activity = 'active') OVER(three_day_activity_window)) AS three_day_activity
FROM `project.dataset.table`
WINDOW three_day_activity_window AS (
  PARTITION BY user 
  ORDER BY UNIX_DATE(date) 
  RANGE BETWEEN 1 FOLLOWING AND 3 FOLLOWING
)

You can test, play with above using sample data from your question as in below example

#standardSQL
WITH `project.dataset.table` AS (
  SELECT DATE '2020-01-01' date , 1 user, 'event1' events, 'active' day_activity UNION ALL
  SELECT '2020-01-01', 2, 'event1', 'inactive' UNION ALL
  SELECT '2020-01-02', 1, NULL, 'inactive' UNION ALL
  SELECT '2020-01-02', 2, NULL, 'active' UNION ALL
  SELECT '2020-01-03', 1, NULL, 'inactive' UNION ALL
  SELECT '2020-01-03', 2, NULL, 'active' UNION ALL
  SELECT '2020-01-04', 1, NULL, 'active' UNION ALL
  SELECT '2020-01-04', 2, NULL, 'active' 
)
SELECT *, IF(events IS NULL, 0, COUNTIF(day_activity = 'active') OVER(three_day_activity_window)) AS three_day_activity
FROM `project.dataset.table`
WINDOW three_day_activity_window AS (
  PARTITION BY user 
  ORDER BY UNIX_DATE(date) 
  RANGE BETWEEN 1 FOLLOWING AND 3 FOLLOWING
)
ORDER BY date, user   

with output

Row date        user    events  day_activity    three_day_activity   
1   2020-01-01  1       event1  active          1    
2   2020-01-01  2       event1  inactive        3    
3   2020-01-02  1       null    inactive        0    
4   2020-01-02  2       null    active          0    
5   2020-01-03  1       null    inactive        0    
6   2020-01-03  2       null    active          0    
7   2020-01-04  1       null    active          0    
8   2020-01-04  2       null    active          0       

Update for - to avoid registering the same user as active multiple times in one day (and tallying those up to a huge sum)?

If you want to avoid counting all activity for user on same day - use below adjusted version (note extra entry in sample data to introduce user's multiple activity on same day)

#standardSQL
WITH `project.dataset.table` AS (
  SELECT DATE '2020-01-01' DATE , 1 user, 'event1' events, 'active' day_activity UNION ALL
  SELECT '2020-01-01', 2, 'event1', 'inactive' UNION ALL
  SELECT '2020-01-02', 1, NULL, 'inactive' UNION ALL
  SELECT '2020-01-02', 2, NULL, 'active' UNION ALL
  SELECT '2020-01-03', 1, NULL, 'inactive' UNION ALL
  SELECT '2020-01-03', 2, NULL, 'active' UNION ALL
  SELECT '2020-01-04', 1, NULL, 'active' UNION ALL
  SELECT '2020-01-04', 1, NULL, 'active' UNION ALL
  SELECT '2020-01-04', 2, NULL, 'active' 
)
SELECT *, 
  IF(events IS NULL, 0, COUNTIF(day_activity = 'active') OVER(three_day_activity_window)) AS three_day_activity
FROM (
  SELECT date, user, MAX(events) events, MIN(day_activity) day_activity
  FROM `project.dataset.table` 
  GROUP BY date, user
)
WINDOW three_day_activity_window AS (
  PARTITION BY user 
  ORDER BY UNIX_DATE(date) 
  RANGE BETWEEN 1 FOLLOWING AND 3 FOLLOWING
)
ORDER BY date, user
Mikhail Berlyant
  • 165,386
  • 8
  • 154
  • 230
  • Thank you! After looking through thousands of rows, I think I am starting to understand what's going on, even if I didn't reach the answer as to how to solve it. It seems I am tallying the same day multiple times, so let's say we write to a customer 3 times in a day, and they reply (=are active). Then they're actually active for 1 day, but I'd tally that 3 times. It seems it also carries over days, so I end up counting up the activity of one customer over multiple days, probably forever, and summing that in the new 3_day_activity column.. Not what I intended.. – Galaffer Sep 11 '20 at 21:32
  • I think - the question as is - is answered. if it helped you to realized that you need to ask new question - please post the new one with all new details , sample data, etc. Meantime, if answered helped you - consider voting up :o) – Mikhail Berlyant Sep 11 '20 at 21:36
  • Sure, thanks. I'll upvote GMB's answer which helped me realise the question contained another element. I'll read up some more on how to avoid repeatedly tallying the same thing in this scenario if that's what's going on.. – Galaffer Sep 11 '20 at 21:38
  • 1
    see update to reflect your concern (based on how I understood it from your comments) – Mikhail Berlyant Sep 11 '20 at 21:44
  • Thank you! I will take some time to try and understand if I can solve using something like what you suggested. Regardless of that, this is very helpful as it gives me another idea on how to handle the issue. – Galaffer Sep 11 '20 at 21:54
  • we are not here to resolve your issues -this is not how SO works- we are here to answer your programming question! you ask - we answer! if question is answered and addresses the specific question (not the issue that triggered you to ask your question) - it is expected you also accept the answer. Then if you realize you still have question(s) - you post new question with specific details, samples of data and expected result - so we can answer your next question! hope this is clear. again - SO is not a forum where we discuss and resolve your issues/projects/homeworks - just you ask we answer – Mikhail Berlyant Sep 12 '20 at 20:27
  • In the strict interpretation of my originally faulty question, it was indeed answered by GMB, so the accurate thing would be to call it resolved there and ask another one which actually was about what I was wondering. Now I attempted to do that, but unfortunately it was marked as a duplicate. :) Thanks however for your guidance on SO. – Galaffer Sep 12 '20 at 20:34
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/221396/discussion-between-galaffer-and-mikhail-berlyant). – Galaffer Sep 12 '20 at 20:45