Count number of events before certain events until another event is encountered in the same group in Big query?

Question

I want to count some certain values until a specific event occurred in SQL. This is very similar question to this question:

Count number of events before and after a event “A” till another event “A” is encountered in Big query?

The answer to this question didn't solve my problem which I am confused with RANGE BETWEEN 1 PRECEDING AND 1 PRECEDING too. Differently from that question I don't look for strings but I want to count other events.

My table would be like

   User       Event   Day
    1          C     2019-01-10
    1          B     2019-01-11
    1          D     2019-01-12
    1          A     2019-01-13
    2          D     2019-01-10
    2          B     2019-01-11
    2          C     2019-01-12
    2          D     2019-01-13
    2          A     2019-01-14
    2          E     2019-01-15

I would like to count C or D until event A or B occured.

I try COUNTIF(Event = 'C' OR Event = 'D') OVER(PARTITION BY User ORDER BY Day ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) AS count_events But this doesn't stop counting until event A or B. It counts all C or D events in partition.

My result table would look like this and the counting would stop if one of the event is occurred and restart counting again when the expected event occurs.

   User       Event   Day           count_events
    1          C     2019-01-10           0
    1          B     2019-01-11           1 
    1          D     2019-01-12           0 
    1          A     2019-01-13           1
    2          D     2019-01-10           0
    2          B     2019-01-11           1 
    2          C     2019-01-12           0
    2          D     2019-01-13           1 
    2          A     2019-01-14           2
    2          E     2019-01-15           0

I don't see how your your results are related to what you are asking. Why does 1/B have a count of "1" for instance? — Gordon Linoff, Oct 22 '19 at 16:26
I am confused, why not? As I stated I would like count of certain events occurrence until specific events and stop counting. When certain event occurs start counting again. — beginner, Oct 22 '19 at 16:28
Agree with Gordon - logic is not clear - can you please explain at least few counts from your expected output so we have better chance to help you — Mikhail Berlyant, Oct 22 '19 at 17:47
Sorry for not being able to explain it clearly. For each user, I want to count the occurrence of C or D events until A or B events occur. For example for one user; events are C D D A E A C B C D B, I would count A = 3 (because C and D events happened before A). then reset, there is no C or D event before second time A happened. Then B event happens counts one time C which makes B=1, when the second time B event occurs before that C and D happened again so count them the last B=2. I dont wanna sum over. I want to reset after each time A or B event happened. — beginner, Oct 22 '19 at 17:54
in this case - why the last line in your expected output for 2/E is 2? I would expect 0 instead. See my answer then — Mikhail Berlyant, Oct 22 '19 at 18:05

Mikhail Berlyant · Accepted Answer · 2019-10-22T20:22:26.127

1

Below is for BigQuery Standard SQL

#standardSQL
SELECT * EXCEPT(grp), 
  COUNTIF(event IN ('C', 'D')) 
  OVER(PARTITION BY user, grp ORDER BY day ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) count_events
FROM (
  SELECT *, 
    COUNTIF(event IN ('A', 'B')) 
      OVER(PARTITION BY user ORDER BY day ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING ) grp
  FROM `project.dataset.table`
)

if to apply to sample data in your question - result is

Row user    event   day         count_events     
1   1       C       2019-01-10  0    
2   1       B       2019-01-11  1    
3   1       D       2019-01-12  0    
4   1       A       2019-01-13  1    
5   2       D       2019-01-10  0    
6   2       B       2019-01-11  1    
7   2       C       2019-01-12  0    
8   2       D       2019-01-13  1    
9   2       A       2019-01-14  2    
10  2       E       2019-01-15  0

You can test, play with above using below

#standardSQL
WITH `project.dataset.table` AS (
  SELECT 1 user, 'C' event, DATE '2019-01-10' day UNION ALL
  SELECT 1, 'B', '2019-01-11' UNION ALL
  SELECT 1, 'D', '2019-01-12' UNION ALL
  SELECT 1, 'A', '2019-01-13' UNION ALL
  SELECT 2, 'D', '2019-01-10' UNION ALL
  SELECT 2, 'B', '2019-01-11' UNION ALL
  SELECT 2, 'C', '2019-01-12' UNION ALL
  SELECT 2, 'D', '2019-01-13' UNION ALL
  SELECT 2, 'A', '2019-01-14' UNION ALL
  SELECT 2, 'E', '2019-01-15' 
)
SELECT * EXCEPT(grp), 
  COUNTIF(event IN ('C', 'D')) 
  OVER(PARTITION BY user, grp ORDER BY day ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) count_events
FROM (
  SELECT *, 
    COUNTIF(event IN ('A', 'B')) 
      OVER(PARTITION BY user ORDER BY day ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING ) grp
  FROM `project.dataset.table`
)
-- ORDER BY user, day

I don't want to count last event C because it didn't occur before event A but after event A

Below is quick "fix"

#standardSQL
SELECT * EXCEPT(grp),   
  COUNTIF(event IN ('A', 'B')) OVER(PARTITION BY user, grp) *
  COUNTIF(event IN ('C', 'D')) 
    OVER(PARTITION BY user, grp ORDER BY day ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) count_events
FROM (
  SELECT *, 
    COUNTIF(event IN ('A', 'B')) 
      OVER(PARTITION BY user ORDER BY day ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING ) grp
  FROM `project.dataset.table`
)
-- ORDER BY user, day

If to apply to recent example you used - result is

Row user    event   day         count_events     
1   1       C       2019-01-10  0    
2   1       B       2019-01-11  1    
3   1       D       2019-01-12  0    
4   1       A       2019-01-13  1    
5   2       D       2019-01-10  0    
6   2       B       2019-01-11  1    
7   2       C       2019-01-12  0    
8   2       D       2019-01-13  1    
9   2       A       2019-01-14  2    
10  2       C       2019-01-15  0    
11  2       E       2019-01-16  0

edited Oct 22 '19 at 20:22

answered Oct 22 '19 at 18:09

Mikhail Berlyant

165,386
8
154
230

Thank you very much for the answer, I am confused with EXCEPT(grp) line, and then you use grp, what do you mean by grp? – beginner Oct 22 '19 at 18:56
1) grp is used to reset count as you want after any occurrence of A or B 2) EXCEPT(grp) just eliminates this intermediate field from final output. Hope this helps – Mikhail Berlyant Oct 22 '19 at 18:58
Is grp in SQL documentation? I didn't know it. With this solution I receive "Duplicate column names in the result are not supported. Found duplicate(s): user, event, day" error. How can I solve this? do I need to create two tables? – beginner Oct 22 '19 at 19:18
instead of `project.dataset.table` - put your own table and then run query as is! if you are not modifying anyhow provided query - there is no way to get the error you reference! and I am not sure why you would expect to see grp in documentation - thisis just simple alias of field - you could name it any way - like `abc` for example. – Mikhail Berlyant Oct 22 '19 at 19:20
Oh okay, I was not sure about usage of grp, now it is clearer. Of course I changed the table name. I don't know why I receive the error too. – beginner Oct 22 '19 at 19:23
most likely you accidentally modified / adjusted query - please check carefully and try again. I also added runnable example to answer - so you can test with dummy data – Mikhail Berlyant Oct 22 '19 at 19:25
Okay, I am sorry I have added columns after star which caused the error, I removed them and there is no error now. Thank you for your quick replies! – beginner Oct 22 '19 at 19:26
One problem though, if you add 10th row as user 2 event C, it counts events C and after event A or B occurred too. Is there a way to count them only before A and B events? Can it be because of UNBOUNDED PRECEDING AND 1 PRECEDING? – beginner Oct 22 '19 at 19:48
i am not sure what you mean, so far provided query produces result as per logic you described. and if I replace E to C in 10th row - it shows 0 as count for that row as expected – Mikhail Berlyant Oct 22 '19 at 19:52
Keep E as a last row and before that add the event C you will see count_event column will be 1 in the last row. – beginner Oct 22 '19 at 19:53
yap. as expected. each A or B resent counts so if i add 10th row as `2, 'C', '2019-01-15'` and change last (11th) row to `2, 'E', '2019-01-16'` i am getting 0 for 10th row and 1 for 11th row - as expected – Mikhail Berlyant Oct 22 '19 at 19:56
exactly, i am getting same - and it exactly follows the logic you provided! – Mikhail Berlyant Oct 22 '19 at 19:59
Okay, maybe the logic was not clear sorry I am not good at expressing which makes it difficult. I don't want to count last event C because it didn't occur before event A but after event A. – beginner Oct 22 '19 at 20:02
sorry, I couldn't see the update. Where did you change exactly? – beginner Oct 22 '19 at 20:19
looks like I missed clicking Save - check now – Mikhail Berlyant Oct 22 '19 at 20:23
Thank you very much! really appreciated all effort in a very short time, have a nice day! – beginner Oct 22 '19 at 20:36
sure. no problem. come back :o) but please be more specific with explaining your logic next time :o) – Mikhail Berlyant Oct 22 '19 at 20:37

Count number of events before certain events until another event is encountered in the same group in Big query?

1 Answers1