1

I have a question that a little bit similar with question#66044663 but more complicated.

Here's my dummy data.

enter image description here

I want to get 3 adjacent actions(no duplicate) from the flag by each user.

Here's the chart to describe my thought.

enter image description here

Here's what I want:

enter image description here

How can I implement SQL(I use Google Bigquery)? I know the function LAG could be a solution but I have no idea how to avoid the duplicate actions.

Hope someone can light me up. Thanks a million!

Here's the code for generating the dataset.

WITH
src_table AS (
SELECT 'Jack' AS User, 1 AS Sequence, 'Eat' AS Action, '' AS Flag UNION ALL
SELECT 'Jack' AS User, 2 AS Sequence, 'Work' AS Action, '' AS Flag UNION ALL
SELECT 'Jack' AS User, 3 AS Sequence, 'Sleep' AS Action, 'Flag A' AS Flag UNION ALL
SELECT 'Jack' AS User, 4 AS Sequence, 'Exercise' AS Action, 'Flag B' AS Flag UNION ALL
SELECT 'Kenny' AS User, 1 AS Sequence, 'Run' AS Action, '' AS Flag UNION ALL
SELECT 'Kenny' AS User, 2 AS Sequence, 'Eat' AS Action, '' AS Flag UNION ALL
SELECT 'Kenny' AS User, 3 AS Sequence, 'Eat' AS Action, '' AS Flag UNION ALL
SELECT 'Kenny' AS User, 4 AS Sequence, 'Work' AS Action, 'Flag C' AS Flag UNION ALL
SELECT 'Kenny' AS User, 5 AS Sequence, 'Work' AS Action, 'Flag D' AS Flag UNION ALL
SELECT 'May' AS User, 1 AS Sequence, 'Work' AS Action, 'Flag A' AS Flag
)
LucasLee
  • 75
  • 6

3 Answers3

1

Consider below

select user, actions.action_sequence, flag  from (
  select *, (
    select as struct count(1) actions_count,
      string_agg(action, ' >> ' order by grp) action_sequence
    from (
      select action, grp from t.arr group by action, grp
    )) actions
  from (
    select *, array_agg(struct(action, grp)) 
      over(partition by user order by grp desc range between current row and 2 following) arr
    from (
      select *, countif(change) over(partition by user order by sequence) grp
      from (
        select *, action != lag(action) over(partition by user order by sequence) change
        from src_table
      )
    )
  ) t
)
where flag != '' 
and actions.actions_count = 3
# order by user, sequence

If to apply to sample data in your question - output is

enter image description here

NOTE: above solution works for any number of adjacent actions (no duplicate) - you just need to change it (2 and 3) in two respective places

over(partition by user order by grp desc range between current row and 2 following) arr    

and

and actions.actions_count = 3   
Mikhail Berlyant
  • 165,386
  • 8
  • 154
  • 230
  • 1
    I like this answer because it's flexible and I can get right answer directly through the code you provide. What's more, I can learned a lot from the answer. RANGE statement is an advanced skill to me, so I spent much time to understand your code that's why I comment you answer so late. Thanks for your reply! – LucasLee Feb 07 '21 at 10:21
0

You can use RANK() to order your duplicates, then filter on RANK() = 1 to get the first (or last) of each duplicates. Then the problem is reduced to the one from the other question you refer to.

mstrzelc
  • 1
  • 2
  • Thanks for the reply. I am wondering how to rank the duplicate actions from flag C and D through the original code? – LucasLee Feb 05 '21 at 09:47
  • Would filtering on `sequence` not greater than the one corresponding to the given flag do? – mstrzelc Feb 05 '21 at 09:58
0

This is similar to your previous query. If I assume that adjacent rows with the same action have at most one flag, then we can use a gaps-and-islands approach . . . and then lag.

The first step is:

select user, min(sequence) as seqnuence, action, max(flag) as flag
from (select t.*,
             row_number() over (partition by user order by sequence) as seqnum
      from t
     ) t
group by user, sequence - seqnum;

Then, with this as the "base" data, we can use lags:

with cte as (
      select user, min(sequence) as seqnuence, action, max(flag) as flag
      from (select t.*,
                   row_number() over (partition by user order by sequence) as seqnum
            from t
           ) t
      group by user, sequence - seqnum
     )
select user, prev_action, prev_action_2, action, flag
from (select t.*,
             lag(action) over (partition by user order by sequence) as prev_action,
             lag(action, 2) over (partition by user order by sequence) as prev_action2
      from t
     ) t
where prev_action is not null;

If users with the same activity can have different flags, I would appreciate if you would ask a new question. In the new question, it would be helpful if you would include SELECT statements to generate the sample data being used.

Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
  • I encountered error when I am using the first section. The error message is: Column flag contains an aggregation function, which is not allowed in GROUP BY. BTW, I added the SELECT statements for your reference in my question :) – LucasLee Feb 05 '21 at 13:48
  • @LucasLee . . . `flag` should not have been in the `group by`. – Gordon Linoff Feb 05 '21 at 14:06