1

I need to group the data with 1 minute interval for the chain of actions. My data looks like this:

id    MetroId            Time             ActionName            refererurl
111     a          2020-09-01-09:19:00     First           www.stackoverflow/a12345
111     b         2020-09-01-12:36:54      First           www.stackoverflow/a12345
111     f         2020-09-01-12:36:56      First     www.stackoverflow/xxxx
111     b         2020-09-01-12:36:58      Midpoint        www.stackoverflow/a12345
111     f         2020-09-01-12:37:01      Midpoint    www.stackoverflow/xxx
111     b          2020-09-01-12:37:03     Third           www.stackoverflow/a12345
111     b          2020-09-01-12:37:09     Complete        www.stackoverflow/a12345
222     d          2020-09-01-15:17:44     First           www.stackoverflow/a2222
222     d          2020-09-01-15:17:48     Midpoint        www.stackoverflow/a2222
222     d          2020-09-01-15:18:05     Third           www.stackoverflow/a2222

I need to grab the data with the following condition: if x_id and x_url has Complete value for action_name column, grab that. If it doesn't have Complete then grab Third and so on.

  ARRAY_AGG(current_query_result 
    ORDER BY CASE ActionName
      WHEN 'Complete' THEN 1
      WHEN 'Third' THEN 2
      WHEN 'Midpoint' THEN 3
      WHEN 'First' THEN 4
    END
    LIMIT 1
  )[OFFSET(0)]
FROM
    (
        SELECT d.id, c.Time, c.ActionName, c.refererurl, c.MetroId
        FROM
            `bq_query_table_c` c
            INNER JOIN `bq_table_d` d ON d.id = c.CreativeId
        WHERE
            c.refererurl LIKE "https://www.stackoverflow/%"
            AND c.ActionName in ('First', 'Midpoint', 'Third', 'Complete')
    ) current_query_result
GROUP BY
    id,
    refererurl,
    MetroId 
    TIMESTAMP_SUB(
    PARSE_TIMESTAMP('%Y-%m-%d-%H:%M:%S', time), 
    INTERVAL MOD(UNIX_SECONDS(PARSE_TIMESTAMP('%Y-%m-%d-%H:%M:%S', time)), 1 * 60) 
    SECOND
  ) 

Desired output:

id   MetroId         Time             ActionName            refererurl
111      a     2020-09-01-09:19:00     First           www.stackoverflow/a12345
111     f     2020-09-01-12:37:01      Midpoint    www.stackoverflow/xxx
111     b     2020-09-01-12:37:09     Complete        www.stackoverflow/a12345
222     c      2020-09-01-15:18:05     Third           www.stackoverflow/a2222
Chique_Code
  • 1,422
  • 3
  • 23
  • 49
  • so for example if there are 10 actions each 50 sec (so the total 10x50=500sec) it is treated as a one group and the last action in this group should be taken - correct? please confirm or clarify further – Mikhail Berlyant Sep 18 '20 at 18:29
  • Correct, there are 4 actions, each last for about 5 secs. All 4 usually take not more than a minute. And those 4 actions are treated as one group. There can be multiple actions for the same id that are happening at the same time, but I am adding additional grouping for that purpose. Sorry if I confused you even further... – Chique_Code Sep 18 '20 at 18:36
  • to be honest - it is still not clear to me - you should present better example of input and output data with more groups and more actions in them so the answer will really address your use case rather than current very simplified example in your question. the problem for me is that for different ways of interpreting your question - i would have different answers - but I don't want to shoot in air just in sake of answering - i want first to really understand your case – Mikhail Berlyant Sep 18 '20 at 18:49

2 Answers2

1

This reads like a gaps-and-islands problem, where a gaps is greater than 1 minute, and islands represents the "chains of actions".

I would start by building groups that represent the islands: for this, you can use lag() to retrieve the previous action time, and a cumulative sum that increments for every gap of 1 minute or more between two consecutive actions:

select t.*, 
    sum(case when time > timestamp_add(lag_time, interval 1 minute) then 1 else 0 end)
        over(partition by x_id, x_url order by time) grp
from (
    select d.id, c.time, c.actionname, c.refererurl, 
        lag(time) over(partition by id, refererurl order by time) lag_time
    from `bq_query_table_c` c
    inner join `bq_table_d` d on d.id = c.creativeid
    where c.refererurl like "https://www.stackoverflow/%"
        and c.actionname in ('First', 'Midpoint', 'Third', 'Complete')
) t

grp is the island identifier.

From there on, we can use your original logic that filters the preferred action per group. We don't need to aggregate by 1 minute intervals - we can use grp instead:

select   
    array_agg(t) order by case actionname
        when 'Complete' then 1 
        when 'Third'    then 2
        when 'midpoint' then 3
        when 'first'    then 4
    end limit 1)[offset(0)]
from (
    select t.*, 
        sum(case when time > timestamp_add(lag_time, interval 1 minute) then 1 else 0 end)
            over(partition by x_id, x_url order by time) grp
    from (
        select d.id, c.time, c.actionname, c.refererurl, 
            lag(time) over(partition by id, refererurl order by time) lag_time
        from `bq_query_table_c` c
        inner join `bq_table_d` d on d.id = c.creativeid
        where c.refererurl like "https://www.stackoverflow/%"
            and c.actionname in ('First', 'Midpoint', 'Third', 'Complete')
    ) t
) t
group by id, refererurl, grp

Note that if there are, say, two "Complete" actions on a single island, it is undefined which one will be picked (you original query has pretty much the same flaw). To make the results deterministic, you want to add another sorting criteria to ARRAY_AGG(), like time for example:

    array_agg(t) order by case actionname
        when 'Complete' then 1 
        when 'Third'    then 2
        when 'midpoint' then 3
        when 'first'    then 4
    end, time limit 1)[offset(0)]
GMB
  • 216,147
  • 25
  • 84
  • 135
  • Hi, thank you for your response, I am getting an error `Unrecognized name: time` here `sum(case when time > timestamp_add` – Chique_Code Sep 22 '20 at 13:25
1

Below is for BigQuery Standard SQL

#standardSQL
WITH temp AS (
  SELECT *, PARSE_TIMESTAMP('%Y-%m-%d-%H:%M:%S', time) ts
  FROM `project.dataset.bq_table`
)
SELECT * EXCEPT (ts, time_lag) FROM (
  SELECT * ,
    TIMESTAMP_DIFF(LEAD(ts) OVER(PARTITION BY id ORDER BY ts), ts, SECOND) time_lag
  FROM (
    SELECT 
      AS VALUE ARRAY_AGG(t 
        ORDER BY STRPOS('First,Midpoint,Third,Complete',action_name) DESC 
        LIMIT 1
      )[OFFSET(0)]
    FROM temp t
    WHERE action_name IN ('First', 'Midpoint', 'Third', 'Complete')
    GROUP BY id, url, 
      TIMESTAMP_SUB(ts, INTERVAL MOD(UNIX_SECONDS(ts), 60) SECOND
      )   
  )
)
WHERE NOT IFNULL(time_lag, 777) < 60    

You can test, play with above using sample data from your question as in below example

#standardSQL
WITH `project.dataset.bq_table` AS (
  SELECT 111 id, '2020-09-01-09:19:00' time, 'First' action_name, 'www.stackoverflow/a12345' url UNION ALL
  SELECT 111, '2020-09-01-12:36:54', 'First', 'www.stackoverflow/a12345' UNION ALL
  SELECT 111, '2020-09-01-12:36:58', 'Midpoint', 'www.stackoverflow/a12345' UNION ALL
  SELECT 111, '2020-09-01-12:37:03', 'Third', 'www.stackoverflow/a12345' UNION ALL
  SELECT 111, '2020-09-01-12:37:09', 'Complete', 'www.stackoverflow/a12345' UNION ALL
  SELECT 222, '2020-09-01-15:17:44', 'First', 'www.stackoverflow/a2222' UNION ALL
  SELECT 222, '2020-09-01-15:17:48', 'Midpoint', 'www.stackoverflow/a2222' UNION ALL
  SELECT 222, '2020-09-01-15:18:05', 'Third', 'www.stackoverflow/a2222' 
), temp AS (
  SELECT *, PARSE_TIMESTAMP('%Y-%m-%d-%H:%M:%S', time) ts
  FROM `project.dataset.bq_table`
)
SELECT * EXCEPT (ts, time_lag) FROM (
  SELECT * ,
    TIMESTAMP_DIFF(LEAD(ts) OVER(PARTITION BY id ORDER BY ts), ts, SECOND) time_lag
  FROM (
    SELECT 
      AS VALUE ARRAY_AGG(t 
        ORDER BY STRPOS('First,Midpoint,Third,Complete',action_name) DESC 
        LIMIT 1
      )[OFFSET(0)]
    FROM temp t
    WHERE action_name IN ('First', 'Midpoint', 'Third', 'Complete')
    GROUP BY id, url, 
      TIMESTAMP_SUB(ts, INTERVAL MOD(UNIX_SECONDS(ts), 60) SECOND
      )   
  )
)
WHERE NOT IFNULL(time_lag, 777) < 60   

with result

Row     id      time                    action_name     url  
1       111     2020-09-01-09:19:00     First           www.stackoverflow/a12345     
2       111     2020-09-01-12:37:09     Complete        www.stackoverflow/a12345     
3       222     2020-09-01-15:18:05     Third           www.stackoverflow/a2222    

Note: I am still not 100% sure about your use case - but above is based on what discussed / commented so far

Mikhail Berlyant
  • 165,386
  • 8
  • 154
  • 230
  • Thank you very much, I will test it on Monday. Greatly appreciated! – Chique_Code Sep 18 '20 at 19:10
  • I have tested the code on the sample of the data you provided. The output is exactly what I need. There was some confusion in regards to requirements. You asked for more examples of input and output - I have added those as well as an additional grouping for `MetroId` filed, this should give you the full picture. This is an entire code that I have. While your code does work with the sample I can't use it, as it doesn't include `INNER JOIN `bq_table_d` d ON d.id = c.CreativeId`. I tried every which way to include this line but can't figure it out. This is way above my head :((( – Chique_Code Sep 22 '20 at 14:23
  • in your question - there is only one table presented - bq_table - and logic does not mention any use of more than this table - so that is what is used in my answer. I have no idea what are those extra tables you mentioning. In any case - I consider your original question answered. – Mikhail Berlyant Sep 22 '20 at 15:59
  • There are two tables. ```FROM `bq_query_table_c` c INNER JOIN `bq_table_d` d ON d.id = c.CreativeId``` table c and table d. It was in the original post. I consider this question answered as well, but, unfortunately, it doesn't work with the INNER JOIN that I need to perform in my query. Sadly. Thank you for your time. – Chique_Code Sep 22 '20 at 17:31
  • just to give you an idea of how your question looks from the point of view of those like me who answer your question - you presented 1) sample of table 2) logic 3) expected result - that's all i consider when answer your question - everything else in your question is a noise that I filter out as it is not consistent with the rest of items - hope this will help you to better tailor your question next time :o) – Mikhail Berlyant Sep 22 '20 at 17:49
  • 1
    I appreciate your feedback, thank you, I will try to find a way to insert INNER JOIN, and hopefully, it would work. – Chique_Code Sep 22 '20 at 17:58