1

My scenario requires me to look at sessions that are less than 60 seconds apart as the same session.

data is like below.

Min_Timestamp                Max_Timestamp                Device_ID  Session_ID  Prev_Max_Timestamp           Diff_Sec
2019-12-03 23:05:30.416 UTC  2019-12-03 23:09:13.502 UTC  AAAAA      I90HYTRFJI  null                         null
2019-12-03 23:09:21.517 UTC  2019-12-03 23:09:53.353 UTC  AAAAA      98UHIGSNJR  2019-12-03 23:09:13.502 UTC  8
2019-12-03 00:00:28.933 UTC  2019-12-03 00:09:03.473 UTC  BBBBB      32QE8Y76TG  null                         null
2019-12-03 00:09:19.106 UTC  2019-12-03 00:23:26.554 UTC  BBBBB      R4GUY432AD  2019-12-03 00:09:03.473 UTC  16
2019-12-03 00:23:26.818 UTC  2019-12-03 00:23:26.837 UTC  BBBBB      E32GUYE328  2019-12-03 00:23:26.554 UTC  0
2019-12-03 17:00:32.160 UTC  2019-12-03 17:03:48.758 UTC  BBBBB      GY1EW32876  2019-12-03 00:23:26.837 UTC  59825
2019-12-03 17:03:58.069 UTC  2019-12-03 17:17:12.408 UTC  BBBBB      2876T128Y7  2019-12-03 17:03:48.758 UTC  9
2019-12-03 17:18:24.528 UTC  2019-12-03 17:18:27.516 UTC  BBBBB      098U6598U5  2019-12-03 17:17:12.408 UTC  73
2019-12-03 16:30:29.970 UTC  2019-12-03 18:44:18.972 UTC  CCCCC      UWI4UII2J4  null                         null
2019-12-04 17:32:19.285 UTC  2019-12-04 17:32:24.668 UTC  CCCCC      G3247ROIUH  2019-12-03 18:44:18.972 UTC  82080

Group the sessions together that are less than 60 seconds apart while still separate by device. It will look like this.

Min_Timestamp                Max_Timestamp                Device_ID  Session_ID  Prev_Max_Timestamp           Diff_Sec
2019-12-03 23:05:30.416 UTC  2019-12-03 23:09:13.502 UTC  AAAAA      I90HYTRFJI  null                         null
2019-12-03 23:09:21.517 UTC  2019-12-03 23:09:53.353 UTC  AAAAA      98UHIGSNJR  2019-12-03 23:09:13.502 UTC  8

2019-12-03 00:00:28.933 UTC  2019-12-03 00:09:03.473 UTC  BBBBB      32QE8Y76TG  null                         null
2019-12-03 00:09:19.106 UTC  2019-12-03 00:23:26.554 UTC  BBBBB      R4GUY432AD  2019-12-03 00:09:03.473 UTC  16
2019-12-03 00:23:26.818 UTC  2019-12-03 00:23:26.837 UTC  BBBBB      E32GUYE328  2019-12-03 00:23:26.554 UTC  0

2019-12-03 17:00:32.160 UTC  2019-12-03 17:03:48.758 UTC  BBBBB      GY1EW32876  2019-12-03 00:23:26.837 UTC  59825
2019-12-03 17:03:58.069 UTC  2019-12-03 17:17:12.408 UTC  BBBBB      2876T128Y7  2019-12-03 17:03:48.758 UTC  9
2019-12-03 17:18:24.528 UTC  2019-12-03 17:18:27.516 UTC  BBBBB      098U6598U5  2019-12-03 17:17:12.408 UTC  73

2019-12-03 16:30:29.970 UTC  2019-12-03 18:44:18.972 UTC  CCCCC      UWI4UII2J4  null                         null

2019-12-04 17:32:19.285 UTC  2019-12-04 17:32:24.668 UTC  CCCCC      G3247ROIUH  2019-12-03 18:44:18.972 UTC  82080

I want to be able to get something that look like this. Session_ID does not need to be like A1, B1, C1, etc. It can simply be the first value of the session. Notice the Max_Timestamp of the latest one is now the new Max_Timestamp.

Min_Timestamp                Max_Timestamp                Device_ID  Session_ID
2019-12-03 23:05:30.416 UTC  2019-12-03 23:09:53.353 UTC  AAAAA      A1          
2019-12-03 00:00:28.933 UTC  2019-12-03 00:23:26.837 UTC  BBBBB      B1
2019-12-03 17:00:32.160 UTC  2019-12-03 17:18:27.516 UTC  BBBBB      B2
2019-12-03 16:30:29.970 UTC  2019-12-03 18:44:18.972 UTC  CCCCC      C1
2019-12-04 17:32:19.285 UTC  2019-12-04 17:32:24.668 UTC  CCCCC      C2

My idea was to make all Session_ID that belong to the same group the same. Then group by Device_ID and Session_ID to get min(Min_Timestamp) and max(Max_Timestamp). I tried to fiddle with first_value() on Session_ID, but I can't figure out how to partition it correctly.

Best to achieve this in legacy. if not, standard will work,too.

Colton Pan
  • 25
  • 1
  • 6
  • have you tried anything by yourself? can you show some efforts? it is relatively simple and in addition you should be able to find similar questions here on SO! have you tried to do a search? – Mikhail Berlyant Dec 05 '19 at 01:22
  • I have. I mentioned in the question that I tried to use first_value() on Session_ID and partition it, but I can't figure out how to partition it correctly. I have already searched on here. Could be my keywords for the searches were not right. Mind sharing a post that contains an answer to this? – Colton Pan Dec 05 '19 at 01:30
  • to me personally - it is easier to just answer (*which most likely I will do when have time unless it will be already answered by that time*) your question than searching - but still SO etiquette expects you to apply some efforts to search similar questions (and I know for a fact there are many here) and/or to present your efforts via example of query you tried and how it is not working the way you want, etc. Otherwise such posts look like attempt of outsourcing your homework which is not welcome on SO. – Mikhail Berlyant Dec 05 '19 at 01:37
  • I very much understand that I need to make my own effort. I fully understand SO isn't a place where I copy and paste my questions and expect someone to do the job for me. My efforts were all over the place. I had no lead. I don't even know if using first_value() was the right way and I'm pretty sure it isn't. My codes that aren't working which I already know isn't the right way and not even on the way to the solution. I don't even know how to search for an answer to this question. This is the type of question where I don't really know where to start on. I can't even get close to the solution. – Colton Pan Dec 05 '19 at 01:41
  • understood! btw, why legacy is preferred for you? – Mikhail Berlyant Dec 05 '19 at 01:44
  • My workplace uses a dashboard service that is currently switched to use legacy from bigquery. It would be easier to just do the grouping/aggregation thingy within the dashboard. If this is only achievable through standard, I'll just have to get a workflow going in the back-end to populate a table with the result then query the table from the dashboard service. – Colton Pan Dec 05 '19 at 01:48

1 Answers1

1

Below is for BigQuery Standard SQL (if you want - just "translate" it to Legacy - but the advice is to migrate to Standard anyway!!! so do it now and use below)

#standardSQL
SELECT MIN(Min_Timestamp) AS Min_Timestamp, MAX(Max_Timestamp) AS Max_Timestamp, Device_ID, Session_ID
FROM (
  SELECT * EXCEPT(flag, Session_ID), 
    CONCAT(Device_ID, CAST(COUNTIF(flag) OVER(PARTITION BY Device_ID ORDER BY Max_Timestamp) AS STRING)) AS Session_ID
  FROM (
    SELECT *, 
      IFNULL(TIMESTAMP_DIFF(Min_Timestamp, LAG(Max_Timestamp) OVER(PARTITION BY Device_ID ORDER BY Max_Timestamp), SECOND), 999) > 60 flag
    FROM `project.dataset.table`
  )
)
GROUP BY Device_ID, Session_ID

You can test, play with above using sample data from your question as in below example

#standardSQL
WITH `project.dataset.table` AS (
  SELECT TIMESTAMP '2019-12-03 23:05:30.416 UTC' Min_Timestamp, TIMESTAMP '2019-12-03 23:09:13.502 UTC' Max_Timestamp, 'AAAAA' Device_ID, 'I90HYTRFJI' Session_ID UNION ALL
  SELECT '2019-12-03 23:09:21.517 UTC', '2019-12-03 23:09:53.353 UTC', 'AAAAA', '98UHIGSNJR' UNION ALL
  SELECT '2019-12-03 00:00:28.933 UTC', '2019-12-03 00:09:03.473 UTC', 'BBBBB', '32QE8Y76TG' UNION ALL
  SELECT '2019-12-03 00:09:19.106 UTC', '2019-12-03 00:23:26.554 UTC', 'BBBBB', 'R4GUY432AD' UNION ALL
  SELECT '2019-12-03 00:23:26.818 UTC', '2019-12-03 00:23:26.837 UTC', 'BBBBB', 'E32GUYE328' UNION ALL
  SELECT '2019-12-03 17:00:32.160 UTC', '2019-12-03 17:03:48.758 UTC', 'BBBBB', 'GY1EW32876' UNION ALL
  SELECT '2019-12-03 17:03:58.069 UTC', '2019-12-03 17:17:12.408 UTC', 'BBBBB', '2876T128Y7' UNION ALL
  SELECT '2019-12-03 17:18:24.528 UTC', '2019-12-03 17:18:27.516 UTC', 'BBBBB', '098U6598U5' UNION ALL
  SELECT '2019-12-03 16:30:29.970 UTC', '2019-12-03 18:44:18.972 UTC', 'CCCCC', 'UWI4UII2J4' UNION ALL
  SELECT '2019-12-04 17:32:19.285 UTC', '2019-12-04 17:32:24.668 UTC', 'CCCCC', 'G3247ROIUH' 
)
SELECT MIN(Min_Timestamp) AS Min_Timestamp, MAX(Max_Timestamp) AS Max_Timestamp, Device_ID, Session_ID
FROM (
  SELECT * EXCEPT(flag, Session_ID), 
    CONCAT(Device_ID, CAST(COUNTIF(flag) OVER(PARTITION BY Device_ID ORDER BY Max_Timestamp) AS STRING)) AS Session_ID
  FROM (
    SELECT *, 
      IFNULL(TIMESTAMP_DIFF(Min_Timestamp, LAG(Max_Timestamp) OVER(PARTITION BY Device_ID ORDER BY Max_Timestamp), SECOND), 999) > 60 flag
    FROM `project.dataset.table`
  )
)
GROUP BY Device_ID, Session_ID
-- ORDER BY Device_ID, Session_ID  

with output

Row Min_Timestamp               Max_Timestamp               Device_ID   Session_ID   
1   2019-12-03 23:05:30.416 UTC 2019-12-03 23:09:53.353 UTC AAAAA       AAAAA1   
2   2019-12-03 00:00:28.933 UTC 2019-12-03 00:23:26.837 UTC BBBBB       BBBBB1   
3   2019-12-03 17:00:32.160 UTC 2019-12-03 17:17:12.408 UTC BBBBB       BBBBB2   
4   2019-12-03 17:18:24.528 UTC 2019-12-03 17:18:27.516 UTC BBBBB       BBBBB3   
5   2019-12-03 16:30:29.970 UTC 2019-12-03 18:44:18.972 UTC CCCCC       CCCCC1   
6   2019-12-04 17:32:19.285 UTC 2019-12-04 17:32:24.668 UTC CCCCC       CCCCC2     
Mikhail Berlyant
  • 165,386
  • 8
  • 154
  • 230