3

Given some rows with duplicate names and different timestamps, I would like to select the row with the newest timestamp, if the duplicate name occurs within say, 45 minutes, of the first timestamp.

Here's what worked in PostgreSQL:

SELECT i.ts AS base_timestamp, j.ts AS newer_timestamp, i.name
FROM tbl i
LEFT JOIN LATERAL
(SELECT j.ts
 FROM tbl j
 WHERE i.name = j.name
 AND j.ts > i.ts
 AND j.ts < (i.ts + INTERVAL '45 minutes')
) j ON TRUE
WHERE j.ts is NULL

Great explanation of LATERAL here: https://heap.io/blog/engineering/postgresqls-powerful-new-join-type-lateral

LATERAL join is like a SQL foreach loop, in which PostgreSQL will iterate over each row in a result set and evaluate a subquery using that row as a parameter.

So it's like a correlated subquery, but in the join.

Then I simply take only the rows where there is no newer timestamp (WHERE j.ts is NULL).

How can I do this in BigQuery?

EDIT: I've created an example of the PostgreSQL grouping on SQLFiddle as requested in the comments.

Input:

  ('Duplication Example','2019-06-22 19:10:25'),
  ('Duplication Example','2019-06-22 23:58:31'), 
  ('Duplication Example','2019-06-23 00:08:00')

Output (middle row having timestamp 23:58:31 removed):

base_timestamp          newer_timestamp name
2019-06-22T19:10:25Z    (null)          Duplication Example
2019-06-23T00:08:00Z    (null)          Duplication Example
Martin Burch
  • 2,726
  • 4
  • 31
  • 59
  • There's something clever here about using seconds as a numeric `RANGE` clause https://medium.com/@ubethke/window-function-rows-and-range-on-redshift-and-bigquery-132cbf80c656 – Martin Burch Dec 05 '19 at 10:56
  • please clarify - are you more interested in resolving/addressing of what is in the title of your post (which is quite generic and conceptual one) or what is in the first sentence in the question body (more practical and most likely your real use case to be addressed)? those are two very different things - so please clarify! – Mikhail Berlyant Dec 05 '19 at 19:24
  • Hi @MikhailBerlyant and thanks for taking a look at my question. I'm more interested in solving my specific problem as stated in the first sentence of the question body. If the answer begins with "No, BigQuery doesn't have LATERAL, you'll need to solve your problem using ..." that would be most useful. – Martin Burch Dec 05 '19 at 22:13
  • that is what I expected - will get back to you with answer as soon as have space time :o) – Mikhail Berlyant Dec 05 '19 at 22:44
  • 2
    please add some simplified example of input data and expected output. I tried to read / analyse your case and see that it is still not clear what exactly logic you want to apply for producing result. good example of input / output will definitely help. just one name will be enough but some representative number of ts(timestamps) please – Mikhail Berlyant Dec 05 '19 at 23:49
  • Thanks for your guidance, @MikhailBerlyant ... I've now updated the question with a small example. What do you think, need more rows? – Martin Burch Dec 08 '19 at 23:15
  • in my mind - you **either** explain desired logic in all details **or** present really good example with much more than just two-three rows so we can gather the logic. example you currently have is not helping at all :o( – Mikhail Berlyant Dec 12 '19 at 23:29

2 Answers2

1

Your case looks kind of like a task for window functions. But since you seem to be interested in lateral joins more than in solving the problem you presented: In BigQuery there is afaik only an implicit version of lateral joins: when joining with unnested arrays.

This showcases the idea:

WITH t AS (
  SELECT 'a' as id, [2,3] as arr 
  UNION ALL SELECT 'b', [56, 7]
)

SELECT * EXCEPT(arr) 
FROM t LEFT JOIN UNNEST(arr)
Martin Weitzmann
  • 4,430
  • 10
  • 19
  • 1
    Hi fellow Martin! I agree, this does look like a task which could be accomplished with a window function. About the same time you submitted your answer, I added a comment linking to this article: https://medium.com/@ubethke/window-function-rows-and-range-on-redshift-and-bigquery-132cbf80c656 ... but I'm still not clear on how exactly to do it. Your answer seems to be more of a CROSS JOIN replacement: `WITH t AS ( SELECT 'a' as id, [2,3] AS arr UNION ALL SELECT 'b', [56, 7] AS arr ) SELECT id, arr FROM t CROSS JOIN t.arr` – Martin Burch Dec 05 '19 at 11:16
  • 1
    Not sure what you mean, but a Cross Join on an unnested array is also an implicit lateral join. – Martin Weitzmann Dec 05 '19 at 15:57
  • 1
    To be more clear: if you think this requires a window function in BigQuery, please show a query with a window function. I am interested in solving the problem as stated in the question, and I don't see how to do it with the query example you've provided. – Martin Burch Dec 05 '19 at 16:14
  • 1
    As mentioned in the post, I didn't solve your problem but rather explained lateral joins in BigQuery. All good. – Martin Weitzmann Dec 05 '19 at 16:21
1

This can be archived with a WINDOW function.

SELECT
  name,
  MAX(timestamp) AS timestamp_new
FROM
(
  SELECT 
    i.name,
    COUNT(*) OVER (PARTITION BY i.name ORDER BY i.ts RANGE BETWEEN 45 * 60 * 1000 PRECEDING AND CURRENT ROW) as 45_window_count,
    i.ts AS timestamp
  FROM 
    tbl i
)
WHERE 45_window_count > 1
GROUP BY user
DinushaNT
  • 1,137
  • 6
  • 17
  • This seems like it will need a correction to the numeric range, as 45,000,000 milliseconds is more than 12 hours. The correct calculation is more like 45 * 60 * 1000, no? 45 minutes * 60 seconds in a minute * 1000 milliseconds in a second. – Martin Burch Dec 20 '19 at 11:51
  • 1
    Correct. Updated the answer. – DinushaNT Dec 20 '19 at 11:53