Given some rows with duplicate name
s and different timestamp
s, I would like to select the row with the newest timestamp
, if the duplicate name
occurs within say, 45 minutes, of the first timestamp
.
Here's what worked in PostgreSQL:
SELECT i.ts AS base_timestamp, j.ts AS newer_timestamp, i.name
FROM tbl i
LEFT JOIN LATERAL
(SELECT j.ts
FROM tbl j
WHERE i.name = j.name
AND j.ts > i.ts
AND j.ts < (i.ts + INTERVAL '45 minutes')
) j ON TRUE
WHERE j.ts is NULL
Great explanation of LATERAL
here:
https://heap.io/blog/engineering/postgresqls-powerful-new-join-type-lateral
LATERAL join is like a SQL foreach loop, in which PostgreSQL will iterate over each row in a result set and evaluate a subquery using that row as a parameter.
So it's like a correlated subquery, but in the join.
Then I simply take only the rows where there is no newer timestamp (WHERE j.ts is NULL
).
How can I do this in BigQuery?
EDIT: I've created an example of the PostgreSQL grouping on SQLFiddle as requested in the comments.
Input:
('Duplication Example','2019-06-22 19:10:25'),
('Duplication Example','2019-06-22 23:58:31'),
('Duplication Example','2019-06-23 00:08:00')
Output (middle row having timestamp 23:58:31 removed):
base_timestamp newer_timestamp name
2019-06-22T19:10:25Z (null) Duplication Example
2019-06-23T00:08:00Z (null) Duplication Example