Return only the newest rows from a BigQuery table with a duplicate items

Question

I have a table with many duplicate items – Many rows with the same id, perhaps with the only difference being a requested_at column.

I'd like to do a select * from the table, but only return one row with the same id – the most recently requested.

I've looked into group by id but then I need to do an aggregate for each column. This is easy with requested_at – max(requested_at) as requested_at – but the others are tough.

How do I make sure I get the value for title, etc that corresponds to that most recently updated row?

score 11 · Accepted Answer · answered Dec 08 '15 at 20:23

I suggest a similar form that avoids a sort in the window function:

SELECT *
    FROM (
      SELECT
          *,
          MAX(<timestamp_column>)
              OVER (PARTITION BY <id_column>)
              AS max_timestamp,
      FROM <table>
    )
    WHERE <timestamp_column> = max_timestamp

score 3 · Answer 2 · answered Dec 08 '15 at 20:16

Try something like this:

    SELECT *
    FROM (
      SELECT
          *,
          ROW_NUMBER()
              OVER (
                  PARTITION BY <id_column>
                  ORDER BY <timestamp column> DESC)
              row_number,
      FROM <table>
    )
    WHERE row_number = 1

Note it will add a row_number column, which you might not want. To fix this, you can select individual columns by name in the outer select statement.

In your case, it sounds like the requested_at column is the one you want to use in the ORDER BY.

And, you will also want to use allow_large_results, set a destination table, and specify no flattening of results (if you have a schema with repeated fields).

Return only the newest rows from a BigQuery table with a duplicate items

2 Answers2

Linked