3

I have a BigQuery table with ~5k unique IDs. Every day new rows are inserted for IDs that may or may not already exist.

We use this query to find the most recent rows:

SELECT t.*
EXCEPT (seqnum),
FROM (SELECT t.*,
             ROW_NUMBER() OVER (PARTITION BY id
                                ORDER BY date_of_data DESC
                               ) as seqnum
      FROM `[project]`.[dataset].[table] t
     ) t
WHERE seqnum = 1

Although we only want the most recent row for each ID, this query must scan the entire table. This query is slower and more expensive every day as the table size grows. Right now, for an 8GB table, the query above creates a 22MB table. We would much rather query the 22MB table if it could stay up-to-date.

Is it possible to create a materialized view that gets the latest rows for each ID?

Is there a better solution than growing tables to infinity?

Other requirements:

ProGirlXOXO
  • 2,170
  • 6
  • 25
  • 47
  • Use `_PARTITIONTIME` – drum Apr 24 '21 at 00:46
  • @drum Can you please elaborate? – ProGirlXOXO Apr 24 '21 at 01:20
  • Partition your tables by ingestion time and _PARTITIONTIME column will become available to you. It's literally ingestion time so you can tell more or less which ones are the new records. – drum Apr 24 '21 at 02:06
  • Thanks. The table is already partitioned by `date_of_data`. I'm looking for the newest rows *per id*. I'm not just looking for the newest records, I'm looking for the newest records per id. So one could be in the newest records but if another id has failed to update for years then it would be from 2018, etc. – ProGirlXOXO Apr 26 '21 at 19:12

1 Answers1

1

One of the solutions would be to partition your main table (with all rows) by column date_of_data with a daily granularity.

Create a separate table which will keep only the most recent row for each ID. Populate it once with a single scan of entire main table and then update it every day by querying only the last day of the main table. Thanks to the partitioning querying the last day of the main table will scan only the last day of the main table.

Sergey Geron
  • 9,098
  • 2
  • 22
  • 29