
We currently have a proposed architecture for dealing with the append-only nature of BigQuery, to support the dashboard and reporting functions in our system. We have a number of customer tables (tenants) - let's say one table per customer. Each table contains a BigQuery 'mirror' of a primary NoSQL data store - a list of records. We want to use BigQuery as the tool that drives the reporting and dashboards of these events for customers.

Customers may make updates to these records - these updates would need to be reflected in BigQuery. We are looking at maintaining a secondary table, for each primary table, called "Reconcile" that stores the updates (or deletions) from the last, say, 10 minutes. Then, every 10 minutes, we would select all the data from the primary table minus the rows present in the "Reconcile" table, plus all the data from the "Reconcile" table, to create a new, up-to-date table.

(Old Table - 'Reconcile' Rows) + Reconcile => New Table
Drop Old Table, Truncate Reconcile Table
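A minimal sketch of that rematerialization step is below. It assumes each record has a unique `id` column and that "Reconcile" shares the primary table's schema plus an `is_deleted` flag for deletions; all table and column names are illustrative (standard SQL). The result would be written to a new destination table and swapped in for the old one, then Reconcile truncated.

```sql
-- Rows from the primary table that were NOT touched in the last 10 minutes
SELECT t.*
FROM `project.dataset.customer_events` AS t
LEFT JOIN `project.dataset.customer_events_reconcile` AS r
  ON t.id = r.id
WHERE r.id IS NULL

UNION ALL

-- Add back the updated rows, dropping deleted ones and the marker column
SELECT * EXCEPT(is_deleted)
FROM `project.dataset.customer_events_reconcile`
WHERE NOT is_deleted
```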

Our challenge here is that some customer tables can be 10, 20, or 30 GB in size. The rematerialization process, run every 10 minutes, could quickly become expensive.

Is there a more cost- or technically efficient way to copy tables in BigQuery without incurring this large operation (and its associated cost)? We don't really want to use DML, and we don't want to have to scope every query to search for the 'latest' row for each record (although the answer to "Does Google BigQuery/ Amazon Redshift use column-based relational database or NoSQL database?" does a good job of explaining that approach).
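For reference, "scoping to the latest row" as described in that linked answer means saving a view along roughly these lines and pointing queries at it instead of at the base table. `id` and `updated_at` are illustrative column names (standard SQL).

```sql
-- View body, saved as e.g. `customer_events_latest`: keep only the most
-- recent version of each record, ranked by update timestamp.
SELECT * EXCEPT(seqnum)
FROM (
  SELECT *,
    ROW_NUMBER() OVER (PARTITION BY id ORDER BY updated_at DESC) AS seqnum
  FROM `project.dataset.customer_events`
)
WHERE seqnum = 1
```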

  • We simply set up a `view` for every table that "scopes" it to the latest row; technically it's quite easy. `table_latest` is always available and we just change the table name in the query. Running every 10 minutes is quite aggressive; I would rather run weekly - it also depends on the usage. – Pentium10 Oct 20 '16 at 07:33
  • Do you not do rematerialisation then? We suggested 10 mins as people are using BQ for reports and that's a fairly acceptable delay for reporting. With your scoping approach presumably you just keep the old data around? – user3707 Oct 20 '16 at 09:54
  • We don't do rematerialization; we have been running for 2 years and we keep old records - we have even replayed some old records with extended data. Costs are still low and it isn't worth the effort on our end. Storage prices are quite low. – Pentium10 Oct 20 '16 at 10:00
  • We also have a report to improve UX/UI so that our users can do things in fewer steps; based on metrics we find they do things multiple times. We deduce this from the count of updates on the same event. – Pentium10 Oct 20 '16 at 10:03
  • Are you able to share how you constructed your view to only show the latest 'versions' of your records? Also, out of interest, how big is your 2-year-old dataset? – user3707 Oct 20 '16 at 15:04
  • The view is exactly the query that is linked in the other answer. The largest table is more than 380GB, 1.6bn rows. – Pentium10 Oct 20 '16 at 16:24
  • We've tried to do the same using this query: `SELECT * FROM (SELECT *, row_number() over (partition BY id ORDER BY actual_scheduled_start DESC) seqnum FROM [PROJ:test2.jobs_final]) WHERE seqnum=1`. We've saved this as a view and are running a query on the view, but receive "Query exceeded resource limits for tier 1. Tier 32 or higher required.". We have read the other answer and the query looks the same - have you any ideas? – user3707 Oct 20 '16 at 19:45
  • Well, you could post that job as a new question with the relevant job ID, and maybe one of the engineers can look into it. – Pentium10 Oct 20 '16 at 19:51

0 Answers