I am working on building a prototype on BigQuery and need help with performance and cost analysis.
Requirements:
Build a data warehouse (star schema) for sales operations data (incentives, leads, entitlements, forecasts, marketing, etc.) for reporting and advanced analytics. Data is batch-loaded daily from CRM and other upstream sales/marketing systems. The volume is about 5 TB per day; roughly 90% is append-only and 10% updates older metrics (up to the previous 7 days).
Design:
Create a date-partitioned BigQuery fact table -- a lot of the queries are trending and time-series queries. Create aggregate tables for reporting and other repetitive queries/dashboards to limit scan volume and reduce cost.
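Roughly what I have in mind for the fact table, as a minimal sketch with the google-cloud-bigquery Python client. The project, dataset, table, and column names (my-project.sales_dw.sales_fact, event_date, ...) are placeholders and the schema is trimmed down:

```python
from google.cloud import bigquery

client = bigquery.Client()

schema = [
    bigquery.SchemaField("event_date", "DATE"),
    bigquery.SchemaField("lead_id", "STRING"),
    bigquery.SchemaField("revenue", "NUMERIC"),
]

table = bigquery.Table("my-project.sales_dw.sales_fact", schema=schema)
# Partition on the date column so trending / time-series queries only scan
# the partitions they touch.
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY,
    field="event_date",
)
client.create_table(table)
```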
Ingestion logic:
Upstream ETL and cron jobs drop load-ready files into Google Cloud Storage (the third-party integrations and other business logic are already implemented).
Temp tables (ODS) for updates: since the partitioned table cannot be updated with DML, copy the last 7 days' partitions into temp tables (ODS) to handle updates (the copy operation is free). Create one temp table per day so it is easier to copy back into the fact table.
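For the partition-to-ODS copy, something like this sketch (same placeholder names as above; one temp table per day, each partition copied out via its $YYYYMMDD partition decorator; copy jobs are free):

```python
import datetime
from google.cloud import bigquery

client = bigquery.Client()
today = datetime.date.today()

for days_back in range(1, 8):  # previous 7 days
    day = today - datetime.timedelta(days=days_back)
    suffix = day.strftime("%Y%m%d")
    copy_job = client.copy_table(
        f"my-project.sales_dw.sales_fact${suffix}",   # single partition
        f"my-project.sales_ods.ods_{suffix}",         # per-day temp table
        job_config=bigquery.CopyJobConfig(
            write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE
        ),
    )
    copy_job.result()  # wait for the copy to finish
```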
If a record exists in the source files, update the record in the temp table -- DML operation.
New records for older days are appended to the corresponding day's temp table.
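For these two steps, a sketch assuming the day's source records are first loaded into an illustrative staging table and that lead_id identifies a record (all names are placeholders); the UPDATE is the billed DML step:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Update existing records in the day's temp table from the staging table.
update_sql = """
UPDATE `my-project.sales_ods.ods_20240101` t
SET revenue = s.revenue
FROM `my-project.sales_ods.staging_20240101` s
WHERE t.lead_id = s.lead_id
"""
client.query(update_sql).result()

# Append source rows for that older day that are not in the temp table yet.
insert_sql = """
INSERT INTO `my-project.sales_ods.ods_20240101` (event_date, lead_id, revenue)
SELECT s.event_date, s.lead_id, s.revenue
FROM `my-project.sales_ods.staging_20240101` s
LEFT JOIN `my-project.sales_ods.ods_20240101` t ON t.lead_id = s.lead_id
WHERE t.lead_id IS NULL
"""
client.query(insert_sql).result()
```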
Truncate and replace (copy) the last 7 days' partitions from the temp tables back into the DW fact table -- the copy operation is free.
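For the write-back, a sketch of replacing a single partition (repeat for each of the 7 days, same placeholder names); WRITE_TRUNCATE onto the $YYYYMMDD partition decorator replaces only that partition, and copy jobs are free:

```python
from google.cloud import bigquery

client = bigquery.Client()
suffix = "20240101"  # illustrative day

copy_job = client.copy_table(
    f"my-project.sales_ods.ods_{suffix}",
    f"my-project.sales_dw.sales_fact${suffix}",
    job_config=bigquery.CopyJobConfig(
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE
    ),
)
copy_job.result()
```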
Append the new data (90% of the total) to the DW fact table -- data loads are free.
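For the daily append, a sketch of the load job from the GCS drop location; the bucket path and file format are assumptions (newline-delimited JSON here; CSV, Avro, or Parquet would only change source_format), and load jobs are free:

```python
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)
load_job = client.load_table_from_uri(
    "gs://my-bucket/sales/2024-01-01/*.json",   # illustrative path
    "my-project.sales_dw.sales_fact",
    job_config=job_config,
)
load_job.result()
```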
Questions:
Are there better design options for the given requirements?
DML operation performance: I need to update 500 MB - 1 TB of data. What is the performance impact of using DML in BigQuery, and are there any costs to consider?
How can I estimate and project the performance of the update and copy operations?