
I am working on building a prototype on BigQuery for performance and cost analysis.

Requirements:

Build a DW (star schema) for sales operations data (incentives, leads, entitlements, forecasts, marketing, etc.) for reporting and advanced analytics. Data is batch-loaded daily from CRM and other upstream sales/marketing systems. Volume is about 5 TB per day; roughly 90% is append-only and 10% is updates to older metrics (up to the previous 7 days).

Design:

  • Create a date-partitioned BigQuery fact table -- a lot of queries are trending and time-series type queries (a sketch follows this list).

  • Create aggregate tables for reporting and other repetitive queries/dashboards to limit scan volume and reduce cost.
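
Something like this is what I have in mind for the table layout, sketched with the google-cloud-bigquery Python client; the project, dataset, table, and column names are placeholders:

    # Sketch only: a date-partitioned fact table plus one aggregate table.
    # Project, dataset, table, and column names are placeholders.
    from google.cloud import bigquery

    client = bigquery.Client()

    # Fact table partitioned on the date column used by trending/time-series queries.
    fact = bigquery.Table(
        "my-project.sales_dw.fact_sales",
        schema=[
            bigquery.SchemaField("sale_date", "DATE"),
            bigquery.SchemaField("lead_id", "STRING"),
            bigquery.SchemaField("revenue", "NUMERIC"),
        ],
    )
    fact.time_partitioning = bigquery.TimePartitioning(
        type_=bigquery.TimePartitioningType.DAY, field="sale_date"
    )
    client.create_table(fact, exists_ok=True)

    # Aggregate table refreshed after each load so dashboards scan far less data.
    client.query(
        """
        CREATE OR REPLACE TABLE `my-project.sales_dw.agg_daily_sales` AS
        SELECT sale_date, SUM(revenue) AS total_revenue
        FROM `my-project.sales_dw.fact_sales`
        GROUP BY sale_date
        """
    ).result()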

Ingestion logic:

  1. Upstream ETL and cron jobs drop load-ready files into Google Cloud Storage (third-party and other business logic is already implemented there).

  2. Temp tables (ODS) for updates: since the partitioned table cannot be updated with DML, copy the last 7 days of partitions to temp tables (ODS) to handle updates (copy operations are free). Create one temp table per day so it is easier to copy back to the fact table.

  3. If a record exists in the source files, update that record in the temp table -- a DML operation.

  4. New records for older days are appended to the corresponding day's temp table.

  5. Truncate and replace (copy) the last 7 days of partitions from the temp tables back to the DW fact table -- copy operations are free.

  6. Append the new data (90% of the total) to the DW fact table -- data loads are free. (A sketch of steps 2-6 follows this list.)
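
Roughly, steps 2-6 would look like this with the google-cloud-bigquery Python client, using the table$YYYYMMDD partition decorator to copy or overwrite a single day. All project/dataset/table names and the MERGE columns below are placeholders:

    # Sketch of ingestion steps 2-6; all names are placeholders.
    from google.cloud import bigquery

    client = bigquery.Client()
    FACT = "my-project.sales_dw.fact_sales"
    day = "2018-06-25"           # one of the last 7 days
    part = day.replace("-", "")  # partition decorator suffix, e.g. 20180625

    # 2. Copy one day's partition into a per-day temp (ODS) table -- copies are free.
    temp = f"my-project.sales_ods.fact_sales_{part}"
    client.copy_table(f"{FACT}${part}", temp).result()

    # 3./4. One MERGE applies the updates and appends late-arriving rows for that day.
    client.query(
        f"""
        MERGE `{temp}` t
        USING (
          SELECT * FROM `my-project.sales_stg.daily_delta`
          WHERE sale_date = DATE '{day}'
        ) s
        ON t.lead_id = s.lead_id
        WHEN MATCHED THEN UPDATE SET t.revenue = s.revenue
        WHEN NOT MATCHED THEN INSERT ROW
        """
    ).result()

    # 5. Copy the temp table back over the fact partition (truncate and replace).
    copy_config = bigquery.CopyJobConfig(write_disposition="WRITE_TRUNCATE")
    client.copy_table(temp, f"{FACT}${part}", job_config=copy_config).result()

    # 6. Append the new (90%) data straight from the GCS load-ready files -- loads are free.
    load_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
        write_disposition="WRITE_APPEND",
    )
    client.load_table_from_uri(
        "gs://my-bucket/load-ready/*.json", FACT, job_config=load_config
    ).result()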

Questions

Are there better design options for the given requirements?

DML operation performance: I need to update 500 MB to 1 TB of data. What is the performance impact of using DML operations in BigQuery, and are there any costs to consider?

How can I estimate and project the performance of the update and copy operations?

Tim S
  • This is a great question, would love to see an answer. I think the problem is that to answer it would require doing everything in the question. Try doing it yourself and report back with the results. – Harry Moreno Jun 28 '18 at 17:42

2 Answers


Regarding the star schema, the preferred method in BigQuery is to denormalize the data using nested and repeated structures. IMO you can see BigQuery as a very powerful database which can be used as a DW.
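
For example, a hypothetical orders table where the line items are stored as a nested, repeated RECORD instead of a separate dimension table (sketched with the Python client; all names are made up):

    # Sketch: a denormalized table where line items are a repeated, nested
    # RECORD instead of a separate dimension table. Names are made up.
    from google.cloud import bigquery

    client = bigquery.Client()

    schema = [
        bigquery.SchemaField("order_id", "STRING"),
        bigquery.SchemaField("order_date", "DATE"),
        bigquery.SchemaField(
            "line_items",
            "RECORD",
            mode="REPEATED",
            fields=[
                bigquery.SchemaField("sku", "STRING"),
                bigquery.SchemaField("quantity", "INTEGER"),
                bigquery.SchemaField("amount", "NUMERIC"),
            ],
        ),
    ]
    client.create_table(
        bigquery.Table("my-project.sales_dw.orders", schema=schema), exists_ok=True
    )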

For your time-series needs you can use ingestion-time partitioned tables.
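
A small sketch, with made-up names: create a table partitioned by ingestion time (no partitioning column) and prune partitions in queries with the _PARTITIONTIME pseudo-column:

    # Sketch: ingestion-time partitioned table (no partitioning column) plus a
    # query pruned by the _PARTITIONTIME pseudo-column. Names are placeholders.
    from google.cloud import bigquery

    client = bigquery.Client()

    table = bigquery.Table(
        "my-project.sales_dw.fact_events",
        schema=[
            bigquery.SchemaField("lead_id", "STRING"),
            bigquery.SchemaField("event_type", "STRING"),
        ],
    )
    table.time_partitioning = bigquery.TimePartitioning(
        type_=bigquery.TimePartitioningType.DAY  # no field => ingestion time
    )
    client.create_table(table, exists_ok=True)

    # Time-series query that only scans the last 7 days of partitions.
    sql = """
    SELECT DATE(_PARTITIONTIME) AS day, COUNT(*) AS events
    FROM `my-project.sales_dw.fact_events`
    WHERE _PARTITIONTIME >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
    GROUP BY day
    ORDER BY day
    """
    rows = client.query(sql).result()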

Instead of cron jobs you can use scheduled queries, or Cloud Functions with GCS triggers, which fire when a new file is added to GCS.
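
For example, a minimal background Cloud Function (1st gen) bound to the object-finalize event of your landing bucket; the bucket and destination table names are placeholders:

    # Sketch of a 1st-gen background Cloud Function bound to the
    # google.storage.object.finalize event of the landing bucket.
    # The destination table is a placeholder.
    from google.cloud import bigquery


    def load_new_file(event, context):
        """Loads a newly finalized GCS object into the fact table."""
        uri = f"gs://{event['bucket']}/{event['name']}"
        client = bigquery.Client()
        job_config = bigquery.LoadJobConfig(
            source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
            write_disposition="WRITE_APPEND",
        )
        client.load_table_from_uri(
            uri, "my-project.sales_dw.fact_sales", job_config=job_config
        ).result()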

Regarding your ingestion logic, I would recommend using Cloud Composer, which is the managed Airflow service on GCP.

You will be able to do everything you described using the BigQuery operators. With these operators you won't need DDL statements to create the tables, since you can use BigQueryCreateEmptyTableOperator, and with Airflow you won't need cron jobs.
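
For example, a minimal DAG that creates the partitioned fact table with BigQueryCreateEmptyTableOperator (names, schedule, and schema below are just examples):

    # Sketch of a Composer/Airflow DAG that creates the partitioned fact table
    # with BigQueryCreateEmptyTableOperator. Names, schedule, and schema are
    # just examples.
    from datetime import datetime

    from airflow import DAG
    from airflow.providers.google.cloud.operators.bigquery import (
        BigQueryCreateEmptyTableOperator,
    )

    with DAG(
        dag_id="sales_dw_setup",
        start_date=datetime(2018, 6, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        create_fact = BigQueryCreateEmptyTableOperator(
            task_id="create_fact_table",
            project_id="my-project",
            dataset_id="sales_dw",
            table_id="fact_sales",
            schema_fields=[
                {"name": "sale_date", "type": "DATE", "mode": "NULLABLE"},
                {"name": "lead_id", "type": "STRING", "mode": "NULLABLE"},
                {"name": "revenue", "type": "NUMERIC", "mode": "NULLABLE"},
            ],
            time_partitioning={"type": "DAY", "field": "sale_date"},
        )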

Regarding the cost, you can use the GCP pricing calculator.
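
You can also estimate how many bytes a statement would process before executing it by issuing a dry run; a small sketch with a placeholder query:

    # Sketch: dry-run a query to see how many bytes it would scan (and hence
    # what it would cost on-demand) before actually running it.
    from google.cloud import bigquery

    client = bigquery.Client()
    job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    job = client.query(
        "SELECT * FROM `my-project.sales_dw.fact_sales` "
        "WHERE sale_date = '2018-06-25'",
        job_config=job_config,
    )
    tb = job.total_bytes_processed / 1024 ** 4
    print(f"Would process {job.total_bytes_processed} bytes (~{tb:.3f} TB)")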

I hope this helps!

Nathan Nasser

This workflow is nearly identical to what I set up on GCP for Sales/Marketing/Product data.

My workflow consists of a combination of Python, BigQuery Standard SQL, Google Cloud Composer (GCP Airflow instance), and Google Container Registry.

Our workflows are developed locally in a Python class that handles all of the steps needed for ETL replication. Most of the actual replication steps are just DML statements in BigQuery, but we rely on the Python class for standardization, code reuse, and execution order/logging.
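
A stripped-down sketch of that kind of wrapper (not our actual code; the names and SQL are placeholders):

    # Sketch of a thin wrapper that runs an ordered list of BigQuery SQL/DML
    # steps with logging; entirely illustrative, names and SQL are placeholders.
    import logging

    from google.cloud import bigquery


    class ReplicationJob:
        def __init__(self, steps):
            self.client = bigquery.Client()
            self.steps = steps  # ordered list of (step_name, sql) tuples

        def run(self):
            for name, sql in self.steps:
                logging.info("Running step %s", name)
                job = self.client.query(sql)
                job.result()  # block so steps execute in order
                logging.info("%s processed %s bytes", name, job.total_bytes_processed)


    ReplicationJob([
        ("touch_up_revenue",
         "UPDATE `my-project.sales_dw.fact_sales` "
         "SET revenue = revenue * 1.0 WHERE sale_date = '2018-06-25'"),
    ]).run()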

We package that up into a Docker container and push it to Google Container Registry.

We use the Google Cloud Composer (Airflow) KubernetesPodOperator to run the Python files in the container. This way we don't have to worry about any of the DevOps problems of maintaining virtual machines.

Airflow is great for scheduling and dependency graphing. There isn't an easy way in BigQuery itself to schedule queries and maintain execution order for your aggregate tables. With Airflow you can easily kick off those aggregate queries once your ETL pipeline completes (see the sketch below).
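
A sketch of that pattern (the image, SQL, and names are illustrative, not our actual code; the KubernetesPodOperator import path differs on older Airflow/provider versions):

    # Sketch of the pattern: run the containerized ETL, then refresh the
    # aggregate table only after the ETL task succeeds. Image, SQL, and names
    # are illustrative.
    from datetime import datetime

    from airflow import DAG
    from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import (
        KubernetesPodOperator,
    )
    from airflow.providers.google.cloud.operators.bigquery import (
        BigQueryInsertJobOperator,
    )

    with DAG(
        dag_id="sales_etl",
        start_date=datetime(2021, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        etl = KubernetesPodOperator(
            task_id="run_etl_container",
            name="sales-etl",
            namespace="default",
            image="gcr.io/my-project/sales-etl:latest",
            cmds=["python", "replicate.py"],
        )

        refresh_aggregates = BigQueryInsertJobOperator(
            task_id="refresh_aggregates",
            configuration={
                "query": {
                    "query": (
                        "CREATE OR REPLACE TABLE `my-project.sales_dw.agg_daily_sales` AS "
                        "SELECT sale_date, SUM(revenue) AS total_revenue "
                        "FROM `my-project.sales_dw.fact_sales` GROUP BY sale_date"
                    ),
                    "useLegacySql": False,
                }
            },
        )

        # Aggregates run only once the ETL pod has completed successfully.
        etl >> refresh_aggregates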

I am not 100% sure what costs you could expect, but storage and processing are pretty cheap in BigQuery. Composer and Kubernetes are where the bulk of the spend goes for us.

While this workflow takes a couple of days to set up, it gets a lot more powerful when you need to get into predictive/prescriptive analytics with your sales data. With Python/Airflow you can do all of your ETL, run your aggregation queries, do any machine learning analysis with R or scikit-learn, and then deliver the results back to BigQuery and your CRM, all in the correct order on whatever schedule you want.

If you want some code samples or help with the quickstarts on the GCP website, you can DM me.

chadf
  • Do you have different layers inside BQ (Loaded, Cleansed, Accumulated, Processed, etc.)? If so, do you perform the ETL processes between these layers inside BQ itself? – Ashika Umanga Umagiliya Oct 02 '20 at 02:27
  • Yes. We have datasets that iteratively build from raw data to processed. We do 90% of the transformation between these layers right in BigQuery using a tool called dbt. It helps with lineage graphs, timing, testing, documentation, etc. You can use the paid version or their open-source CLI. – chadf Jun 07 '21 at 17:56