I have a use case where I need to read selected data from BigQuery by applying left joins across 20 different BQ tables, apply transformations to that data, and finally write it to a destination BQ table.
I had two approaches in mind for this, which I tried on sample data (10-12 million rows in total across the 20 tables), with the results below:
Approach 1 - Write a single BQ-compatible query that performs the entire join/transform process, run it through Apache Beam (using BigQueryIO.readTableRows()), and write the result to the destination BigQuery table (sketched below).
- Ran the job on the n1-standard-4 machine type; it finished in 6:24 minutes.
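For reference, here is a minimal sketch of what Approach 1 looks like in the Beam Java SDK. The project/dataset/table names and the query text are placeholders, not my actual schema; the point is that the joins and transformations run inside BigQuery, and Beam only relays the result:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class QueryPushdownPipeline {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // The 20-way left join and the transformations are all expressed in SQL,
    // so BigQuery's engine does the heavy lifting; Beam only moves rows.
    p.apply("ReadJoinedRows",
        BigQueryIO.readTableRows()
            .fromQuery("SELECT ... FROM `project.dataset.t1` a "
                + "LEFT JOIN `project.dataset.t2` b ON a.key = b.key "
                + "/* ... 18 more joins ... */")
            .usingStandardSql())
     .apply("WriteResult",
        BigQueryIO.writeTableRows()
            .to("project:dataset.final_table")
            // Destination table assumed to already exist.
            .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
            .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE));

    p.run().waitUntilFinish();
  }
}
```

Since all the heavy lifting happens in BQ, the Dataflow workers mostly just stream the query result into the final table, which presumably explains the shorter runtime.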
Approach 2 - Implement the entire process (reads, left joins, and transformations) in Apache Beam itself, to avoid consuming a large number of BQ slots (sketched below).
- Ran the job on the n1-standard-4 machine type; it finished in 14:50 minutes.
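And a minimal sketch of the Approach 2 shape, shown for just two of the 20 tables (the pattern repeats for the rest). The table names, the join_key column, and the naive column merge are placeholder assumptions; the left-join semantics come from CoGroupByKey:

```java
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.join.CoGbkResult;
import org.apache.beam.sdk.transforms.join.CoGroupByKey;
import org.apache.beam.sdk.transforms.join.KeyedPCollectionTuple;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TupleTag;
import org.apache.beam.sdk.values.TypeDescriptor;
import org.apache.beam.sdk.values.TypeDescriptors;

public class BeamSideJoinPipeline {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Read two of the 20 source tables and key each row by the join column.
    PCollection<KV<String, TableRow>> left = p
        .apply("ReadA", BigQueryIO.readTableRows().from("project:dataset.table_a"))
        .apply("KeyA", MapElements
            .into(TypeDescriptors.kvs(
                TypeDescriptors.strings(), TypeDescriptor.of(TableRow.class)))
            .via(row -> KV.of((String) row.get("join_key"), row)));

    PCollection<KV<String, TableRow>> right = p
        .apply("ReadB", BigQueryIO.readTableRows().from("project:dataset.table_b"))
        .apply("KeyB", MapElements
            .into(TypeDescriptors.kvs(
                TypeDescriptors.strings(), TypeDescriptor.of(TableRow.class)))
            .via(row -> KV.of((String) row.get("join_key"), row)));

    final TupleTag<TableRow> leftTag = new TupleTag<>();
    final TupleTag<TableRow> rightTag = new TupleTag<>();

    // Group both collections by key, then emit left-join semantics:
    // every left row is output, merged with matches when they exist.
    KeyedPCollectionTuple.of(leftTag, left)
        .and(rightTag, right)
        .apply(CoGroupByKey.create())
        .apply("LeftJoin", ParDo.of(new DoFn<KV<String, CoGbkResult>, TableRow>() {
          @ProcessElement
          public void processElement(ProcessContext c) {
            CoGbkResult grouped = c.element().getValue();
            for (TableRow l : grouped.getAll(leftTag)) {
              boolean matched = false;
              for (TableRow r : grouped.getAll(rightTag)) {
                matched = true;
                TableRow merged = l.clone();
                merged.putAll(r); // naive column merge; real logic would rename/select
                c.output(merged);
              }
              if (!matched) {
                c.output(l); // no match: keep the left row (left-join behavior)
              }
            }
          }
        }))
        .apply("WriteResult", BigQueryIO.writeTableRows()
            .to("project:dataset.final_table")
            .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
            .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE));

    p.run().waitUntilFinish();
  }
}
```

With 20 tables, this chain of CoGroupByKey stages means repeated shuffles on the Dataflow side, which is where I suspect the extra runtime comes from.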
The important thing to consider here is that the 20 source tables will keep growing over time, and we will likely be dealing with TBs of data in the near future.
Which approach will be more efficient, considering the data growth we expect?