I have a use case where I need to read selected data from BigQuery by applying left joins across 20 different BQ tables, apply transformations to that data, and finally write it to a destination BQ table.
I had two approaches in mind for this, which I tried on sample data (10-12 million rows in total across the 20 tables), with the results below:
Approach 1 - Write a single BQ-compatible query that performs the entire join/transform process, run it through Apache Beam (using BigQueryIO.readTableRows()), and write the result to the destination BigQuery table (sketched below).
- Ran the job on the n1-standard-4 machine type; it finished in 6:24 minutes.
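For reference, here is a minimal sketch of what Approach 1 looks like in the Beam Java SDK. The project/dataset/table names and the query text are placeholders, not my actual schema; the point is that the joins and transformations run inside BigQuery, and Beam only relays the result:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class QueryPushdownPipeline {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // The 20-way left join and the transformations are all expressed in SQL,
    // so BigQuery's engine does the heavy lifting; Beam only moves rows.
    p.apply("ReadJoinedRows",
        BigQueryIO.readTableRows()
            .fromQuery("SELECT ... FROM `project.dataset.t1` a "
                + "LEFT JOIN `project.dataset.t2` b ON a.key = b.key "
                + "/* ... 18 more joins ... */")
            .usingStandardSql())
     .apply("WriteResult",
        BigQueryIO.writeTableRows()
            .to("project:dataset.final_table")
            // Destination table assumed to already exist.
            .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
            .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE));

    p.run().waitUntilFinish();
  }
}
```

Since all the heavy lifting happens in BQ, the Dataflow workers mostly just stream the query result into the final table, which presumably explains the shorter runtime.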
Approach 2 - Implement the entire process (reads, left joins, and transformations) in Apache Beam itself, to avoid consuming a large number of BQ slots (sketched below).
- Ran the job on the n1-standard-4 machine type; it finished in 14:50 minutes.
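And a minimal sketch of the Approach 2 shape, shown for just two of the 20 tables (the pattern repeats for the rest). The table names, the join_key column, and the naive column merge are placeholder assumptions; the left-join semantics come from CoGroupByKey:

```java
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.join.CoGbkResult;
import org.apache.beam.sdk.transforms.join.CoGroupByKey;
import org.apache.beam.sdk.transforms.join.KeyedPCollectionTuple;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TupleTag;
import org.apache.beam.sdk.values.TypeDescriptor;
import org.apache.beam.sdk.values.TypeDescriptors;

public class BeamSideJoinPipeline {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Read two of the 20 source tables and key each row by the join column.
    PCollection<KV<String, TableRow>> left = p
        .apply("ReadA", BigQueryIO.readTableRows().from("project:dataset.table_a"))
        .apply("KeyA", MapElements
            .into(TypeDescriptors.kvs(
                TypeDescriptors.strings(), TypeDescriptor.of(TableRow.class)))
            .via(row -> KV.of((String) row.get("join_key"), row)));

    PCollection<KV<String, TableRow>> right = p
        .apply("ReadB", BigQueryIO.readTableRows().from("project:dataset.table_b"))
        .apply("KeyB", MapElements
            .into(TypeDescriptors.kvs(
                TypeDescriptors.strings(), TypeDescriptor.of(TableRow.class)))
            .via(row -> KV.of((String) row.get("join_key"), row)));

    final TupleTag<TableRow> leftTag = new TupleTag<>();
    final TupleTag<TableRow> rightTag = new TupleTag<>();

    // Group both collections by key, then emit left-join semantics:
    // every left row is output, merged with matches when they exist.
    KeyedPCollectionTuple.of(leftTag, left)
        .and(rightTag, right)
        .apply(CoGroupByKey.create())
        .apply("LeftJoin", ParDo.of(new DoFn<KV<String, CoGbkResult>, TableRow>() {
          @ProcessElement
          public void processElement(ProcessContext c) {
            CoGbkResult grouped = c.element().getValue();
            for (TableRow l : grouped.getAll(leftTag)) {
              boolean matched = false;
              for (TableRow r : grouped.getAll(rightTag)) {
                matched = true;
                TableRow merged = l.clone();
                merged.putAll(r); // naive column merge; real logic would rename/select
                c.output(merged);
              }
              if (!matched) {
                c.output(l); // no match: keep the left row (left-join behavior)
              }
            }
          }
        }))
        .apply("WriteResult", BigQueryIO.writeTableRows()
            .to("project:dataset.final_table")
            .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
            .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE));

    p.run().waitUntilFinish();
  }
}
```

With 20 tables, this chain of CoGroupByKey stages means repeated shuffles on the Dataflow side, which is where I suspect the extra runtime comes from.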
The important thing to consider here is that the 20 source tables will keep growing over time, and we will likely be dealing with TBs of data in the near future.
Which approach will be more efficient, considering the data growth we expect?