How can I retrieve data from multiple BigQuery tables using Apache Beam and BigQueryIO in the best way?

Question

I understood from this thread that using ".fromQuery" is more expensive and slower than ".from", but what can I do if I need to retrieve data from multiple tables?

Currently I'm using an "INNER JOIN" query to do that, but how can I achieve the same result using ".from" (or similar)?

score 0 · Answer 1 · answered Feb 24 '20 at 10:27

0

Since you've intended to unify a data from multiple Bigquery tables leveraging BigQueryIO.Read.from() Apache Beam method, then probably you might be able to perform separate PCollection collecting data from each Bigquery table and then apply Join logic, affording some matching criteria on a table columns.

Take a look at this example, joining two Bigquery tables within particular Pcollections by previously transforming input data to tuple collections.

The above approach is a very similar to CoGroupByKey transformation method in Apache Beam SDK as a main concept for managing relational joins between Pcollections.

Read more in this thread about implementing Left Join transformation.

answered Feb 24 '20 at 10:27

Nick_Kh

5,089
2
10
16

Thanks for your answer. My problem is that the query I'm using is dynamically loaded depending on the context in which the pipeline was executed. The query can refer from one to n tables and with different criteria and "ON" statements. – Diego Arias Feb 26 '20 at 17:58
Can you elaborate more and share some example of pipeline code in order to get a point properly? – Nick_Kh Feb 27 '20 at 08:13
I can't show the code for security reasons, but imagine that I just use a query (recieved as a parameter) to get the input data for the pipeline using.fromQuery. In one case, the query is "select ... from t1 join t2 join t3" (3 tables involved), but in other case, the same pipeline executes this query "select * from t1" (just one table). And the idea of that code is to have just one pipeline to read multiple tables for multiple reasons, but the problem is that the query can change in any momment. I'm gonna try to show you some code, just let me ask for the permission. Thankyou very much. – Diego Arias Feb 27 '20 at 17:49
Here is the code. It is a single pipeline that recieves a query. "queryResultDescriptor" is the response structure: `pipeline.apply("SELECTING_STAGING_DATA", BigQueryIO.readTableRows().fromQuery(**query**).usingStandardSql()) .apply("MAPPING_STAGING_DATA", MapElements .into(TypeDescriptor.of(queryResultDescriptor)) .via((TableRow row) -> { Gson gs = new Gson(); return gs.fromJson(gs.toJson(row), queryResultDescriptor); }))` – Diego Arias Feb 27 '20 at 22:18
Thanks for the additional inputs from your side, I think that `BigQueryIO.Read.from()` method was designed to perform a bit different tasks, however can you explain what did you mean saying: _I'm using is dynamically loaded depending on the context in which the pipeline was executed_? – Nick_Kh Mar 05 '20 at 08:59

How can I retrieve data from multiple BigQuery tables using Apache Beam and BigQueryIO in the best way?

1 Answers1