BigQueryIO Read vs fromQuery

Question

Say that in a Dataflow/Apache Beam program I am trying to read a table whose data is growing exponentially. I want to improve the performance of the read.

BigQueryIO.Read.from("projectid:dataset.tablename")

or

BigQueryIO.Read.fromQuery("SELECT A, B FROM [projectid:dataset.tablename]")

Will the performance of my read improve if I select only the required columns, as in the second form above, rather than reading the entire table?

I am aware that selecting fewer columns reduces cost, but I would like to know how it affects read performance.

Graham Polley · Accepted Answer · 2019-01-29T08:24:28.783

You're right that selecting only the columns you need, rather than referencing every column in the query, reduces cost. Also, when you use from() instead of fromQuery(), you don't pay for any table scans in BigQuery. I'm not sure whether you were aware of that.

Under the hood, whenever Dataflow reads from BigQuery, it actually calls BigQuery's export API and instructs BigQuery to dump the table(s) to GCS as sharded files. Dataflow then reads these files in parallel into your pipeline. It does not read "directly" from BigQuery.

As such, yes, this might improve performance, because less data needs to be exported to GCS under the hood and then read into your pipeline, i.e. fewer columns = less data.

However, I'd also consider using partitioned tables, and then even think about clustering them too. Also, use WHERE clauses to even further reduce the amount of data to be exported and read.
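As a rough illustration of the last point, here is a minimal sketch of building the query text that would be handed to fromQuery(). The ProjectedQuery class and its build() method are hypothetical helpers, not part of Beam or Dataflow. The WHERE clause assumes an ingestion-time partitioned table, which exposes the _PARTITIONTIME pseudo-column, and the date is a made-up example.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not a Beam/Dataflow API): builds the SQL string
// that you would pass to BigQueryIO's fromQuery().
class ProjectedQuery {

    // table: "project:dataset.table", as in the question (legacy SQL syntax).
    // columns: only the columns the pipeline actually needs.
    // partitionDate: "YYYY-MM-DD", or null to skip the partition filter.
    // Assumption: the table is ingestion-time partitioned, so _PARTITIONTIME exists.
    // Inputs are concatenated as-is, so they must come from trusted code.
    public static String build(String table, List<String> columns, String partitionDate) {
        String sql = "SELECT " + String.join(", ", columns) + " FROM [" + table + "]";
        if (partitionDate != null) {
            sql += " WHERE _PARTITIONTIME = TIMESTAMP('" + partitionDate + "')";
        }
        return sql;
    }

    public static void main(String[] args) {
        String sql = build("projectid:dataset.tablename", Arrays.asList("A", "B"), "2019-01-29");
        // Prints:
        // SELECT A, B FROM [projectid:dataset.tablename] WHERE _PARTITIONTIME = TIMESTAMP('2019-01-29')
        System.out.println(sql);
    }
}
```

The resulting string would replace the literal in the fromQuery() call from the question. Fewer columns and a single partition both shrink what has to be exported to GCS and read back into the pipeline.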

Thanks, yes, I am aware of that, but I didn't actually know that fromQuery() also dumps to GCS. Can you confirm that you did mean fromQuery() also exports the required column data to GCS, using the same export API that is used for from()? — Roshan Fernando, Jan 29 '19 at 22:37
They both dump to GCS first before reading into the pipeline. — Graham Polley, Jan 29 '19 at 22:47
