2

When you need to read all the data from one or more tables in bigquery in a dataflow job there are two approaches to it I would say. The first one is to use BigQueryIO with from, which reads the table in question, and the second approach is to use fromQuery where you specify a query that reads all the data from the same table. So my question is:

  • Is it any cost or performance benefit for using one over the other?

I haven't find anything in the docs about this, but I would really like to know. I imagine that maybe read is faster since you don't need to run a query that scans the data, meaning it is more similar to the preview functionality you have in BigQuery UI. If that is true it might also be much cheaper, but it make sense if they both cost the same.

So in short, what is the difference between:

BigQueryIO.read(...).from(tableName)

And

BigQueryIO.read(...).fromQuery("SELECT * FROM " + tableName)
Tomas Jansson
  • 22,767
  • 13
  • 83
  • 137

1 Answers1

9

from is both cheaper and faster than fromQuery(SELECT * FROM ...).

  • from directly exports the table and exporting data is free for BigQuery.
  • fromQuery(SELECT * FROM ...) will first scan the entire table ($5/TB) and export the result.
Jiayuan Ma
  • 1,891
  • 2
  • 13
  • 25
  • 1
    Do you have any docs to point me to, or how do you know this? – Tomas Jansson Jan 28 '18 at 21:07
  • 1
    It's in the SDK implementation. `from` corresponds to [`BigQueryTableSource`](https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryTableSource.java) where it kicks off an export directly. `fromQuery` corresponds to [`BigQueryQuerySource`](https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryQuerySource.java) where it creates temp dataset, executes query, exports and cleans up temp resources. – Jiayuan Ma Jan 28 '18 at 22:27
  • 1
    @JiayuanMa is spot on. `read()` invokes an export of the table to GCS under-the-hood (free), whereas `readFromQuery()` issues an actual SQL query to BigQuery and you pay for it. – Graham Polley Jan 28 '18 at 23:53