
I've been playing around with Dataflow / BigQuery for a short while, but I still can't wrap my head around some fundamental things, such as when to use which method to query a table.

The one-line query option with BigQueryIO.Read is:
+ short and simple,
+ suitable for large results, which are returned as a PCollection,
- but it does not return the schema of the result table,
which makes it harder to export the results both to (1) .csv file(s) - table header, fields! - and (2) a BigQuery table - no schema! We have to define the table schema or the CSV header manually every time we want to save the query results to a BigQuery table or a .csv file (a sketch of this option follows below).
---> Is there an elegant, automatic way to do this?
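
For illustration, here is a minimal sketch of this option (Dataflow Java SDK 1.x; the project, dataset, and table names below are made up). The query comes back as a PCollection<TableRow>, but the schema for BigQueryIO.Write has to be spelled out by hand:

    import java.util.Arrays;

    import com.google.api.services.bigquery.model.TableFieldSchema;
    import com.google.api.services.bigquery.model.TableRow;
    import com.google.api.services.bigquery.model.TableSchema;
    import com.google.cloud.dataflow.sdk.Pipeline;
    import com.google.cloud.dataflow.sdk.io.BigQueryIO;
    import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
    import com.google.cloud.dataflow.sdk.values.PCollection;

    public class QueryToTable {
      public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        // One line to run the query and get the rows back as a PCollection.
        PCollection<TableRow> rows = p.apply(
            BigQueryIO.Read.named("ReadQuery")
                .fromQuery("SELECT word, word_count FROM [publicdata:samples.shakespeare]"));

        // ...but to write the result anywhere, the schema has to be rebuilt by hand.
        TableSchema schema = new TableSchema().setFields(Arrays.asList(
            new TableFieldSchema().setName("word").setType("STRING"),
            new TableFieldSchema().setName("word_count").setType("INTEGER")));

        rows.apply(BigQueryIO.Write.named("WriteResults")
            .to("my-project:my_dataset.query_results")
            .withSchema(schema)
            .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
            .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE));

        p.run();
      }
    }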

The other option is to query with Jobs: Query.
This again has advantages and disadvantages:
+ returns the table schema of the query result,
- requires authentication; in the end there is much more code for the same simple query,
-> asynchronous and synchronous modes;
- not suitable for large results,
except in asynchronous mode with the option allowLargeResults, which results in multiple queries in the background that need to be composed to get the full list of retrieved rows (??),
- the results can be saved to a table, but only an asynchronous query can save them to a permanent table (not just to a temporary one).
A sketch of this option follows below.
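
Here is a rough sketch of this second option using the BigQuery API Java client (assuming an already authenticated Bigquery client; project, dataset, and table names are again made up). It inserts an asynchronous query job with allowLargeResults and a permanent destination table, polls for completion, and then reads the schema from the destination table:

    import com.google.api.services.bigquery.Bigquery;
    import com.google.api.services.bigquery.model.Job;
    import com.google.api.services.bigquery.model.JobConfiguration;
    import com.google.api.services.bigquery.model.JobConfigurationQuery;
    import com.google.api.services.bigquery.model.Table;
    import com.google.api.services.bigquery.model.TableReference;
    import com.google.api.services.bigquery.model.TableSchema;

    public class AsyncQueryJob {
      // Runs the query as an asynchronous job and returns the schema of the result.
      public static TableSchema runQuery(Bigquery bigquery) throws Exception {
        TableReference destination = new TableReference()
            .setProjectId("my-project")
            .setDatasetId("my_dataset")
            .setTableId("query_results");

        JobConfigurationQuery queryConfig = new JobConfigurationQuery()
            .setQuery("SELECT word, word_count FROM [publicdata:samples.shakespeare]")
            .setDestinationTable(destination)
            .setAllowLargeResults(true)
            .setWriteDisposition("WRITE_TRUNCATE");

        Job job = new Job().setConfiguration(new JobConfiguration().setQuery(queryConfig));
        Job started = bigquery.jobs().insert("my-project", job).execute();

        // Poll until the job finishes (error handling omitted).
        String jobId = started.getJobReference().getJobId();
        while (true) {
          Job poll = bigquery.jobs().get("my-project", jobId).execute();
          if ("DONE".equals(poll.getStatus().getState())) {
            break;
          }
          Thread.sleep(1000);
        }

        // The schema is available from the permanent destination table.
        Table result = bigquery.tables().get("my-project", "my_dataset", "query_results").execute();
        return result.getSchema();
      }
    }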

The questions remain:

(1) Which method is better and when?

(2) If there is a very large amount of data in the table we query and we get a very large result, which query method is then preferred?

(3) Which export is preferred when dealing with very large results - a table or a .csv file?

user2107356
  • inspired by my previous question http://stackoverflow.com/questions/34518795/get-tableschema-from-bigquery-result-pcollectiontablerow – user2107356 Jan 03 '16 at 02:38
  • I'm trying to answer this question, but it's unclear what you actually want. You're asking about some very low-level issues -- can you please describe the actual end-to-end data processing that you're trying to do? – Dan Halperin Jan 04 '16 at 04:02
  • Note that if all you want to do is save the result of a query, you should probably just use BigQuery's asynchronous API to run the query and then export CSV to GCS. – Dan Halperin Jan 04 '16 at 04:03
  • Regarding "Asynchronous vs Synchronous" query mode: there is no truly synchronous way to run a query in BigQuery. If you read carefully, jobs.query requires polling to wait for job completion prior to downloading data, just like jobs.insert does. For the purpose of your question, I'd stick to jobs.insert. – Michael Sheldon Jan 04 '16 at 18:36

1 Answer


Note that BigQueryIO.Read reads a table, and does not execute a query. The helper .fromQuery() lets the source execute a query and then read the destination table that the query results were written to. It may be useful to separate query execution from table reading.
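
To make the distinction concrete, a minimal sketch (Dataflow Java SDK 1.x; the table name and query are placeholders):

    import com.google.api.services.bigquery.model.TableRow;
    import com.google.cloud.dataflow.sdk.Pipeline;
    import com.google.cloud.dataflow.sdk.io.BigQueryIO;
    import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
    import com.google.cloud.dataflow.sdk.values.PCollection;

    public class ReadVsQuery {
      public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        // Read an existing table directly -- no query job is run.
        PCollection<TableRow> tableRows = p.apply(
            BigQueryIO.Read.named("ReadTable").from("my-project:my_dataset.my_table"));

        // Run a query first; the source then reads the destination table
        // that BigQuery wrote the query results to.
        PCollection<TableRow> queryRows = p.apply(
            BigQueryIO.Read.named("ReadQueryResult")
                .fromQuery("SELECT word, word_count FROM [publicdata:samples.shakespeare]"));

        p.run();
      }
    }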

We can think of BigQueryIO.Read as running a jobs.insert() for an export operation to Google Cloud Storage and then processing the exported data from GCS.

Look at "Multiple Wildcard URIs" at https://cloud.google.com/bigquery/exporting-data-from-bigquery#exportingmultiple for efficient parallel processing of data exported to GCS: you can consume (small) shards of the exported data while the export is still running, so the parallel read operations do not have to block until the entire BigQuery export job has completed. Multiple wildcards let you determine the parallelism of the workers up front, and BigQuery writes the data in order, so each worker can process its shards independently and can tell when its subset of the data has been completely processed.
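
As an illustration, here is a rough sketch of such an extract job with multiple wildcard URIs, using the BigQuery API Java client (the client is assumed to be authenticated, and the bucket and table names are made up); worker i would then read only the files matching gs://my-bucket/export-i-*.csv:

    import java.io.IOException;
    import java.util.Arrays;

    import com.google.api.services.bigquery.Bigquery;
    import com.google.api.services.bigquery.model.Job;
    import com.google.api.services.bigquery.model.JobConfiguration;
    import com.google.api.services.bigquery.model.JobConfigurationExtract;
    import com.google.api.services.bigquery.model.TableReference;

    public class ExportWithWildcards {
      // Starts an export of a table to GCS, sharded across three wildcard patterns.
      public static Job startExport(Bigquery bigquery) throws IOException {
        JobConfigurationExtract extract = new JobConfigurationExtract()
            .setSourceTable(new TableReference()
                .setProjectId("my-project")
                .setDatasetId("my_dataset")
                .setTableId("my_table"))
            .setDestinationUris(Arrays.asList(
                "gs://my-bucket/export-0-*.csv",
                "gs://my-bucket/export-1-*.csv",
                "gs://my-bucket/export-2-*.csv"))
            .setDestinationFormat("CSV");

        Job job = new Job().setConfiguration(new JobConfiguration().setExtract(extract));
        return bigquery.jobs().insert("my-project", job).execute();
      }
    }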

Cloud Dataflow supports custom sources (https://cloud.google.com/dataflow/model/custom-io#creating-sources), so you can replicate BigQueryIO.Read with a version that has a side channel for the schema of the table you are operating on (*).

(*) I'm not a Cloud Dataflow expert, so I'm not sure how to build this side channel. Perhaps each element exported from this custom source could be the table's schema plus a blob, and you could wrap the source in a ParDo that takes the schema and converts the blob according to that schema? At that point your code would see both the schema and the data for each row. This doesn't seem like the most efficient way to solve the problem, but it might work for you.
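
Short of writing a full custom source, one simple approximation of that idea is to fetch the schema with tables().get() at pipeline-construction time and reuse the field names both as a CSV header and to format each row (Dataflow Java SDK 1.x; all names are placeholders, and CSV quoting/escaping is ignored):

    import java.util.ArrayList;
    import java.util.List;

    import com.google.api.services.bigquery.Bigquery;
    import com.google.api.services.bigquery.model.Table;
    import com.google.api.services.bigquery.model.TableFieldSchema;
    import com.google.api.services.bigquery.model.TableRow;
    import com.google.api.services.bigquery.model.TableSchema;
    import com.google.cloud.dataflow.sdk.Pipeline;
    import com.google.cloud.dataflow.sdk.io.BigQueryIO;
    import com.google.cloud.dataflow.sdk.io.TextIO;
    import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
    import com.google.cloud.dataflow.sdk.transforms.DoFn;
    import com.google.cloud.dataflow.sdk.transforms.ParDo;

    public class SchemaSideChannel {
      public static void run(Bigquery bigquery, String[] args) throws Exception {
        // Look up the schema before the pipeline is built.
        Table table = bigquery.tables().get("my-project", "my_dataset", "my_table").execute();
        TableSchema schema = table.getSchema();

        final List<String> fieldNames = new ArrayList<>();
        for (TableFieldSchema field : schema.getFields()) {
          fieldNames.add(field.getName());
        }
        // The CSV header line; it could be written separately or prepended
        // when the exported shards are composed.
        final String header = String.join(",", fieldNames);

        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
        p.apply(BigQueryIO.Read.named("ReadTable").from("my-project:my_dataset.my_table"))
         .apply(ParDo.named("RowToCsv").of(new DoFn<TableRow, String>() {
            @Override
            public void processElement(ProcessContext c) {
              TableRow row = c.element();
              List<String> values = new ArrayList<>();
              for (String name : fieldNames) {
                Object value = row.get(name);
                values.add(value == null ? "" : value.toString());  // quoting omitted
              }
              c.output(String.join(",", values));
            }
          }))
         .apply(TextIO.Write.named("WriteCsv").to("gs://my-bucket/results.csv"));

        p.run();
      }
    }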

Michael Sheldon