I've been playing around with Dataflow / BigQuery for a short while, but I still can't wrap my head around one fundamental thing: when to use which method to query a table.
The one-line query option with BigQueryIO.Read is:
+ short and simple,
+ suitable for large results, which are returned as a PCollection,
- but it does not return the table schema of the result, which makes it harder to export it both to (1) .csv file(s), where we need the table header / field names, and (2) a BigQuery table, where we need a schema!
We need to manually define the table schema (or the CSV header fields) every time we want to save query results to a BigQuery table or a .csv file; see the sketch below.
---> Is there an elegant way to do this automatically?
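To illustrate what I mean, here is a minimal sketch of option 1, assuming the Dataflow Java SDK (1.x); project, dataset, table and field names are placeholders. The query result comes back as a PCollection<TableRow>, but the schema for the output table has to be spelled out by hand:

```java
import java.util.Arrays;

import com.google.api.services.bigquery.model.TableFieldSchema;
import com.google.api.services.bigquery.model.TableRow;
import com.google.api.services.bigquery.model.TableSchema;
import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.io.BigQueryIO;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.values.PCollection;

public class QueryWithBigQueryIO {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // The "one-liner": the query result comes back as a PCollection<TableRow>,
    // but without any schema attached to it.
    PCollection<TableRow> rows = p.apply(
        BigQueryIO.Read.fromQuery("SELECT name, value FROM [my_dataset.my_table]"));

    // To write the result back to BigQuery we must define the schema by hand,
    // duplicating what the query itself already "knows".
    TableSchema schema = new TableSchema().setFields(Arrays.asList(
        new TableFieldSchema().setName("name").setType("STRING"),
        new TableFieldSchema().setName("value").setType("INTEGER")));

    rows.apply(BigQueryIO.Write
        .to("MyProject:my_dataset.results")
        .withSchema(schema)
        .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
        .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE));

    p.run();
  }
}
```

Exporting to .csv has the same problem: the header line with the field names has to be written out manually as well.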
The other option is to query with the Jobs: Query API.
This again has advantages and disadvantages:
+ returns the table schema of the query result,
- requires authentication and, in the end, much more code for the very same simple query (see the sketch after this list),
-> can run in asynchronous or synchronous mode,
- not suitable for large results, except in asynchronous mode with the allowLargeResults option, in which case the result is fetched in multiple pages in the background that need to be stitched together to get the full list of retrieved rows (??),
- the results can be saved into a table, but only an asynchronous query can save them into a permanent table (not just a temporary one).
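For comparison, here is a rough sketch of the same query through Jobs: Query with the BigQuery Java API client, run asynchronously with allowLargeResults and a permanent destination table (again, project/dataset/table names are placeholders and error handling is omitted). This is what I mean by "much more coding", but the response does carry the result schema:

```java
import com.google.api.client.googleapis.auth.oauth2.GoogleCredential;
import com.google.api.client.googleapis.javanet.GoogleNetHttpTransport;
import com.google.api.client.json.jackson2.JacksonFactory;
import com.google.api.services.bigquery.Bigquery;
import com.google.api.services.bigquery.BigqueryScopes;
import com.google.api.services.bigquery.model.GetQueryResultsResponse;
import com.google.api.services.bigquery.model.Job;
import com.google.api.services.bigquery.model.JobConfiguration;
import com.google.api.services.bigquery.model.JobConfigurationQuery;
import com.google.api.services.bigquery.model.TableReference;
import com.google.api.services.bigquery.model.TableRow;

public class QueryWithJobsApi {
  private static final String PROJECT_ID = "MyProject"; // placeholder

  public static void main(String[] args) throws Exception {
    // Authentication boilerplate that BigQueryIO.Read does not need.
    GoogleCredential credential =
        GoogleCredential.getApplicationDefault().createScoped(BigqueryScopes.all());
    Bigquery bigquery = new Bigquery.Builder(
        GoogleNetHttpTransport.newTrustedTransport(),
        JacksonFactory.getDefaultInstance(),
        credential).setApplicationName("query-example").build();

    // Asynchronous query with allowLargeResults; this requires a permanent
    // destination table for the result.
    Job job = new Job().setConfiguration(new JobConfiguration().setQuery(
        new JobConfigurationQuery()
            .setQuery("SELECT name, value FROM [my_dataset.my_table]")
            .setAllowLargeResults(true)
            .setDestinationTable(new TableReference()
                .setProjectId(PROJECT_ID)
                .setDatasetId("my_dataset")
                .setTableId("results"))
            .setWriteDisposition("WRITE_TRUNCATE")));
    String jobId = bigquery.jobs().insert(PROJECT_ID, job).execute()
        .getJobReference().getJobId();

    // Page through the result. Unlike BigQueryIO.Read, the response carries the
    // schema of the result, but the rows arrive in pages that we have to
    // stitch together ourselves.
    String pageToken = null;
    boolean done = false;
    while (!done) {
      GetQueryResultsResponse response = bigquery.jobs()
          .getQueryResults(PROJECT_ID, jobId)
          .setTimeoutMs(60_000L)      // wait up to 60 s for the job to finish
          .setPageToken(pageToken)
          .execute();
      if (!Boolean.TRUE.equals(response.getJobComplete())) {
        continue;                     // query still running, poll again
      }
      System.out.println("Schema: " + response.getSchema()); // schema for free
      if (response.getRows() != null) {
        for (TableRow row : response.getRows()) {
          // ... process one row ...
        }
      }
      pageToken = response.getPageToken();
      done = (pageToken == null);     // no more pages
    }
  }
}
```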
The questions remain:
(1) Which method is better and when?
(2) If the table we query holds a very large amount of data and we get a very large result, which query method is then preferred?
(3) Which export is preferred when dealing with very large results - a table or a .csv file?