Could someone please share the syntax to read from and write to a BigQuery table in a pipeline written in Python for GCP Dataflow?
3 Answers
10
Run on Dataflow
First, construct a Pipeline with the following options for it to run on GCP Dataflow:

import apache_beam as beam

options = {'project': <project>,
           'runner': 'DataflowRunner',
           'region': <region>,
           'setup_file': <setup.py file>}
pipeline_options = beam.pipeline.PipelineOptions(flags=[], **options)
pipeline = beam.Pipeline(options=pipeline_options)
Read from BigQuery
Define a BigQuerySource with your query and use beam.io.Read to read data from BQ:

BQ_source = beam.io.BigQuerySource(query=<query>)
BQ_data = pipeline | beam.io.Read(BQ_source)
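If you want to read an entire table rather than the result of a query, BigQuerySource also accepts a table reference. A minimal sketch, using hypothetical placeholder names:

# Read a whole table instead of a query result (placeholder names).
BQ_source = beam.io.BigQuerySource(table='source_table',
                                   dataset='your_dataset',
                                   project='your-project-id')
BQ_data = pipeline | 'ReadTable' >> beam.io.Read(BQ_source)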
Write to BigQuery
There are two options to write to BigQuery:

- use a BigQuerySink and beam.io.Write:

  BQ_sink = beam.io.BigQuerySink(<table>, dataset=<dataset>, project=<project>)
  BQ_data | beam.io.Write(BQ_sink)

- use beam.io.WriteToBigQuery:

  BQ_data | beam.io.WriteToBigQuery(<table>, dataset=<dataset>, project=<project>)
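Putting these pieces together, a minimal end-to-end sketch (the project, region, bucket, query and table names below are placeholders, not values from the question):

import apache_beam as beam

# Placeholder values -- substitute your own project, region and GCS bucket.
options = {'project': 'your-project-id',
           'runner': 'DataflowRunner',
           'region': 'europe-west1',
           'temp_location': 'gs://your-bucket/temp'}
pipeline_options = beam.pipeline.PipelineOptions(flags=[], **options)
pipeline = beam.Pipeline(options=pipeline_options)

# Read with a standard SQL query (placeholder query and columns).
BQ_source = beam.io.BigQuerySource(
    query='SELECT field_1, field_2 FROM `your-project-id.your_dataset.source_table`',
    use_standard_sql=True)
BQ_data = pipeline | 'ReadFromBQ' >> beam.io.Read(BQ_source)

# Write to a target table, creating it if necessary (placeholder schema).
BQ_data | 'WriteToBQ' >> beam.io.WriteToBigQuery(
    'target_table',
    dataset='your_dataset',
    project='your-project-id',
    schema='field_1:STRING, field_2:INTEGER',
    create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
    write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE)

pipeline.run()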

Robbe
- Thanks Robbe for the response. But I need to use a Google auth JSON file to connect to the BigQuery tables I have created. Could you please point out the modifications that need to be made in the above code to achieve this? – Aditya Dixit Jan 24 '18 at 12:50
- Do you mean a service account? [When running a pipeline on Dataflow, it is automatically connected to the Dataflow service account](https://cloud.google.com/dataflow/security-and-permissions#dataflow-service-account). You can alter the permissions of this service account in the Cloud Console. – Robbe Jan 24 '18 at 12:57
- I just started Dataflow. Could you help out with the below: I executed a Python file with the below code on Cloud Shell – Aditya Dixit Jan 24 '18 at 13:37
- Getting Error: No handlers could be found for logger "oauth2client.contrib.multistore_file" Traceback (most recent call last): File "StackOverflow_B.py", line 12, in BQ_sink = beam.io.BigQuerySink(CustomerDetails_copy, dataset=StarPoc, project=star-poc) NameError: name 'CustomerDetails_copy' is not defined – Aditya Dixit Jan 24 '18 at 13:41
- If possible, could you please share a small, complete working example with some sample values for the "query" and "table name" filled in? – Aditya Dixit Jan 24 '18 at 14:33
- The [_Error: No handlers could be found for logger "oauth2client.contrib.multistore_file"_ is a harmless bug](https://github.com/GoogleCloudPlatform/DataflowPythonSDK/issues/34). You're getting a NameError because you are supplying a variable _CustomerDetails_copy_, which is not defined. Try supplying the name of your BQ table as a string instead. You can find documentation about the expected arguments [here](https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/bigquery.py#L446). – Robbe Jan 24 '18 at 15:12
- Providing a small snippet of the code I am executing: BQ_source = beam.io.BigQuerySource(query='SELECT * FROM star-poc.StarPoc.CustomerDetails') BQ_data = pipeline | beam.io.Read(BQ_source) BQ_data | beam.io.WriteToBigQuery("CustomerDetails_copy", dataset="StarPoc", project="star-poc"). But getting an error – Aditya Dixit Jan 25 '18 at 08:06
- (fe5999326acca3b7): Workflow failed. Causes: (fe5999326acca83f): S01:Read+WriteToBigQuery/WriteToBigQuery/NativeWrite failed., (da52e168e1918698): BigQuery execution failed., (da52e168e19189c9): Error: Message: Invalid table name: `star-poc:StarPoc.CustomerDetails` [Try using standard SQL – Aditya Dixit Jan 25 '18 at 08:08
- The runner is not able to read the source table. I have provided all the permissions that might be required to access the table. Could you please suggest what might be the issue? – Aditya Dixit Jan 25 '18 at 08:09
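The "Invalid table name" error in the comments above points at the table reference rather than at permissions: BigQuerySource runs legacy SQL by default, while a dotted reference to a project ID containing dashes (star-poc.StarPoc.CustomerDetails) needs standard SQL. A hedged sketch of the likely fix, reusing the names from the comments (not verified against the asker's project):

# Use standard SQL so the back-quoted project.dataset.table reference is accepted.
BQ_source = beam.io.BigQuerySource(
    query='SELECT * FROM `star-poc.StarPoc.CustomerDetails`',
    use_standard_sql=True)
BQ_data = pipeline | beam.io.Read(BQ_source)
# If CustomerDetails_copy does not exist yet, also pass a schema= argument here.
BQ_data | beam.io.WriteToBigQuery('CustomerDetails_copy',
                                  dataset='StarPoc',
                                  project='star-poc')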
3
Reading from BigQuery

rows = p | 'ReadFromBQ' >> beam.io.Read(
    beam.io.BigQuerySource(query=QUERY, use_standard_sql=True))

Writing to BigQuery

rows | 'WriteToBQ' >> beam.io.Write(
    beam.io.BigQuerySink('{}:{}.{}'.format(PROJECT, BQ_DATASET_ID, BQ_TEST),
                         schema='CONVERSATION:STRING, LEAD_ID:INTEGER',
                         create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
                         write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE))
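These fragments assume a Pipeline object p plus QUERY, PROJECT, BQ_DATASET_ID and BQ_TEST defined elsewhere. A hedged set of placeholder definitions (the names come from the snippet above, the values are illustrative only):

import apache_beam as beam

# Illustrative placeholder values for the names the snippets above rely on.
PROJECT = 'your-project-id'
BQ_DATASET_ID = 'your_dataset'
BQ_TEST = 'test_table'
QUERY = 'SELECT CONVERSATION, LEAD_ID FROM `your-project-id.your_dataset.source_table`'

pipeline_options = beam.pipeline.PipelineOptions(flags=[],
                                                 project=PROJECT,
                                                 runner='DataflowRunner',
                                                 region='europe-west1',
                                                 temp_location='gs://your-bucket/temp')
p = beam.Pipeline(options=pipeline_options)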
0
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Set the necessary pipeline options, such as project and job name
pipeline_options = PipelineOptions(
    project='your-project-id',
    job_name='dataflow-job',
    staging_location='gs://your-bucket/staging',
    temp_location='gs://your-bucket/temp',
    runner='DataflowRunner'
)

# Create a pipeline object using the options
p = beam.Pipeline(options=pipeline_options)

# Define a function to read data from BigQuery
def read_from_bigquery():
    return (p
            | 'Read from BigQuery' >> beam.io.ReadFromBigQuery(
                query='SELECT * FROM `your-project-id.your_dataset.source_table`',
                use_standard_sql=True))

# Define a function to write data to BigQuery
def write_to_bigquery(data):
    return (data
            | 'Write to BigQuery' >> beam.io.WriteToBigQuery(
                table='your-project-id:your_dataset.target_table',
                schema='column_1:STRING,column_2:INTEGER,column_3:BOOLEAN',
                write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE))

# Define your data processing logic. ReadFromBigQuery yields one dict per row,
# keyed by column name; keep the needed columns as dicts so WriteToBigQuery
# can map them onto the target schema.
data = (read_from_bigquery()
        | 'Process Data' >> beam.Map(lambda row: {
            'column_1': row['column_1'],
            'column_2': row['column_2'],
            'column_3': row['column_3'],
        }))

# Write the processed data to BigQuery
write_to_bigquery(data)

# Run the pipeline
p.run()
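One follow-up on running this: p.run() returns a PipelineResult immediately; if you want the script to block until the Dataflow job has completed, wait on the result:

result = p.run()
result.wait_until_finish()  # optional: block until the job finishes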

helper.s