
Could someone please share the syntax to read from and write to a BigQuery table in a pipeline written in Python for GCP Dataflow?

Aditya Dixit

3 Answers


Run on Dataflow

First, construct a Pipeline with the following options for it to run on GCP Dataflow:

import apache_beam as beam

options = {'project': <project>,
           'runner': 'DataflowRunner',
           'region': <region>,
           'setup_file': <setup.py file>}
pipeline_options = beam.pipeline.PipelineOptions(flags=[], **options)
pipeline = beam.Pipeline(options=pipeline_options)
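
After attaching the read and write steps described below, the job is launched by running the pipeline. A minimal sketch:

result = pipeline.run()
result.wait_until_finish()  # optional: block until the Dataflow job completes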

Read from BigQuery

Define a BigQuerySource with your query and use beam.io.Read to read data from BQ:

BQ_source = beam.io.BigQuerySource(query=<query>)
BQ_data = pipeline | beam.io.Read(BQ_source)
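
Each element of BQ_data is a Python dictionary keyed by column name, so downstream transforms can index into it. A small sketch, assuming a hypothetical column called name:

names = BQ_data | beam.Map(lambda row: row['name'])  # 'name' is an assumed column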

Write to BigQuery

There are two options to write to BigQuery:

  • use a BigQuerySink and beam.io.Write:

    BQ_sink = beam.io.BigQuerySink(<table>, dataset=<dataset>, project=<project>)
    BQ_data | beam.io.Write(BQ_sink)
    
  • use beam.io.WriteToBigQuery:

    BQ_data | beam.io.WriteToBigQuery(<table>, dataset=<dataset>, project=<project>)
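
If the destination table does not exist yet, both options also accept a schema and create/write dispositions. A minimal sketch with assumed column names:

BQ_data | beam.io.WriteToBigQuery(
    <table>, dataset=<dataset>, project=<project>,
    schema='name:STRING, value:INTEGER',  # assumed columns
    create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
    write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)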
    
Robbe
  • Thanks Robbe for the response. But I need to use a Google auth JSON file to connect to the BigQuery tables which I have created. Could you please point out the modifications that need to be made in the above code to achieve this? – Aditya Dixit Jan 24 '18 at 12:50
  • Do you mean a service account? [When running a pipeline on dataflow, it is automatically connected to the dataflow service account](https://cloud.google.com/dataflow/security-and-permissions#dataflow-service-account). You can alter the permissions of this service account in the Cloud Console. – Robbe Jan 24 '18 at 12:57
  • I just started with Dataflow. Could you help out with the below? I executed a Python file with the below code on Cloud Shell. – Aditya Dixit Jan 24 '18 at 13:37
  • Getting Error: No handlers could be found for logger "oauth2client.contrib.multistore_file" Traceback (most recent call last): File "StackOverflow_B.py", line 12, in BQ_sink = beam.io.BigQuerySink(CustomerDetails_copy, dataset=StarPoc, project=star-poc) NameError: name 'CustomerDetails_copy' is not defined – Aditya Dixit Jan 24 '18 at 13:41
  • If possible, could you please share a small, complete working example with some sample values for the query and table name filled in? – Aditya Dixit Jan 24 '18 at 14:33
  • The [_Error: No handlers could be found for logger "oauth2client.contrib.multistore_file"_ is a harmless bug](https://github.com/GoogleCloudPlatform/DataflowPythonSDK/issues/34). You're getting a NameError because you are supplying a variable _CustomerDetails_copy_, which is not defined. Try supplying the name of your BQ table as a string instead. You can find documentation about the expected arguments [here](https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/bigquery.py#L446). – Robbe Jan 24 '18 at 15:12
  • Providing a small snippet of the code I am executing: `BQ_source = beam.io.BigQuerySource(query='SELECT * FROM star-poc.StarPoc.CustomerDetails')` `BQ_data = pipeline | beam.io.Read(BQ_source)` `BQ_data | beam.io.WriteToBigQuery("CustomerDetails_copy", dataset="StarPoc", project="star-poc")`. But getting an error. – Aditya Dixit Jan 25 '18 at 08:06
  • (fe5999326acca3b7): Workflow failed. Causes: (fe5999326acca83f): S01:Read+WriteToBigQuery/WriteToBigQuery/NativeWrite failed., (da52e168e1918698): BigQuery execution failed., (da52e168e19189c9): Error: Message: Invalid table name: `star-poc:StarPoc.CustomerDetails` [Try using standard SQL – Aditya Dixit Jan 25 '18 at 08:08
  • The runner is not able to read the source table. I have provided all the permissions that might be required to access the table. Could you please suggest what might be the issue? – Aditya Dixit Jan 25 '18 at 08:09
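
Putting the fixes from this comment thread together, a corrected version of the snippet would look roughly like this (a sketch; the project, dataset, and table names are the ones mentioned in the comments and are assumed to exist):

# Standard SQL references tables as `project.dataset.table`;
# legacy SQL (the BigQuerySource default) expects [project:dataset.table]
BQ_source = beam.io.BigQuerySource(
    query='SELECT * FROM `star-poc.StarPoc.CustomerDetails`',
    use_standard_sql=True)
BQ_data = pipeline | beam.io.Read(BQ_source)

# The destination table name is passed as a string, not a bare identifier
BQ_data | beam.io.WriteToBigQuery(
    'CustomerDetails_copy', dataset='StarPoc', project='star-poc')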

Reading from BigQuery

rows = (p | 'ReadFromBQ' >> beam.io.Read(
    beam.io.BigQuerySource(query=QUERY, use_standard_sql=True)))

Writing to BigQuery

rows | 'writeToBQ' >> beam.io.Write(
    beam.io.BigQuerySink(
        '{}:{}.{}'.format(PROJECT, BQ_DATASET_ID, BQ_TEST),
        schema='CONVERSATION:STRING, LEAD_ID:INTEGER',
        create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
        write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE))
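
For the snippet above to run, the placeholders need values. A sketch with assumed names:

PROJECT = 'my-project-id'     # hypothetical project
BQ_DATASET_ID = 'my_dataset'  # hypothetical dataset
BQ_TEST = 'my_table'          # hypothetical destination table
QUERY = 'SELECT * FROM `my-project-id.my_dataset.source_table`'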
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Set the necessary pipeline options, such as project and job name
pipeline_options = PipelineOptions(
    project='your-project-id',
    job_name='dataflow-job',
    staging_location='gs://your-bucket/staging',
    temp_location='gs://your-bucket/temp',
    runner='DataflowRunner'
)

# Create a pipeline object using the options
p = beam.Pipeline(options=pipeline_options)

# Define a function to read data from BigQuery
def read_from_bigquery():
    return (p
            | 'Read from BigQuery' >> beam.io.ReadFromBigQuery(
                query='SELECT * FROM `your-project-id.your_dataset.source_table`',
                use_standard_sql=True)
            )

# Define a function to write data to BigQuery
def write_to_bigquery(data):
    return (data
            | 'Write to BigQuery' >> beam.io.WriteToBigQuery(
                table='your-project-id:your_dataset.target_table',
                schema='column_1:STRING,column_2:INTEGER,column_3:BOOLEAN',
                write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE)
            )

# Define your data processing logic. Each element read from BigQuery is a
# dict keyed by column name, and WriteToBigQuery expects dicts matching the
# target schema, so keep the elements as dicts.
data = (read_from_bigquery()
        | 'Process Data' >> beam.Map(lambda row: {
            'column_1': row['column_1'],
            'column_2': row['column_2'],
            'column_3': row['column_3']})
        )

# Write the processed data to BigQuery
write_to_bigquery(data)

# Run the pipeline and wait for it to finish
p.run().wait_until_finish()
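
To smoke-test the same pipeline locally before submitting it to Dataflow, you could swap the runner. A sketch (ReadFromBigQuery still needs a GCS temp_location for its export step; project and bucket are placeholders):

local_options = PipelineOptions(
    runner='DirectRunner',
    project='your-project-id',
    temp_location='gs://your-bucket/temp'
)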
