
I want to use data from the elements that flow through my pipeline to generate a query and execute it on BigQuery.

Let's say I have something like this Python SQL template:

template = '''
SELECT
  email
FROM
  `project_id.dataset_id.table_id`
WHERE
  email = '{runtime_email}'
'''

I want to format this template so that runtime_email comes from the pipeline data (the element).

For example, the pipeline reads the variable runtime_email from Pub/Sub with the value example@test.com.

And I will execute something like:

with beam.Pipeline(options=options) as p:
    bq_results = (p
        | LoadDataFromPubSub()
        | beam.io.Read(
            beam.io.BigQuerySource(
query=template.format(runtime_email=element['runtime_email']),
                use_standard_sql=True
            )
        )
    )

Any ideas about how I can leverage the pipeline data to run the next step?

Elon Salfati

1 Answer

The way you are building your pipeline is incorrect. Keep in mind that Beam first builds a graph and then executes it.

Here you define two sources: one Pub/Sub and one BigQuery. The BigQuery source is initialized before your pipeline starts, so your runtime_email will always be None: the query string is built at graph-construction time, before any element has been read.

You have two solutions:

  • Read from Pub/Sub before starting your pipeline. You can do it in your Python code or externally and provide the data through pipeline options. Then iterate over all the Pub/Sub messages and build as many BigQuery sources as you have messages (see the first sketch below).
  • Keep your Pub/Sub source in the pipeline and make a standard BigQuery call with the Python client library, not with Beam, to read the rows (see the second sketch below). This is the recommended way if you want to stream data.
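Here is a minimal sketch of the first option, assuming the google-cloud-pubsub v2 client; the subscription path, table name, and message count are hypothetical placeholders, and each message body is assumed to be the email value itself:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from google.cloud import pubsub_v1

SUBSCRIPTION = 'projects/project_id/subscriptions/subscription_id'  # hypothetical

# Pull a batch of messages synchronously, before the pipeline graph is built.
subscriber = pubsub_v1.SubscriberClient()
response = subscriber.pull(subscription=SUBSCRIPTION, max_messages=10)

emails = []
for received in response.received_messages:
    # Assumption: the message payload is the email string.
    emails.append(received.message.data.decode('utf-8'))
    subscriber.acknowledge(subscription=SUBSCRIPTION, ack_ids=[received.ack_id])

template = '''
SELECT email
FROM `project_id.dataset_id.table_id`
WHERE email = '{runtime_email}'
'''

options = PipelineOptions()
with beam.Pipeline(options=options) as p:
    # One BigQuery source per message, flattened into a single PCollection.
    sources = [
        p | f'ReadBQ{i}' >> beam.io.Read(beam.io.BigQuerySource(
            query=template.format(runtime_email=email),
            use_standard_sql=True))
        for i, email in enumerate(emails)
    ]
    bq_results = sources | beam.Flatten()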
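And a minimal sketch of the second, streaming option. It assumes the Pub/Sub messages are JSON with a runtime_email field and uses the google-cloud-bigquery client inside a DoFn; the subscription and table names are again placeholders. A query parameter replaces the string formatting, which also avoids quoting and injection issues:

import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from google.cloud import bigquery

class QueryBigQuery(beam.DoFn):
    """Runs one parameterized BigQuery query per incoming element."""

    def setup(self):
        # One client per worker, not one per element.
        self.client = bigquery.Client()

    def process(self, element):
        query = '''
        SELECT email
        FROM `project_id.dataset_id.table_id`
        WHERE email = @runtime_email
        '''
        job_config = bigquery.QueryJobConfig(query_parameters=[
            bigquery.ScalarQueryParameter(
                'runtime_email', 'STRING', element['runtime_email'])
        ])
        for row in self.client.query(query, job_config=job_config):
            yield dict(row)

options = PipelineOptions(streaming=True)
with beam.Pipeline(options=options) as p:
    bq_results = (p
        | 'ReadPubSub' >> beam.io.ReadFromPubSub(
            subscription='projects/project_id/subscriptions/subscription_id')
        | 'ParseJson' >> beam.Map(json.loads)
        | 'QueryBQ' >> beam.ParDo(QueryBigQuery())
    )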
guillaume blaquiere