
I want to use data from the elements that flow through my pipeline to generate a query and execute it on BigQuery.

Let's say I have something like this Python SQL template:

template = '''
SELECT
  email
FROM
  `project_id.dataset_id.table_id`
WHERE
  email = '{runtime_email}'
'''

I want to format this template so that runtime_email comes from the pipeline data (the element).

For example, the pipeline reads the variable runtime_email from Pub/Sub with the value example@test.com.

And I will execute something like:

with beam.Pipeline(options=options) as p:
    bq_results = (p
        | LoadDataFromPubSub()
        | beam.io.Read(
            beam.io.BigQuerySource(
query=template.format(runtime_email=element['runtime_email']),
                use_standard_sql=True
            )
        )
    )

Any ideas about how I can leverage the pipeline data to run the next step?

Elon Salfati

1 Answer

The way you are building your pipeline is incorrect. Keep in mind that Beam first builds a graph and then executes it.

Here you define two sources: one Pub/Sub and one BigQuery. The BigQuery source is initialized before your pipeline starts, so your runtime_email will always be None: the query string is built at graph-construction time, before any element has been read.

You have two solutions:

  • Read from Pub/Sub before starting your pipeline. You can do it in your Python code or externally and provide the data through pipeline options. Then iterate over all the Pub/Sub messages and build as many BigQuery sources as you have messages (see the first sketch below).
  • Keep your Pub/Sub source in the pipeline and make a standard BigQuery call with the Python client library, not with Beam, to read the rows (see the second sketch below). This is the recommended way if you want to stream data.
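Here is a minimal sketch of the first option, assuming the google-cloud-pubsub v2 client; the subscription path, table name, and message count are hypothetical placeholders, and each message body is assumed to be the email value itself:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from google.cloud import pubsub_v1

SUBSCRIPTION = 'projects/project_id/subscriptions/subscription_id'  # hypothetical

# Pull a batch of messages synchronously, before the pipeline graph is built.
subscriber = pubsub_v1.SubscriberClient()
response = subscriber.pull(subscription=SUBSCRIPTION, max_messages=10)

emails = []
for received in response.received_messages:
    # Assumption: the message payload is the email string.
    emails.append(received.message.data.decode('utf-8'))
    subscriber.acknowledge(subscription=SUBSCRIPTION, ack_ids=[received.ack_id])

template = '''
SELECT email
FROM `project_id.dataset_id.table_id`
WHERE email = '{runtime_email}'
'''

options = PipelineOptions()
with beam.Pipeline(options=options) as p:
    # One BigQuery source per message, flattened into a single PCollection.
    sources = [
        p | f'ReadBQ{i}' >> beam.io.Read(beam.io.BigQuerySource(
            query=template.format(runtime_email=email),
            use_standard_sql=True))
        for i, email in enumerate(emails)
    ]
    bq_results = sources | beam.Flatten()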
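And a minimal sketch of the second, streaming option. It assumes the Pub/Sub messages are JSON with a runtime_email field and uses the google-cloud-bigquery client inside a DoFn; the subscription and table names are again placeholders. A query parameter replaces the string formatting, which also avoids quoting and injection issues:

import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from google.cloud import bigquery

class QueryBigQuery(beam.DoFn):
    """Runs one parameterized BigQuery query per incoming element."""

    def setup(self):
        # One client per worker, not one per element.
        self.client = bigquery.Client()

    def process(self, element):
        query = '''
        SELECT email
        FROM `project_id.dataset_id.table_id`
        WHERE email = @runtime_email
        '''
        job_config = bigquery.QueryJobConfig(query_parameters=[
            bigquery.ScalarQueryParameter(
                'runtime_email', 'STRING', element['runtime_email'])
        ])
        for row in self.client.query(query, job_config=job_config):
            yield dict(row)

options = PipelineOptions(streaming=True)
with beam.Pipeline(options=options) as p:
    bq_results = (p
        | 'ReadPubSub' >> beam.io.ReadFromPubSub(
            subscription='projects/project_id/subscriptions/subscription_id')
        | 'ParseJson' >> beam.Map(json.loads)
        | 'QueryBQ' >> beam.ParDo(QueryBigQuery())
    )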
guillaume blaquiere