
I have a Dataflow job which splits up a single file into x records (tables). These flow into BigQuery no problem.

What I found, though, was that there was no way to then execute another stage in the pipeline following those writes.

For example:

# Collection1 - filtered on first two characters = 95
collection1 = (
    rows    | 'Build pCollection1' >> beam.Filter(lambda s: data_ingestion.filterRowCollection(s, '95'))
            | 'p1 Entities to JSON' >> beam.Map(lambda s: data_ingestion.SplitRowDict(s, '95'))
            | 'Load p1 to BIGQUERY' >> beam.io.WriteToBigQuery(
                    data_ingestion.spec1,
                    schema=parse_table_schema_from_json(data_ingestion.getBqSchema('95')),
                    write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE,
                    create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED) # Write to BigQuery
            )

# Collection2 - filtered on first two characters = 99
collection2 = (
    rows    | 'Build pCollection2' >> beam.Filter(lambda s: data_ingestion.filterRowCollection(s, '99'))
            | 'p2 Split Entities to JSON' >> beam.Map(lambda s: data_ingestion.SplitRowDict(s, '99'))
            | 'Load p2 to BIGQUERY' >> beam.io.WriteToBigQuery(
                    data_ingestion.spec2,
                    schema=parse_table_schema_from_json(data_ingestion.getBqSchema('99')),
                    write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE,
                    create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED) # Write to BigQuery
            )

After the above, I'd like to run something like this:

final_output = (
    (collection1, collection2)
       | 'Merge' >> beam.Flatten()
       | 'Log Completion' >> beam.io.WriteToPubSub('<topic>'))

Is there any way to run another part of the pipeline following the upsert to BigQuery, or is this impossible? Thanks in advance.

YetiBoy

1 Answer


Technically, there's no way to do exactly what you asked. beam.io.WriteToBigQuery consumes the PCollection, leaving nothing behind to transform.

However, it's simple to duplicate the input to beam.io.WriteToBigQuery in a ParDo just before you call it, and to send a copy of your PCollection down each path. See this answer, which references this sample DoFn from the docs.
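A rough sketch of that idea, reusing the '95' branch from your question (DupToDownstream and the 'downstream'/'to_bq' tags are names I've made up, and I'm assuming SplitRowDict yields JSON-serializable dicts):

import json

import apache_beam as beam
from apache_beam import pvalue
from apache_beam.io.gcp.bigquery_tools import parse_table_schema_from_json


class DupToDownstream(beam.DoFn):
    """Emit each element twice: once on the main output (bound for
    BigQuery) and once on a tagged output that stays available to the
    rest of the pipeline."""
    def process(self, element):
        yield element  # main output, consumed by WriteToBigQuery
        yield pvalue.TaggedOutput('downstream', element)  # surviving copy


split = (
    rows    | 'Build pCollection1' >> beam.Filter(
                  lambda s: data_ingestion.filterRowCollection(s, '95'))
            | 'p1 Entities to JSON' >> beam.Map(
                  lambda s: data_ingestion.SplitRowDict(s, '95'))
            | 'Duplicate p1' >> beam.ParDo(DupToDownstream()).with_outputs(
                  'downstream', main='to_bq'))

# One copy is written to BigQuery exactly as before...
split.to_bq | 'Load p1 to BIGQUERY' >> beam.io.WriteToBigQuery(
    data_ingestion.spec1,
    schema=parse_table_schema_from_json(data_ingestion.getBqSchema('95')),
    write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE,
    create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED)

# ...while the other copy is still a PCollection you can keep building on.
# WriteToPubSub expects bytes, hence the json.dumps/encode step.
(split.downstream
    | 'To bytes' >> beam.Map(lambda d: json.dumps(d).encode('utf-8'))
    | 'Log Completion' >> beam.io.WriteToPubSub('<topic>'))

Note that the duplicated branch runs in parallel with the BigQuery write rather than strictly after it finishes, which is why there's no way to do exactly what you asked.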

Steven Ensslen
  • Ok - thanks for that. I'll have a look at splitting the outputs so as to retain a PCollection after the upsert. – YetiBoy Nov 16 '20 at 09:19