
I have a Dataflow job which splits up a single file into x records (tables). These flow into BigQuery no problem.

What I found, though, was that there was no way to then execute another stage in the pipeline following those writes.

For example:

# Collection1 - filtered on first two characters = 95
collection1 = (
    rows    | 'Build pCollection1' >> beam.Filter(lambda s: data_ingestion.filterRowCollection(s, '95'))
            | 'p1 Entities to JSON' >> beam.Map(lambda s: data_ingestion.SplitRowDict(s, '95'))
            | 'Load p1 to BIGQUERY' >> beam.io.WriteToBigQuery(
                    data_ingestion.spec1,
                    schema=parse_table_schema_from_json(data_ingestion.getBqSchema('95')),
                    write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE,
                    create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED) # Write to BigQuery
            )

# Collection2 - filtered on first two characters = 99
collection2 = (
    rows    | 'Build pCollection2' >> beam.Filter(lambda s: data_ingestion.filterRowCollection(s, '99'))
            | 'p2 Split Entities to JSON' >> beam.Map(lambda s: data_ingestion.SplitRowDict(s, '99'))
            | 'Load p2 to BIGQUERY' >> beam.io.WriteToBigQuery(
                    data_ingestion.spec2,
                    schema=parse_table_schema_from_json(data_ingestion.getBqSchema('99')),
                    write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE,
                    create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED) # Write to BigQuery
            )

After the above, I'd like to run something like this:

final_output = (
    (collection1, collection2)
       | 'Merge' >> beam.Flatten()
       | 'Log Completion' >> beam.io.WriteToPubSub('<topic>'))

Is there any way to run another part of the pipeline following the upsert to BigQuery, or is this impossible? Thanks in advance.

YetiBoy

1 Answer


Technically, there's no way to do exactly what you asked. beam.io.WriteToBigQuery consumes the PCollection, leaving nothing behind to transform.

However, it's simple to duplicate the input to beam.io.WriteToBigQuery in a ParDo just before you call it, and to send a copy of your PCollection down each path. See this answer, which references this sample DoFn from the docs.
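A rough sketch of that idea, reusing the '95' branch from your question (DupToDownstream and the 'downstream'/'to_bq' tags are names I've made up, and I'm assuming SplitRowDict yields JSON-serializable dicts):

import json

import apache_beam as beam
from apache_beam import pvalue
from apache_beam.io.gcp.bigquery_tools import parse_table_schema_from_json


class DupToDownstream(beam.DoFn):
    """Emit each element twice: once on the main output (bound for
    BigQuery) and once on a tagged output that stays available to the
    rest of the pipeline."""
    def process(self, element):
        yield element  # main output, consumed by WriteToBigQuery
        yield pvalue.TaggedOutput('downstream', element)  # surviving copy


split = (
    rows    | 'Build pCollection1' >> beam.Filter(
                  lambda s: data_ingestion.filterRowCollection(s, '95'))
            | 'p1 Entities to JSON' >> beam.Map(
                  lambda s: data_ingestion.SplitRowDict(s, '95'))
            | 'Duplicate p1' >> beam.ParDo(DupToDownstream()).with_outputs(
                  'downstream', main='to_bq'))

# One copy is written to BigQuery exactly as before...
split.to_bq | 'Load p1 to BIGQUERY' >> beam.io.WriteToBigQuery(
    data_ingestion.spec1,
    schema=parse_table_schema_from_json(data_ingestion.getBqSchema('95')),
    write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE,
    create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED)

# ...while the other copy is still a PCollection you can keep building on.
# WriteToPubSub expects bytes, hence the json.dumps/encode step.
(split.downstream
    | 'To bytes' >> beam.Map(lambda d: json.dumps(d).encode('utf-8'))
    | 'Log Completion' >> beam.io.WriteToPubSub('<topic>'))

Note that the duplicated branch runs in parallel with the BigQuery write rather than strictly after it finishes, which is why there's no way to do exactly what you asked.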

Steven Ensslen
  • Ok - thanks for that. I'll have a look at splitting the outputs so as to retain a PCollection after the upsert. – YetiBoy Nov 16 '20 at 09:19