
I'm running an hourly Dataflow job that reads records from a source table, processes them, and writes them to a target table. Since some records may repeat in the source table, we compute a hash value from the record fields of interest, append it to the source records read into memory, and filter out records whose hash already exists in the target table (the hash value is stored in the target table). This is meant to avoid duplication across jobs triggered at different times. To avoid duplication within the same job, we use Apache Beam's GroupByKey, with the hash value as the key, and keep only the first element of each group. However, duplicates still appear in BigQuery. My only hunch is that multiple workers handling the same job might be out of sync and process the same data, but since I'm using a single pipeline all the way through, that assumption sounds unreasonable (at least to me). Does anyone have an idea why the problem persists?
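For reference, a simplified sketch of what the add_hash_field helper used in the job below might look like (the field names and the choice of SHA-256 here are placeholders, not the actual implementation):

import hashlib
import json

def add_hash_field(record):
    # Build a stable string from the fields of interest
    # (field names here are placeholders).
    fields_of_interest = {k: record[k] for k in ('field_a', 'field_b') if k in record}
    serialized = json.dumps(fields_of_interest, sort_keys=True, default=str)
    # Store the hash alongside the original record; this is the value
    # later written to the target table and used for dedup.
    record['HashValue'] = hashlib.sha256(serialized.encode('utf-8')).hexdigest()
    return record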

Here's the job which creates the duplication:

import apache_beam as beam

with beam.Pipeline(options=options) as p:
    # read fields of interest from the source table
    records = p | 'Read Records from BigQuery' >> beam.io.ReadFromBigQuery(
        query=read_from_source_query, use_standard_sql=True)

    # step 1 - filter out records that already exist in the target table

    # read existing hashes from the target table
    hashes = (
        p
        | 'read existing hashes from the target table' >> beam.io.ReadFromBigQuery(
            query=select_hash_value_from_target_table, use_standard_sql=True)
        | 'Get vals' >> beam.Map(lambda row: row['HashValue'])
    )

    # add a hash value to each record and filter out the ones that already exist in the target table
    hashed_records = (
        records
        | 'Add Hash Column in Memory to Each source table Record' >> beam.Map(add_hash_field)
        | 'Filter Existing Hashes' >> beam.Filter(
            lambda record, hashes: record['HashValue'] not in hashes,
            hashes=beam.pvalue.AsIter(hashes))
    )

    # step 2 - filter duplicated hashes created within the same job
    key_val_records = (
        hashed_records
        | 'Create a Key Value Pair' >> beam.Map(lambda record: (record['HashValue'], record))
    )

    # group elements with the same key and keep only one of them
    unique_hashed_records = (
        key_val_records
        | 'Combine the Same Hashes' >> beam.GroupByKey()
        # grouped values are an iterable, so take the first one
        | 'Get First Element in Collection' >> beam.Map(lambda element: next(iter(element[1])))
    )

    records_to_store = (
        unique_hashed_records
        | 'Create Records to Store' >> beam.ParDo(CreateTargetTableRecord(gal_options))
    )

    records_to_store | 'Write to target table' >> beam.io.WriteToBigQuery(target_table)
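To rule out the in-job dedup step itself, here is a minimal, self-contained sketch of the same GroupByKey-and-pick-first pattern run locally on in-memory placeholder records (not the real schema); running it prints one record per hash:

import apache_beam as beam

# placeholder records with a duplicate hash within the same run
records = [
    {'HashValue': 'h1', 'val': 1},
    {'HashValue': 'h1', 'val': 2},
    {'HashValue': 'h2', 'val': 3},
]

with beam.Pipeline() as p:
    (
        p
        | beam.Create(records)
        | beam.Map(lambda r: (r['HashValue'], r))
        | beam.GroupByKey()
        | beam.Map(lambda kv: next(iter(kv[1])))  # keep one record per hash
        | beam.Map(print)
    )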

As the code above suggests, I expected no duplicates in the target table, but I'm still getting duplicates.

  • We're having a similar challenge with BQ and Pub/Sub, and we solved it by allowing duplicates in BQ and de-duplicating the data in queries. BR – Roar S. Jan 15 '23 at 16:51
  • You may have to identify where the duplicates are coming from. Sometimes they mean that an operation had to be retried due to not finishing on time or erroring. But as Roar mentioned, you can deduplicate in the query using analytic functions/RANK, or write to an intermediate table and replicate to your final table using the MERGE DML statement. – Bruno Volpato Jan 15 '23 at 17:18
  • See this question and the article discussed there: https://stackoverflow.com/questions/52520614/what-does-reshuffling-in-the-context-of-exactly-once-processing-in-bigquery-sin/52561828#52561828 – Michael Entin Jan 15 '23 at 21:57
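Following the comments above, one option is a query-side dedup: allow duplicates in the target table and select only one row per HashValue when reading from it. A sketch of such a query, kept as a Python string in the style of the queries above (the dataset/table name and the insertion_time ordering column are placeholders; only HashValue comes from the pipeline):

# query-side dedup, as suggested in the comments
dedup_query = """
SELECT * EXCEPT(row_num)
FROM (
    SELECT
        *,
        ROW_NUMBER() OVER (PARTITION BY HashValue ORDER BY insertion_time DESC) AS row_num
    FROM `my_project.my_dataset.target_table`
)
WHERE row_num = 1
"""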

0 Answers