I'm running an hourly Dataflow job that reads records from a source table, processes them, and writes them to a target table. Since some records may repeat in the source table, we compute a hash value based on the record fields of interest, append it in memory to each record read from the source table, and filter out records whose hash already exists in the target table (the hash value is stored in the target table). This way we aim to avoid duplicates across different jobs (triggered at different times).

To avoid duplicates within the same job, we use Apache Beam's GroupByKey transform, with the hash value as the key, and pick only the first element of each group. However, duplicates still show up in BigQuery. My only hunch is that multiple workers handling the same job might be out of sync and process the same data, but since I'm using a single pipeline all the way through, that assumption sounds unreasonable (at least to me). Does anyone have an idea why the problem persists?
Here's the job that produces the duplicates:
with beam.Pipeline(options=options) as p:
    # read fields of interest from the source table
    records = p | 'Read Records from BigQuery' >> beam.io.ReadFromBigQuery(
        query=read_from_source_query, use_standard_sql=True)

    # step 1 - filter out already existing records
    # read existing hashes from the target table
    hashes = (
        p
        | 'read existing hashes from the target table' >> beam.io.ReadFromBigQuery(
            query=select_hash_value_from_target_table, use_standard_sql=True)
        | 'Get vals' >> beam.Map(lambda row: row['HashValue'])
    )

    # add a hash value to each record and filter out the ones that already exist in the target table
    hashed_records = (
        records
        | 'Add Hash Column in Memory to Each source table Record' >> beam.Map(add_hash_field)
        | 'Filter Existing Hashes' >> beam.Filter(
            lambda record, hashes: record['HashValue'] not in hashes,
            hashes=beam.pvalue.AsIter(hashes))
    )

    # step 2 - filter duplicated hashes created in the same job
    key_val_records = (
        hashed_records
        | 'Create a Key Value Pair' >> beam.Map(lambda record: (record['HashValue'], record))
    )

    # group elements with the same key and keep only one of them
    unique_hashed_records = (
        key_val_records
        | 'Combine the Same Hashes' >> beam.GroupByKey()
        | 'Get First Element in Collection' >> beam.Map(lambda element: element[1][0])
    )

    records_to_store = (
        unique_hashed_records
        | 'Create Records to Store' >> beam.ParDo(CreateTargetTableRecord(gal_options))
    )

    records_to_store | 'Write to target table' >> beam.io.WriteToBigQuery(target_table)
Given the code above, I expected no duplicates in the target table, but I'm still getting duplicate records.
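For reference, the add_hash_field helper is conceptually something like the sketch below. The actual field names and the choice of SHA-256 here are placeholders for illustration; the real helper just builds a deterministic hash over the fields of interest and attaches it as HashValue:

import hashlib

# Hypothetical field names - the real job uses its own "fields of interest".
FIELDS_OF_INTEREST = ['field_a', 'field_b', 'field_c']

def add_hash_field(record):
    # Build a deterministic string from the fields of interest and hash it,
    # then attach the digest to the record under the 'HashValue' key.
    key = '|'.join(str(record.get(f, '')) for f in FIELDS_OF_INTEREST)
    record['HashValue'] = hashlib.sha256(key.encode('utf-8')).hexdigest()
    return record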