My team has a Beam pipeline where we're writing an unbounded PCollection of domain objects to BigQuery using BigQueryIO.write(). We transform the domain objects into TableRow objects inside BigQueryIO.write().withFormatFunction(). WriteResult.getSuccessfulInserts() gives us a PCollection of the TableRows that were successfully written to BigQuery, but we would rather have access to the original domain objects again, specifically only the domain objects that were successfully written.
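For context, the write itself looks roughly like this. It's a simplified sketch, not our exact code: DomainObj, its accessors, the field names, and the table reference are placeholders.

```java
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.WriteResult;
import org.apache.beam.sdk.values.PCollection;

// (inside our pipeline-building code)
PCollection<DomainObj> domainObjects = ...; // our unbounded input

WriteResult writeResult = domainObjects.apply("WriteToBigQuery",
    BigQueryIO.<DomainObj>write()
        .to("our-project:our_dataset.our_table")
        // Domain objects are converted to TableRows here, inside the sink.
        .withFormatFunction(obj -> new TableRow()
            .set("id", obj.getId())
            .set("value", obj.getValue()))
        .withMethod(BigQueryIO.Write.Method.STREAMING_INSERTS)
        .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
        .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));

// What we get back are TableRows, not our DomainObjs:
PCollection<TableRow> successfulRows = writeResult.getSuccessfulInserts();
```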
We've come up with one solution where we add a groupingKey field to the domain objects and put that groupingKey into the TableRows in the withFormatFunction call. This allows us to take the original input (PCollection<DomainObj>) and transform it into a PCollection<KV<String, DomainObj>> where the String key is the groupingKey, transform the output of writeResult.getSuccessfulInserts() into a PCollection<KV<String, TableRow>> keyed by the same groupingKey, and then do a CoGroupByKey operation on those keys to pair each DomainObj with its successfully written TableRow; we can then drop the TableRows and end up with the PCollection of successfully written DomainObjects (a rough code sketch of this follows the list below). There are a couple of reasons why this solution is undesirable:
- We thought of using the BigQueryIO.write().ignoreUnknownValues() option to ensure the new groupingKey field we added to the TableRows doesn't end up in our BQ tables. This is a problem because our BigQuery schema is altered from time to time by an upstream application, and there are occasional instances where we want unknown fields to be written to the table (we just don't want this groupingKey in the table).
- The CoGroupByKey operation requires equal windowing on its inputs, and it's possible that the BigQueryIO.write operation could exceed that window length. This would lead to us having to come up with complex solutions to handle items arriving past their window deadline.
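Here is the rough sketch of that workaround mentioned above. It continues from the earlier snippet (same domainObjects and writeResult), and it assumes things that aren't shown there: the format function also does .set("groupingKey", obj.getGroupingKey()), DomainObj exposes getGroupingKey(), and both inputs have compatible windowing. Additional imports would come from org.apache.beam.sdk.transforms, org.apache.beam.sdk.transforms.join, and org.apache.beam.sdk.values.

```java
final TupleTag<DomainObj> domainTag = new TupleTag<>();
final TupleTag<TableRow> rowTag = new TupleTag<>();

// Key the original input by its groupingKey.
PCollection<KV<String, DomainObj>> keyedDomain = domainObjects.apply("KeyDomainObjs",
    WithKeys.of((DomainObj obj) -> obj.getGroupingKey())
        .withKeyType(TypeDescriptors.strings()));

// Key the successfully written rows by the groupingKey we smuggled into the TableRow.
PCollection<KV<String, TableRow>> keyedRows = writeResult.getSuccessfulInserts()
    .apply("KeyTableRows",
        WithKeys.of((TableRow row) -> (String) row.get("groupingKey"))
            .withKeyType(TypeDescriptors.strings()));

// Join on the key and keep only the domain objects that have a matching written row.
PCollection<DomainObj> successfullyWritten =
    KeyedPCollectionTuple.of(domainTag, keyedDomain)
        .and(rowTag, keyedRows)
        .apply("JoinOnGroupingKey", CoGroupByKey.create())
        .apply("KeepWrittenDomainObjs",
            ParDo.of(new DoFn<KV<String, CoGbkResult>, DomainObj>() {
              @ProcessElement
              public void processElement(ProcessContext c) {
                CoGbkResult joined = c.element().getValue();
                // Only emit the original object if a successful insert came back for its key.
                if (joined.getAll(rowTag).iterator().hasNext()) {
                  for (DomainObj obj : joined.getAll(domainTag)) {
                    c.output(obj);
                  }
                }
              }
            }));
```

This works in the happy path, but it's exactly the extra groupingKey plumbing and the CoGroupByKey windowing constraint described above that we'd like to avoid.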
Are there any more elegant solutions to write an unbounded PCollection of domain objects to BigQuery and end up with a PCollection of just the successfully written domain objects? Solutions that don't involve storing extra information in the TableRows are preferred. Thank you.