My team has a Beam pipeline where we're writing an unbounded PCollection of domain objects to BigQuery using BigQueryIO.write(). We transform the domain objects into TableRow objects inside the function passed to BigQueryIO.write().withFormatFunction(). WriteResult.getSuccessfulInserts() gives us a PCollection of the TableRows that were successfully written to BigQuery, but we would rather have access to the original domain objects again, specifically only the domain objects that were successfully written.
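
For context, here is a minimal sketch of the setup described above; DomainObj, toTableRow(), and the table spec are hypothetical stand-ins for our actual types and configuration:

```java
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.WriteResult;
import org.apache.beam.sdk.values.PCollection;

// domainObjects is our unbounded PCollection<DomainObj>; toTableRow() is the
// conversion we currently perform inside withFormatFunction().
WriteResult writeResult = domainObjects.apply("WriteToBQ",
    BigQueryIO.<DomainObj>write()
        .to("my-project:my_dataset.my_table") // hypothetical table spec
        .withFormatFunction(obj -> toTableRow(obj))
        .withMethod(BigQueryIO.Write.Method.STREAMING_INSERTS));

// All we get back are the converted TableRows, not the original DomainObjs.
PCollection<TableRow> successfulRows = writeResult.getSuccessfulInserts();
```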

We've come up with one solution where we add a groupingKey field to the domain objects and copy that groupingKey into the TableRows inside the withFormatFunction call. This lets us transform the original input (PCollection<DomainObj>) into a PCollection<KV<String, DomainObj>> keyed on the groupingKey, transform the output of writeResult.getSuccessfulInserts() into a PCollection<KV<String, TableRow>> keyed the same way, and then join the two with a CoGroupByKey operation. From the joined result we drop the TableRows and keep only the matched domain objects, which leaves us with the PCollection of successfully written DomainObjects (see the sketch after this list). There are a couple of reasons why this solution is undesirable:

  1. We thought of using the BigQueryIO.write().ignoreUnknownValues() option to ensure the new groupingKey field we added to the TableRows doesn't end up in our BigQuery tables. This is a problem because our BigQuery schema is altered from time to time by an upstream application, and there are occasional instances where we do want unknown fields to be written to the table (we just don't want this groupingKey in it).
  2. The CoGroupByKey operation requires the same windowing on both of its inputs, and it's possible for the BigQueryIO.write operation to exceed that window length. That would force us to come up with complex solutions to handle items arriving after their window has closed.
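
For reference, here is roughly what that join looks like. This is only a sketch: the getGroupingKey() accessor, the "groupingKey" field name in the TableRow, and the five-minute fixed window are illustrative assumptions, not our production values:

```java
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.WithKeys;
import org.apache.beam.sdk.transforms.join.CoGbkResult;
import org.apache.beam.sdk.transforms.join.CoGroupByKey;
import org.apache.beam.sdk.transforms.join.KeyedPCollectionTuple;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TupleTag;
import org.apache.beam.sdk.values.TypeDescriptors;
import org.joda.time.Duration;

final TupleTag<DomainObj> domainTag = new TupleTag<DomainObj>() {};
final TupleTag<TableRow> rowTag = new TupleTag<TableRow>() {};

// Key the original input by the groupingKey we added to each domain object,
// and window both sides identically so CoGroupByKey accepts them.
PCollection<KV<String, DomainObj>> keyedDomain = domainObjects
    .apply("KeyDomain", WithKeys.of((DomainObj o) -> o.getGroupingKey())
        .withKeyType(TypeDescriptors.strings()))
    .apply("WindowDomain", Window.<KV<String, DomainObj>>into(
        FixedWindows.of(Duration.standardMinutes(5))));

// Key the successful inserts by the groupingKey we smuggled into each row.
PCollection<KV<String, TableRow>> keyedRows = writeResult.getSuccessfulInserts()
    .apply("KeyRows", WithKeys.of((TableRow r) -> (String) r.get("groupingKey"))
        .withKeyType(TypeDescriptors.strings()))
    .apply("WindowRows", Window.<KV<String, TableRow>>into(
        FixedWindows.of(Duration.standardMinutes(5))));

// Join on the key, then drop the TableRows and keep only the domain objects
// that have at least one matching successfully written row.
PCollection<DomainObj> successfulDomainObjs = KeyedPCollectionTuple
    .of(domainTag, keyedDomain)
    .and(rowTag, keyedRows)
    .apply("JoinOnGroupingKey", CoGroupByKey.create())
    .apply("DropTableRows",
        ParDo.of(new DoFn<KV<String, CoGbkResult>, DomainObj>() {
          @ProcessElement
          public void processElement(ProcessContext c) {
            CoGbkResult joined = c.element().getValue();
            if (joined.getAll(rowTag).iterator().hasNext()) {
              for (DomainObj obj : joined.getAll(domainTag)) {
                c.output(obj);
              }
            }
          }
        }));
```

This is also exactly where problem 2 bites: if the BigQueryIO.write step outlasts the window, the TableRow side of the join arrives after the domain-object side's window has already closed.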

Are there any more elegant solutions to write an unbounded PCollection of domain objects to BigQuery and end up with a PCollection of just the successfully written domain objects? Solutions that don't involve storing extra information in the TableRows are preferred. Thank you.

  • Great observations! Sorry about the delay on this - I think **currently** what you outline is the only option. This sounds like it could be a feature request for the Beam community: add support to recover the original data. – Pablo Jun 14 '22 at 22:23
  • @Pablo We are attempting the inverse. We have the same use case, but with writeResult.getFailedInsertsWithErr() instead of writeResult.getSuccessfulInserts(). We want to use the original domain object to populate some fields in our dead-letter queue. Did you happen to find any alternatives to the one Joshua used above? – Roy Lara Jun 28 '22 at 21:25
  • I am also interested to know if you finally found a solution, thank you – Valentin Richer Dec 07 '22 at 14:25

0 Answers