I'm generating the same type of object (events of my process) in multiple transforms. The input of the pipeline is FileIO.MatchAll, so the PCollections are unbounded. I then collect the event outputs into a PCollectionList and Flatten them so that I can apply BigQueryIO.Write only once.
PCollection<MatchResult.Metadata> files = pipeline.apply(
    FileIO.match()
        .withEmptyMatchTreatment(EmptyMatchTreatment.ALLOW)
        .filepattern(pattern)
        .continuously(Duration.standardSeconds(60), Watch.Growth.never())
);
PCollectionTuple t1 = files.apply(transform1);
PCollectionTuple t2 = t1.get(RESULTS).apply(transform2);

PCollection<Event> events = PCollectionList
    .of(t1.get(EVENTS))
    .and(t2.get(EVENTS))
    .apply(Flatten.pCollections());
events.apply(
    BigQueryIO.write().to(targetDynamicDestinations)
        .withCreateDisposition(CreateDisposition.CREATE_NEVER)
        .withWriteDisposition(WriteDisposition.WRITE_APPEND)
        .withMethod(Method.FILE_LOADS)
        .withFormatFunction(...)
        .withTriggeringFrequency(Duration.standardSeconds(60))
);
When I run the pipeline on Dataflow, the only successful write to BigQuery happens after the first poll by FileIO.MatchAll. After that, Flatten never outputs another record.
The problem occurs only when I use Dataflow Runner v2 (`--experiments=use_runner_v2`); Flatten works as expected with Runner v1. I'm using the Java SDK, version 2.49.0.
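For context, the job is launched roughly like this. The project, region, and bucket names are placeholders; the only flag relevant to the problem is `--experiments=use_runner_v2`:

```shell
# Hypothetical launch command; project/region/paths are placeholders.
# Dropping --experiments=use_runner_v2 (falling back to Runner v1)
# makes Flatten behave as expected.
mvn compile exec:java \
  -Dexec.mainClass=com.example.MyPipeline \
  -Dexec.args="--runner=DataflowRunner \
    --project=my-project \
    --region=us-central1 \
    --tempLocation=gs://my-bucket/tmp \
    --experiments=use_runner_v2"
```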