
Several transforms in my pipeline emit objects of the same type (they are events of my process). The input of the pipeline is FileIO.MatchAll, so the PCollections are unbounded. I collect the event outputs in a PCollectionList and Flatten them so that I apply BigQueryIO.Write only once.

PCollection<MatchResult.Metadata> files = pipeline.apply(
    FileIO.match()
        .withEmptyMatchTreatment(EmptyMatchTreatment.ALLOW)
        .filepattern(pattern)
        .continuously(Duration.standardSeconds(60), Watch.Growth.never())
);
PCollectionTuple t1 = files.apply(transform1);
PCollectionTuple t2 = t1.get(RESULTS).apply(transform2);

PCollection<Event> events = PCollectionList
    .of(t1.get(EVENTS))
    .and(t2.get(EVENTS))
    .apply(Flatten.pCollections());

events.apply(
    BigQueryIO.write()
        .to(targetDynamicDestinations)
        .withCreateDisposition(CreateDisposition.CREATE_NEVER)
        .withWriteDisposition(WriteDisposition.WRITE_APPEND)
        .withMethod(Method.FILE_LOADS)
        .withFormatFunction(...)
        .withTriggeringFrequency(Duration.standardSeconds(60))
);

When I run the pipeline on Dataflow, the only successful write to BigQuery happens after the first poll by FileIO.MatchAll. After that, Flatten doesn't output any more records.

The problem occurs only with Dataflow Runner v2 (--experiments=use_runner_v2). Flatten works as expected with Runner v1. I use the Java SDK (2.49.0).

skalski
  • Is the Dataflow runner giving you any error messages? Could you elaborate on what you're trying to do? Which programming language and input source are you using? – kiran mathew Aug 16 '23 at 13:55
  • @kiranmathew I don't see any errors. I added a code snippet to show better what I want to do. And regarding Runner v2, my motivation behind the use of it is that Runner v1 doesn't support draining jobs using SplittableDoFn (https://stackoverflow.com/a/76544085/7241824). Flatten works as expected with Runner v1. – skalski Aug 21 '23 at 12:58
  • Can you try again after [adding](https://cloud.google.com/dataflow/docs/guides/logging#python) some logs to your code to confirm that there is no issue with the PCollection? Maybe the root cause of the issue is due to the PCollection. – kiran mathew Aug 28 '23 at 13:41
  • @kiranmathew Thanks for the suggestion. Indeed the problem isn't caused by Flatten but by BigQueryIO.Write. https://github.com/apache/beam/issues/28219 – skalski Aug 30 '23 at 11:31
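
The logging check suggested in the comments could look roughly like the sketch below: a pass-through DoFn that logs every element and its timestamp, so that the output of Flatten can be compared with what reaches BigQueryIO.Write. `LogElementsFn`, its `describe` helper and the stage label are hypothetical names introduced for illustration; they are not part of Beam or of the original pipeline. The sketch assumes the Beam Java SDK, Joda-Time and SLF4J are on the classpath.

```java
import org.apache.beam.sdk.transforms.DoFn;
import org.joda.time.Instant;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/** Hypothetical pass-through DoFn: logs each element, then emits it unchanged. */
class LogElementsFn<T> extends DoFn<T, T> {
  private static final Logger LOG = LoggerFactory.getLogger(LogElementsFn.class);

  // Label identifying where in the pipeline this logger sits, e.g. "after-flatten".
  private final String stage;

  LogElementsFn(String stage) {
    this.stage = stage;
  }

  /** Builds the log line; kept static so it can be checked without running a pipeline. */
  static String describe(String stage, Object element, Object timestamp) {
    return "[" + stage + "] element=" + element + " timestamp=" + timestamp;
  }

  @ProcessElement
  public void processElement(
      @Element T element, @Timestamp Instant timestamp, OutputReceiver<T> out) {
    LOG.info(describe(stage, element, timestamp));
    // Forward the element untouched so downstream transforms see the same data.
    out.output(element);
  }
}
```

It could be applied between Flatten and the write, for example `events.apply("LogAfterFlatten", ParDo.of(new LogElementsFn<Event>("after-flatten"))).setCoder(events.getCoder())`. The explicit `setCoder` call is a precaution in case the coder cannot be inferred for the generic DoFn.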
