
I am currently working on a Beam pipeline (Beam 2.23, Flink runner 1.8) where we read JSON events from Kafka and write the output in Parquet format to S3.

We write to S3 every 10 minutes.

We have observed that the pipeline sometimes stops writing to S3 after we make minor, non-breaking code changes and redeploy. If we change the Kafka offset and restart the pipeline, it starts writing to S3 again.

While FileIO is not writing to S3, the pipeline runs fine without any error or exception and processes records up to the FileIO stage. Nothing shows up in the logs; it just silently stops emitting anything at the FileIO stage.

The watermark also does not progress for that stage; it stays at the time the pipeline was stopped for the deploy (the savepoint time).

We have checked our windowing by logging records after the Window transform; windowing works fine.
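
For reference, this is roughly how we verified it; the `LogWindowedRecords` class and the names below are just something we added for debugging, not part of the pipeline shown further down:

import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.windowing.BoundedWindow;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Debug-only DoFn: logs each record together with the window it was assigned to,
// then passes it through unchanged.
static class LogWindowedRecords extends DoFn<GenericRecord, GenericRecord> {
    private static final Logger LOG = LoggerFactory.getLogger(LogWindowedRecords.class);

    @ProcessElement
    public void processElement(ProcessContext c, BoundedWindow window) {
        LOG.info("window={} timestamp={} record={}", window, c.timestamp(), c.element());
        c.output(c.element());
    }
}

// Applied right after the "Batch Events" Window transform while debugging:
// windowedRecords.apply("Log Windowed", ParDo.of(new LogWindowedRecords()));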

Also, if we replace FileIO with a Kafka sink as the output, the pipeline runs fine and keeps outputting records to Kafka after deploys.
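
The Kafka sink we swapped in for that test looked roughly like this (the broker address, topic name and the GenericRecord-to-JSON-string conversion are placeholders specific to our setup):

import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.TypeDescriptors;
import org.apache.kafka.common.serialization.StringSerializer;

// Debug-only sink: serialize each GenericRecord to its JSON string and write only the values to Kafka.
parquetRecord
        .apply("ToJson", MapElements.into(TypeDescriptors.strings())
                .via((GenericRecord record) -> record.toString()))
        .apply("WriteToKafka", KafkaIO.<Void, String>write()
                .withBootstrapServers("broker-1:9092")   // placeholder broker
                .withTopic("debug-output")               // placeholder topic
                .withValueSerializer(StringSerializer.class)
                .values());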

This is our code snippet -

parquetRecord.apply("Batch Events", Window.<GenericRecord>into(
                FixedWindows.of(Duration.standardMinutes(Integer.parseInt(windowTime))))
        .triggering(AfterWatermark.pastEndOfWindow())
        .withAllowedLateness(Duration.ZERO, Window.ClosingBehavior.FIRE_ALWAYS)
        .discardingFiredPanes())

        .apply(Distinct.create())

        .apply(FileIO.<GenericRecord>write()
                .via(ParquetIO.sink(getOutput_schema()))
                .to(outputPath.isEmpty() ? outputPath() : outputPath)
                .withNumShards(1)
                .withNaming(new CustomFileNaming("snappy.parquet")));

Flink UI screenshot: it shows records coming through up to FileIO.Write.

This is the stage where it is not sending any records out -

FileIO.Write/WriteFiles/WriteShardedBundlesToTempFiles/GroupIntoShards -> FileIO.Write/WriteFiles/WriteShardedBundlesToTempFiles/WriteShardsIntoTempFiles/ParMultiDo(WriteShardsIntoTempFiles) -> FileIO.Write/WriteFiles/GatherTempFileResults/Add void key/AddKeys/Map/ParMultiDo(Anonymous)

Flink UI screenshot

Any idea what could be wrong here, or are there any known open bugs in Beam/Flink?

infiniti

2 Answers


It seems that no output is coming from this GroupByKey: https://github.com/apache/beam/blob/050b642b49f71a71434480c29272a64314992ee7/sdks/java/core/src/main/java/org/apache/beam/sdk/io/WriteFiles.java#L674

This is because by default the output is re-windowed into the global window and the trigger is set to the default trigger.
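
For illustration only, that re-windowing corresponds to something like the transform below being applied before the GroupByKey (a sketch of the concept, not the exact WriteFiles internals; `input` and `rewindowed` are placeholder names). With an unbounded input, the default trigger in the global window only fires when the global window closes, which in a streaming pipeline is effectively never, so no panes reach the downstream file-writing steps:

import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.transforms.windowing.DefaultTrigger;
import org.apache.beam.sdk.transforms.windowing.GlobalWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

// Everything is collapsed into one never-ending window, and the default trigger
// waits for the watermark to pass the end of that window before emitting a pane.
PCollection<GenericRecord> rewindowed = input.apply(
        Window.<GenericRecord>into(new GlobalWindows())
                .triggering(DefaultTrigger.of())
                .withAllowedLateness(Duration.ZERO)
                .discardingFiredPanes());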

You will need to add .withWindowedWrites to your FileIO configuration.

Kenn Knowles
  • Is it possible to add `.withWindowedWrites` in `FileIO.Write` for `GenericRecord`? It gives an error that the method is not available. Is there any other way to specify it? Also, I see in FileIO that this is set by default if the `IgnoreWindowing` option is not specified: https://github.com/apache/beam/blob/050b642b49f71a71434480c29272a64314992ee7/sdks/java/core/src/main/java/org/apache/beam/sdk/io/FileIO.java#L1310. I am curious why it writes fine to S3 all the time except when starting the pipeline after a code change. Is it always writing with the global window, even when the pipeline runs fine? – infiniti Feb 23 '21 at 10:50
  • S3 writes work fine in all scenarios (restarts/code changes) if we remove the `Distinct.create()` transform from our DAG. – infiniti Apr 13 '21 at 09:16

Have you tried increasing `.withNumShards(1)`? We had a batch use case that failed with the number of shards set to 1, also writing to S3 from the FlinkRunner. We think it is a bug in the FlinkRunner.
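
For example, keeping everything else from the write in the question the same and only raising the shard count (4 is an arbitrary value to try):

.apply(FileIO.<GenericRecord>write()
        .via(ParquetIO.sink(getOutput_schema()))
        .to(outputPath.isEmpty() ? outputPath() : outputPath)
        .withNumShards(4)   // was 1
        .withNaming(new CustomFileNaming("snappy.parquet")));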