
I am currently working on a Beam pipeline (Beam 2.23, Flink runner 1.8) where we read JSON events from Kafka and write the output in Parquet format to S3.

We write to S3 every 10 minutes.

We have observed that the pipeline sometimes stops writing to S3 after we make minor, non-breaking code changes and redeploy. If we change the Kafka offset and restart the pipeline, it starts writing to S3 again.

While FileIO is not writing to S3, the pipeline runs fine without any error or exception and processes records up to the FileIO stage. Nothing shows up in the logs; it just silently stops emitting anything at the FileIO stage.

The watermark also does not progress for that stage; it stays at the time the pipeline was stopped for the deploy (the savepoint time).

We have checked our windowing by logging records after the Window transform; windowing works fine.
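
For reference, this is roughly how we verified it; the `LogWindowedRecords` class and the names below are just something we added for debugging, not part of the pipeline shown further down:

import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.windowing.BoundedWindow;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Debug-only DoFn: logs each record together with the window it was assigned to,
// then passes it through unchanged.
static class LogWindowedRecords extends DoFn<GenericRecord, GenericRecord> {
    private static final Logger LOG = LoggerFactory.getLogger(LogWindowedRecords.class);

    @ProcessElement
    public void processElement(ProcessContext c, BoundedWindow window) {
        LOG.info("window={} timestamp={} record={}", window, c.timestamp(), c.element());
        c.output(c.element());
    }
}

// Applied right after the "Batch Events" Window transform while debugging:
// windowedRecords.apply("Log Windowed", ParDo.of(new LogWindowedRecords()));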

Also, if we replace FileIO with a Kafka sink as the output, the pipeline runs fine and keeps outputting records to Kafka after deploys.
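
The Kafka sink we swapped in for that test looked roughly like this (the broker address, topic name and the GenericRecord-to-JSON-string conversion are placeholders specific to our setup):

import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.TypeDescriptors;
import org.apache.kafka.common.serialization.StringSerializer;

// Debug-only sink: serialize each GenericRecord to its JSON string and write only the values to Kafka.
parquetRecord
        .apply("ToJson", MapElements.into(TypeDescriptors.strings())
                .via((GenericRecord record) -> record.toString()))
        .apply("WriteToKafka", KafkaIO.<Void, String>write()
                .withBootstrapServers("broker-1:9092")   // placeholder broker
                .withTopic("debug-output")               // placeholder topic
                .withValueSerializer(StringSerializer.class)
                .values());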

This is our code snippet -

parquetRecord.apply("Batch Events", Window.<GenericRecord>into(
                FixedWindows.of(Duration.standardMinutes(Integer.parseInt(windowTime))))
        .triggering(AfterWatermark.pastEndOfWindow())
        .withAllowedLateness(Duration.ZERO, Window.ClosingBehavior.FIRE_ALWAYS)
        .discardingFiredPanes())

        .apply(Distinct.create())

        .apply(FileIO.<GenericRecord>write()
                .via(ParquetIO.sink(getOutput_schema()))
                .to(outputPath.isEmpty() ? outputPath() : outputPath)
                .withNumShards(1)
                .withNaming(new CustomFileNaming("snappy.parquet")));

Flink UI screenshot: it shows records coming through up to FileIO.Write.

This is the stage where it is not sending any records out -

FileIO.Write/WriteFiles/WriteShardedBundlesToTempFiles/GroupIntoShards -> FileIO.Write/WriteFiles/WriteShardedBundlesToTempFiles/WriteShardsIntoTempFiles/ParMultiDo(WriteShardsIntoTempFiles) -> FileIO.Write/WriteFiles/GatherTempFileResults/Add void key/AddKeys/Map/ParMultiDo(Anonymous)

Flink UI screenshot

Any idea what could be wrong here, or are there any known open bugs in Beam/Flink?

infiniti

2 Answers


It seems that no output is coming from this GroupByKey: https://github.com/apache/beam/blob/050b642b49f71a71434480c29272a64314992ee7/sdks/java/core/src/main/java/org/apache/beam/sdk/io/WriteFiles.java#L674

This is because by default the output is re-windowed into the global window and the trigger is set to the default trigger.
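
For illustration only, that re-windowing corresponds to something like the transform below being applied before the GroupByKey (a sketch of the concept, not the exact WriteFiles internals; `input` and `rewindowed` are placeholder names). With an unbounded input, the default trigger in the global window only fires when the global window closes, which in a streaming pipeline is effectively never, so no panes reach the downstream file-writing steps:

import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.transforms.windowing.DefaultTrigger;
import org.apache.beam.sdk.transforms.windowing.GlobalWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

// Everything is collapsed into one never-ending window, and the default trigger
// waits for the watermark to pass the end of that window before emitting a pane.
PCollection<GenericRecord> rewindowed = input.apply(
        Window.<GenericRecord>into(new GlobalWindows())
                .triggering(DefaultTrigger.of())
                .withAllowedLateness(Duration.ZERO)
                .discardingFiredPanes());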

You will need to add .withWindowedWrites to your FileIO configuration.

Kenn Knowles
  • Is it possible to add `.withWindowedWrites` in `FileIO.Write` for `GenericRecord`? It gives an error that the method is not available. Is there any other way to specify it? Also, I see in FileIO that this is set by default if the `IgnoreWindowing` option is not specified: https://github.com/apache/beam/blob/050b642b49f71a71434480c29272a64314992ee7/sdks/java/core/src/main/java/org/apache/beam/sdk/io/FileIO.java#L1310. I am curious why it writes fine to S3 all the time except when starting the pipeline after a code change. Is it always writing with the global window, even when the pipeline runs fine? – infiniti Feb 23 '21 at 10:50
  • S3 writes work fine in all scenarios (restarts/code changes) if we remove the `Distinct.create()` transform from our DAG. – infiniti Apr 13 '21 at 09:16

Have you tried increasing `.withNumShards(1)`? We had a batch use case that failed with the number of shards set to 1, also writing to S3 from the FlinkRunner. We think it is a bug in the FlinkRunner.
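
For example, keeping everything else from the write in the question the same and only raising the shard count (4 is an arbitrary value to try):

.apply(FileIO.<GenericRecord>write()
        .via(ParquetIO.sink(getOutput_schema()))
        .to(outputPath.isEmpty() ? outputPath() : outputPath)
        .withNumShards(4)   // was 1
        .withNaming(new CustomFileNaming("snappy.parquet")));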