
I have a Flink batch job that reads a very large Parquet file from S3 and then sinks JSON records into a Kafka topic.

The problem is how to make the file-reading process stateful: whenever the job is interrupted or crashes, it should resume from the previous reading position when it restarts. I don't want to send duplicate items to Kafka after a restart.

Here is my example code:

val env = ExecutionEnvironment.getExecutionEnvironment
val input = Parquet.input[User](new Path(s"s3a://path"))
env.createInput(input)
  .filter(r => Option(r.token).getOrElse("").nonEmpty)
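
For context, here is a rough, untested sketch of the kind of restart-safe pipeline I am aiming for. It assumes a recent Flink release (1.15+) where the DataStream FileSource, AvroParquetReaders and KafkaSink are available; User (as an Avro specific record), the S3 path, the broker address and the topic name are placeholders:

import org.apache.flink.api.common.eventtime.WatermarkStrategy
import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.connector.base.DeliveryGuarantee
import org.apache.flink.connector.file.src.FileSource
import org.apache.flink.connector.kafka.sink.{KafkaRecordSerializationSchema, KafkaSink}
import org.apache.flink.core.fs.Path
import org.apache.flink.formats.parquet.avro.AvroParquetReaders
import org.apache.flink.streaming.api.scala._

val env = StreamExecutionEnvironment.getExecutionEnvironment
// Reading progress (file splits and positions) is stored in checkpoints,
// so a restart from the latest checkpoint resumes instead of re-reading the file.
env.enableCheckpointing(60000L)

// Placeholder: User is assumed to be an Avro specific record matching the Parquet schema.
val source = FileSource
  .forRecordStreamFormat(AvroParquetReaders.forSpecificRecord(classOf[User]), new Path("s3a://path"))
  .build()

// EXACTLY_ONCE uses Kafka transactions committed on checkpoint completion, so consumers
// reading with isolation.level=read_committed should not see duplicates after a restart.
val sink = KafkaSink.builder[String]()
  .setBootstrapServers("broker:9092")              // placeholder
  .setRecordSerializer(
    KafkaRecordSerializationSchema.builder[String]()
      .setTopic("user-topic")                      // placeholder
      .setValueSerializationSchema(new SimpleStringSchema())
      .build())
  .setDeliveryGuarantee(DeliveryGuarantee.EXACTLY_ONCE)
  .setTransactionalIdPrefix("parquet-to-kafka")
  .build()

env
  .fromSource(source, WatermarkStrategy.noWatermarks[User](), "parquet-source")
  .filter(r => Option(r.token).getOrElse("").nonEmpty)
  .map(r => r.toString)                            // placeholder for real JSON serialization
  .sinkTo(sink)

env.execute("parquet-to-kafka")

Is this the right direction, or is there a better way to avoid duplicates with a bounded file source?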
  • Hi @mstzn, did you find any solution to your problem? – Piyush_Rana Sep 08 '20 at 22:03
  • Unfortunately, it seems there is no strict state when you read from a file; Flink reads my Parquet file chunk by chunk. I also set a very short checkpoint interval when reading from a file. – mstzn Nov 07 '20 at 00:30

0 Answers