I have a Flink batch job that reads a very large Parquet file from S3 and then sinks JSON records into a Kafka topic.
The problem is: how can I make the file-reading process stateful? That is, whenever the job is interrupted or crashes, it should resume from its previous reading position. I don't want to send duplicate items to Kafka when the job restarts.
Here is my example code:
import org.apache.flink.api.scala._
import org.apache.flink.core.fs.Path

val env = ExecutionEnvironment.getExecutionEnvironment
// Parquet.input is our own helper that builds an InputFormat[User] for the file
val input = Parquet.input[User](new Path("s3a://path"))
env.createInput(input)
  .filter(r => Option(r.token).getOrElse("").nonEmpty) // drop records with an empty token
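
For reference, here is the direction I've been considering after reading the Flink docs: switch from the DataSet API to the DataStream API, enable checkpointing so the FileSource's reading position becomes part of the job state, and use a KafkaSink with EXACTLY_ONCE delivery so replayed records are committed transactionally. This is only a rough sketch that I haven't verified end-to-end; the broker address, topic name, and the toJson helper are placeholders I made up for the example.

import org.apache.flink.api.common.eventtime.WatermarkStrategy
import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.connector.base.DeliveryGuarantee
import org.apache.flink.connector.file.src.FileSource
import org.apache.flink.connector.kafka.sink.{KafkaRecordSerializationSchema, KafkaSink}
import org.apache.flink.core.fs.Path
import org.apache.flink.formats.parquet.avro.AvroParquetReaders
import org.apache.flink.streaming.api.scala._

val env = StreamExecutionEnvironment.getExecutionEnvironment
// Checkpoints persist the source's read position, so a restart resumes instead of re-reading.
env.enableCheckpointing(60000)

// Bounded Parquet source; forReflectRecord maps rows onto the User class via Avro reflection.
val source = FileSource
  .forRecordStreamFormat(AvroParquetReaders.forReflectRecord(classOf[User]), new Path("s3a://path"))
  .build()

// EXACTLY_ONCE uses Kafka transactions tied to checkpoints, so replayed records never become visible twice.
val sink = KafkaSink.builder[String]()
  .setBootstrapServers("broker:9092") // placeholder
  .setRecordSerializer(
    KafkaRecordSerializationSchema.builder[String]()
      .setTopic("users") // placeholder topic
      .setValueSerializationSchema(new SimpleStringSchema())
      .build())
  .setDeliveryGuarantee(DeliveryGuarantee.EXACTLY_ONCE)
  .setTransactionalIdPrefix("parquet-to-kafka") // required for EXACTLY_ONCE
  .build()

env.fromSource(source, WatermarkStrategy.noWatermarks[User](), "parquet-source")
  .filter(r => Option(r.token).getOrElse("").nonEmpty)
  .map(u => toJson(u)) // toJson: stand-in for whatever JSON serialization we end up using
  .sinkTo(sink)

env.execute("parquet-to-kafka")

Is checkpointing plus an exactly-once sink the right way to get this behavior, or is there a way to achieve resumable reads with the DataSet API itself?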