
I have a Flink batch job that reads a very large Parquet file from S3 and then sinks JSON records into a Kafka topic.

The problem is how to make the file-reading process stateful: whenever the job is interrupted or crashes, it should resume from the previous reading position when it restarts. I don't want to send duplicate items to Kafka after a restart.

Here is my example code:

val env = ExecutionEnvironment.getExecutionEnvironment
val input = Parquet.input[User](new Path(s"s3a://path"))
env.createInput(input)
  .filter(r => Option(r.token).getOrElse("").nonEmpty)
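
For context, here is a rough, untested sketch of the kind of restart-safe pipeline I am aiming for. It assumes a recent Flink release (1.15+) where the DataStream FileSource, AvroParquetReaders and KafkaSink are available; User (as an Avro specific record), the S3 path, the broker address and the topic name are placeholders:

import org.apache.flink.api.common.eventtime.WatermarkStrategy
import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.connector.base.DeliveryGuarantee
import org.apache.flink.connector.file.src.FileSource
import org.apache.flink.connector.kafka.sink.{KafkaRecordSerializationSchema, KafkaSink}
import org.apache.flink.core.fs.Path
import org.apache.flink.formats.parquet.avro.AvroParquetReaders
import org.apache.flink.streaming.api.scala._

val env = StreamExecutionEnvironment.getExecutionEnvironment
// Reading progress (file splits and positions) is stored in checkpoints,
// so a restart from the latest checkpoint resumes instead of re-reading the file.
env.enableCheckpointing(60000L)

// Placeholder: User is assumed to be an Avro specific record matching the Parquet schema.
val source = FileSource
  .forRecordStreamFormat(AvroParquetReaders.forSpecificRecord(classOf[User]), new Path("s3a://path"))
  .build()

// EXACTLY_ONCE uses Kafka transactions committed on checkpoint completion, so consumers
// reading with isolation.level=read_committed should not see duplicates after a restart.
val sink = KafkaSink.builder[String]()
  .setBootstrapServers("broker:9092")              // placeholder
  .setRecordSerializer(
    KafkaRecordSerializationSchema.builder[String]()
      .setTopic("user-topic")                      // placeholder
      .setValueSerializationSchema(new SimpleStringSchema())
      .build())
  .setDeliveryGuarantee(DeliveryGuarantee.EXACTLY_ONCE)
  .setTransactionalIdPrefix("parquet-to-kafka")
  .build()

env
  .fromSource(source, WatermarkStrategy.noWatermarks[User](), "parquet-source")
  .filter(r => Option(r.token).getOrElse("").nonEmpty)
  .map(r => r.toString)                            // placeholder for real JSON serialization
  .sinkTo(sink)

env.execute("parquet-to-kafka")

Is this the right direction, or is there a better way to avoid duplicates with a bounded file source?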
  • Hi @mstzn, did you find any solution to your problem? – Piyush_Rana Sep 08 '20 at 22:03
  • Unfortunately, it seems there is no strict state when you read from a file; Flink reads my Parquet file chunk by chunk. I also set a very short checkpoint interval when reading from a file. – mstzn Nov 07 '20 at 00:30

0 Answers