
Let me explain my problem by first describing the scenario below.

Scenario: I am continuously reading files with Flink's PROCESS_CONTINUOUSLY mode, using Flink + Java 8.

This is effectively batch-style reading: different files arrive at different times during the day. Say file_1.csv arrives at 3:00 PM; my Flink job reads it. Then file_2.csv arrives at 3:30 PM and the job reads that file as well, and the process continues this way until the job stops. We sink the data to Kafka.
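For reference, a minimal sketch of such a pipeline might look like the following. The input directory, monitoring interval, broker address, and topic name are all assumptions, not part of the original question:

```java
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.api.java.io.TextInputFormat;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.source.FileProcessingMode;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer;

import java.util.Properties;

public class ContinuousCsvToKafka {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Hypothetical input directory; adjust to wherever the CSV files land.
        String inputDir = "/data/incoming";
        TextInputFormat format = new TextInputFormat(new Path(inputDir));

        // PROCESS_CONTINUOUSLY re-scans the directory and picks up new files
        // as they arrive; the 10s monitoring interval is an assumption.
        DataStream<String> lines = env.readFile(
                format, inputDir,
                FileProcessingMode.PROCESS_CONTINUOUSLY,
                10_000L);

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092"); // assumed broker

        // Sink each CSV line to a (hypothetical) Kafka topic.
        lines.addSink(new FlinkKafkaProducer<>(
                "csv-records", new SimpleStringSchema(), props));

        env.execute("continuous-csv-to-kafka");
    }
}
```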

Problem: When I restart the Flink job, it re-reads all of the previously read files' data, which means I get the same records again every time I restart the job.

Is there any way to prevent this data duplication?

MiniSu

1 Answer


It sounds like you are throwing away the job's state when you restart. If you do a stateful restart from a checkpoint or savepoint, the new job should pick up from where the previous one left off.

See https://ci.apache.org/projects/flink/flink-docs-stable/docs/try-flink/flink-operations-playground/#upgrading--rescaling-a-job for more info.
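As a sketch, the job needs checkpointing enabled and the checkpoints retained on cancellation so there is state to restart from. The interval below is an assumption; tune it to your latency budget:

```java
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.CheckpointConfig.ExternalizedCheckpointCleanup;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointedJobSetup {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Checkpoint every 60s with exactly-once guarantees (interval is an assumption).
        env.enableCheckpointing(60_000L, CheckpointingMode.EXACTLY_ONCE);

        // Keep the last checkpoint on disk even when the job is cancelled,
        // so it can be used for a stateful restart.
        env.getCheckpointConfig().enableExternalizedCheckpoints(
                ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);

        // ... build the file-reading pipeline here, then call env.execute(...)
    }
}
```

You can then take a savepoint with `flink savepoint <jobId>` (or cancel with one via `flink cancel -s <jobId>`) and resume from it with `flink run -s <savepointPath> yourJob.jar`, so the file-monitoring source remembers which files it has already processed.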

David Anderson