
Let me explain my problem by first describing the scenario below.

Scenario: I am continuously reading files with Flink's PROCESS_CONTINUOUSLY mode, using Flink + Java 8.

This is effectively batch-style reading: different files arrive at different times during the day. Say file_1.csv arrives at 3:00 PM; my Flink job reads it. Then file_2.csv arrives at 3:30 PM and the job reads that file as well, and the process continues this way until the job stops. We sink the data to Kafka.
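For reference, a minimal sketch of such a pipeline might look like the following. The input directory, monitoring interval, broker address, and topic name are all assumptions, not part of the original question:

```java
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.api.java.io.TextInputFormat;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.source.FileProcessingMode;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer;

import java.util.Properties;

public class ContinuousCsvToKafka {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Hypothetical input directory; adjust to wherever the CSV files land.
        String inputDir = "/data/incoming";
        TextInputFormat format = new TextInputFormat(new Path(inputDir));

        // PROCESS_CONTINUOUSLY re-scans the directory and picks up new files
        // as they arrive; the 10s monitoring interval is an assumption.
        DataStream<String> lines = env.readFile(
                format, inputDir,
                FileProcessingMode.PROCESS_CONTINUOUSLY,
                10_000L);

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092"); // assumed broker

        // Sink each CSV line to a (hypothetical) Kafka topic.
        lines.addSink(new FlinkKafkaProducer<>(
                "csv-records", new SimpleStringSchema(), props));

        env.execute("continuous-csv-to-kafka");
    }
}
```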

Problem: When I restart the Flink job, it re-reads all of the previously read files' data, which means I get the same records again every time I restart the job.

Is there any way to prevent this data duplication?

MiniSu

1 Answer


It sounds like you are throwing away the job's state when you restart. If you do a stateful restart from a checkpoint or savepoint, the new job should pick up from where the previous one left off.

See https://ci.apache.org/projects/flink/flink-docs-stable/docs/try-flink/flink-operations-playground/#upgrading--rescaling-a-job for more info.
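As a sketch, the job needs checkpointing enabled and the checkpoints retained on cancellation so there is state to restart from. The interval below is an assumption; tune it to your latency budget:

```java
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.CheckpointConfig.ExternalizedCheckpointCleanup;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointedJobSetup {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Checkpoint every 60s with exactly-once guarantees (interval is an assumption).
        env.enableCheckpointing(60_000L, CheckpointingMode.EXACTLY_ONCE);

        // Keep the last checkpoint on disk even when the job is cancelled,
        // so it can be used for a stateful restart.
        env.getCheckpointConfig().enableExternalizedCheckpoints(
                ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);

        // ... build the file-reading pipeline here, then call env.execute(...)
    }
}
```

You can then take a savepoint with `flink savepoint <jobId>` (or cancel with one via `flink cancel -s <jobId>`) and resume from it with `flink run -s <savepointPath> yourJob.jar`, so the file-monitoring source remembers which files it has already processed.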

David Anderson