I am using Spark Structured Streaming (Spark 2.4.6) to read data files from an NFS mount point. However, the streaming job sometimes records the same file in the checkpoint under different path forms across batches, which produces duplicates. Has anyone run into a similar issue?
Here is an example from the checkpoint, where the same file appears under two different batch IDs, once as file:/// and once as file:/:
$ hdfs dfs -cat checkpoint/App/sources/0/15419.compact | grep 'export_dat_20210923.gz'
{"path":"file:///data_uploads/app/export_dat_20210923.gz","timestamp":1632398460000,"batchId":14994} {"path":"file:/data_uploads/app/export_dat_20210923.gz","timestamp":1632398460000,"batchId":14997}