7

I am using Spark Structured Streaming (Spark 2.4.6) to read data files from an NFS mount point. Sometimes, however, the streaming job checkpoints the same file under different paths in different batches, which produces duplicates. Has anyone run into a similar issue?

Here is example of checkpoint:

$ hdfs dfs -cat checkpoint/App/sources/0/15419.compact | grep 'export_dat_20210923.gz'

{"path":"file:///data_uploads/app/export_dat_20210923.gz","timestamp":1632398460000,"batchId":14994} {"path":"file:/data_uploads/app/export_dat_20210923.gz","timestamp":1632398460000,"batchId":14997}

kevi
  • Can you add the code for reading the data? Specifically which paths are you passing and if you use globbed paths or not? – Yosi Dahari Oct 01 '21 at 06:53
  • reading data part: `sourceDF = self.spark.readStream.option("maxFileAge", self.maximumFileAge).option("maxFilesPerTrigger", self.maximumFilesPerTrigger).csv(self.inputDir + self.fileNamePattern, header="false", sep=separator, quote="\"", mode="PERMISSIVE", schema=schema, columnNameOfCorruptRecord='corrupt_record')` – kevi Oct 06 '21 at 13:58
  • "input_dir": "file:/data_uploads/app/", "pattern": "export_dat_*.gz" – kevi Oct 06 '21 at 14:00
  • Note the part of the answer about globbed paths – Yosi Dahari Oct 06 '21 at 14:03
  • Yes, it seems that globbed paths were causing this. I have eliminated them and it is now working fine. – kevi Oct 21 '21 at 09:17

1 Answer

4

The exactly-once guarantee comes with several assumptions: the source must be replayable, the checkpoint location must be on HDFS-compatible, fault-tolerant storage, and the sink must be idempotent.
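A minimal sketch of where these pieces appear in PySpark (the output and checkpoint paths below are assumptions, not taken from the question); the file sink itself is not idempotent, which is why the rest of this answer matters:

```python
# Sketch only: shows where the fault-tolerant checkpoint location is configured.
# "hdfs:///data/output/" and "hdfs:///checkpoint/App/" are hypothetical paths.
query = (
    sourceDF.writeStream
    .format("parquet")
    .option("path", "hdfs:///data/output/")                    # sink location (assumed)
    .option("checkpointLocation", "hdfs:///checkpoint/App/")   # must be HDFS-compatible storage
    .start()
)
```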

When writing files with Structured Streaming, you do not get idempotency out of the box. If different batches write the same data to different files or partitions, duplicates can occur by design. For example, as described in this article, using globbed paths results in duplicates.

The problem is described in this article.
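Based on the asker's comments above, a hedged sketch of the change that resolved the issue: point `readStream` at the plain directory instead of a globbed path. The schema, separator and option values below are placeholders, not taken from the original job:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.appName("App").getOrCreate()

# Placeholder schema; the real job's schema is not shown in the question.
schema = StructType([StructField("raw_line", StringType(), True)])

# Read the directory itself rather than the globbed path
# "file:/data_uploads/app/export_dat_*.gz" used originally.
sourceDF = (
    spark.readStream
    .option("maxFileAge", "7d")            # placeholder value
    .option("maxFilesPerTrigger", 10)      # placeholder value
    .csv("file:///data_uploads/app/",      # directory, no glob pattern
         header=False,
         sep=",",                          # placeholder separator
         schema=schema)
)
```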

There are several idempotent targets (e.g. Elasticsearch), and there are also suggestions on how to write in an idempotent manner, for example:

You can create idempotent sinks by implementing logic that first checks for the existence of the incoming result in the datastore. If the result already exists, the write should appear to succeed from the perspective of your Spark job, but in reality your data store ignored the duplicate data. If the result doesn't exist, then the sink should insert this new result into its storage.
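A rough illustration of that "check before write" idea using `foreachBatch` (available since Spark 2.4); the target path, join key and checkpoint path are hypothetical, and `sourceDF` is the streaming DataFrame from the read:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def write_idempotent(batch_df, batch_id):
    """Append only rows whose key is not already in the target (hypothetical names)."""
    target_path = "hdfs:///data/output/"
    try:
        existing = spark.read.parquet(target_path)
        new_rows = batch_df.join(existing.select("record_id"), on="record_id", how="left_anti")
    except Exception:
        # Target does not exist yet (first batch): everything is new.
        new_rows = batch_df
    new_rows.write.mode("append").parquet(target_path)

query = (
    sourceDF.writeStream                                         # sourceDF from the streaming read
    .foreachBatch(write_idempotent)
    .option("checkpointLocation", "hdfs:///checkpoint/App/")     # assumed checkpoint path
    .start()
)
```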

Yosi Dahari