7

I am using Spark Structured Streaming (Spark 2.4.6) to read data files from an NFS mount point. Sometimes, however, the streaming job checkpoints the same file under different paths in different batches, which produces duplicates. Has anyone run into a similar issue?

Here is example of checkpoint:

$ hdfs dfs -cat checkpoint/App/sources/0/15419.compact | grep 'export_dat_20210923.gz'

{"path":"file:///data_uploads/app/export_dat_20210923.gz","timestamp":1632398460000,"batchId":14994} {"path":"file:/data_uploads/app/export_dat_20210923.gz","timestamp":1632398460000,"batchId":14997}

kevi
  • Can you add the code for reading the data? Specifically which paths are you passing and if you use globbed paths or not? – Yosi Dahari Oct 01 '21 at 06:53
  • reading data part: `sourceDF = self.spark.readStream.option("maxFileAge", self.maximumFileAge).option("maxFilesPerTrigger", self.maximumFilesPerTrigger).csv(self.inputDir + self.fileNamePattern, header="false", sep=separator, quote="\"", mode="PERMISSIVE", schema=schema, columnNameOfCorruptRecord='corrupt_record')` – kevi Oct 06 '21 at 13:58
  • "input_dir": "file:/data_uploads/app/", "pattern": "export_dat_*.gz" – kevi Oct 06 '21 at 14:00
  • Note the part of the answer about globbed paths – Yosi Dahari Oct 06 '21 at 14:03
  • Yes, it seems that globbed paths were causing this. I have eliminated them and it is now working fine. – kevi Oct 21 '21 at 09:17

1 Answer

4

The exactly-once guarantee comes with several assumptions: the source must be replayable, the checkpoint location must be on HDFS-compatible, fault-tolerant storage, and the sink must be idempotent.
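A minimal sketch of where these pieces appear in PySpark (the output and checkpoint paths below are assumptions, not taken from the question); the file sink itself is not idempotent, which is why the rest of this answer matters:

```python
# Sketch only: shows where the fault-tolerant checkpoint location is configured.
# "hdfs:///data/output/" and "hdfs:///checkpoint/App/" are hypothetical paths.
query = (
    sourceDF.writeStream
    .format("parquet")
    .option("path", "hdfs:///data/output/")                    # sink location (assumed)
    .option("checkpointLocation", "hdfs:///checkpoint/App/")   # must be HDFS-compatible storage
    .start()
)
```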

When writing files with Structured Streaming, you do not get idempotency out of the box. If different batches write the same data to different files or partitions, duplicates can occur by design. For example, as described in this article, using globbed paths results in duplicates.

The problem is described in this article.
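Based on the asker's comments above, a hedged sketch of the change that resolved the issue: point `readStream` at the plain directory instead of a globbed path. The schema, separator and option values below are placeholders, not taken from the original job:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.appName("App").getOrCreate()

# Placeholder schema; the real job's schema is not shown in the question.
schema = StructType([StructField("raw_line", StringType(), True)])

# Read the directory itself rather than the globbed path
# "file:/data_uploads/app/export_dat_*.gz" used originally.
sourceDF = (
    spark.readStream
    .option("maxFileAge", "7d")            # placeholder value
    .option("maxFilesPerTrigger", 10)      # placeholder value
    .csv("file:///data_uploads/app/",      # directory, no glob pattern
         header=False,
         sep=",",                          # placeholder separator
         schema=schema)
)
```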

There are several idempotent targets (e.g. Elasticsearch), and there are also suggestions on how to write in an idempotent manner, for example:

You can create idempotent sinks by implementing logic that first checks for the existence of the incoming result in the datastore. If the result already exists, the write should appear to succeed from the perspective of your Spark job, but in reality your data store ignored the duplicate data. If the result doesn't exist, then the sink should insert this new result into its storage.
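A rough illustration of that "check before write" idea using `foreachBatch` (available since Spark 2.4); the target path, join key and checkpoint path are hypothetical, and `sourceDF` is the streaming DataFrame from the read:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def write_idempotent(batch_df, batch_id):
    """Append only rows whose key is not already in the target (hypothetical names)."""
    target_path = "hdfs:///data/output/"
    try:
        existing = spark.read.parquet(target_path)
        new_rows = batch_df.join(existing.select("record_id"), on="record_id", how="left_anti")
    except Exception:
        # Target does not exist yet (first batch): everything is new.
        new_rows = batch_df
    new_rows.write.mode("append").parquet(target_path)

query = (
    sourceDF.writeStream                                         # sourceDF from the streaming read
    .foreachBatch(write_idempotent)
    .option("checkpointLocation", "hdfs:///checkpoint/App/")     # assumed checkpoint path
    .start()
)
```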

Yosi Dahari