
Let's say I have a Kafka topic without any duplicate messages.

If I consume this topic with Spark Structured Streaming, add a column with current_timestamp(), partition by this time column, and save the records to S3, is there a risk of creating duplicates in S3 in case of failures?

Or is Spark smart enough to deliver these messages exactly once?
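
For concreteness, a minimal sketch of the pipeline I mean (the broker address, topic name, and S3 paths are placeholders; I'm using Spark's current_timestamp() for the time column):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.current_timestamp

object KafkaToS3 {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("kafka-to-s3")
      .getOrCreate()

    // Streaming read from the Kafka topic (broker and topic are placeholders).
    val kafkaDf = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "my-topic")
      .load()

    // Stamp each record with the processing time; in practice you would likely
    // truncate this to a date or hour before using it as a partition key.
    val stamped = kafkaDf
      .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
      .withColumn("ingest_time", current_timestamp())

    // File sink: end-to-end exactly-once output depends on the checkpoint plus
    // the sink's _spark_metadata commit log, not on the files being unique.
    val query = stamped.writeStream
      .format("parquet")
      .option("path", "s3a://my-bucket/events/")            // placeholder
      .option("checkpointLocation", "s3a://my-bucket/chk/") // placeholder
      .partitionBy("ingest_time")
      .start()

    query.awaitTermination()
  }
}
```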

  • After reading the article https://medium.com/@Iqbalkhattra85/exactly-once-mechanism-in-spark-structured-streaming-7a27d8423560 I'm assuming it will be exactly once. The same files may be saved under different time partitions, but only the ones committed in the _spark_metadata folder will be read. Am I correct? – Konrad Paniec Oct 07 '22 at 23:25
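
For reference, a sketch of what the comment implies on the read side, assuming the output directory was written by the file sink (same placeholder path as above, so a _spark_metadata log exists under it):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("read-sink-output").getOrCreate()

// Spark detects the file sink's _spark_metadata log under the output path and
// lists only files recorded by committed batches, so orphaned files left by a
// failed batch are invisible to this read.
val committed = spark.read.parquet("s3a://my-bucket/events/") // placeholder path

// Caveat: a raw S3 listing, or an engine that ignores _spark_metadata,
// will also see the orphaned duplicate files.
committed.show()
```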

0 Answers