
Let's say I have a Kafka topic without any duplicate messages.

If I consume this topic with Spark Structured Streaming, add a column with current_timestamp(), partition by this time column, and save the records to S3, is there a risk of creating duplicates in S3 in case of failures?

Or is Spark smart enough to deliver these messages exactly once?
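
For concreteness, a minimal sketch of the pipeline I mean (the broker address, topic name, and S3 paths are placeholders; I'm using Spark's current_timestamp() for the time column):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.current_timestamp

object KafkaToS3 {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("kafka-to-s3")
      .getOrCreate()

    // Streaming read from the Kafka topic (broker and topic are placeholders).
    val kafkaDf = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "my-topic")
      .load()

    // Stamp each record with the processing time; in practice you would likely
    // truncate this to a date or hour before using it as a partition key.
    val stamped = kafkaDf
      .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
      .withColumn("ingest_time", current_timestamp())

    // File sink: end-to-end exactly-once output depends on the checkpoint plus
    // the sink's _spark_metadata commit log, not on the files being unique.
    val query = stamped.writeStream
      .format("parquet")
      .option("path", "s3a://my-bucket/events/")            // placeholder
      .option("checkpointLocation", "s3a://my-bucket/chk/") // placeholder
      .partitionBy("ingest_time")
      .start()

    query.awaitTermination()
  }
}
```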

  • After reading the article https://medium.com/@Iqbalkhattra85/exactly-once-mechanism-in-spark-structured-streaming-7a27d8423560 I'm assuming it will be exactly once. The same files may be saved under different time partitions, but only the ones committed in the _spark_metadata folder will be read. Am I correct? – Konrad Paniec Oct 07 '22 at 23:25
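
For reference, a sketch of what the comment implies on the read side, assuming the output directory was written by the file sink (same placeholder path as above, so a _spark_metadata log exists under it):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("read-sink-output").getOrCreate()

// Spark detects the file sink's _spark_metadata log under the output path and
// lists only files recorded by committed batches, so orphaned files left by a
// failed batch are invisible to this read.
val committed = spark.read.parquet("s3a://my-bucket/events/") // placeholder path

// Caveat: a raw S3 listing, or an engine that ignores _spark_metadata,
// will also see the orphaned duplicate files.
committed.show()
```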

0 Answers