
I have a stream that uses `foreachBatch` and keeps its checkpoints in a data lake. If I cancel the stream, the last write is sometimes not fully committed, so the next time I start the stream it resumes from the last committed batchId, reprocesses that batch, and I get duplicates.

I use Delta, but I don't want to use `MERGE` because I have a lot of data and it doesn't seem to be as performant as I would like (even with partitioning).

How can I use the `batchId` to handle the duplicates? Or is there some other way?
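For context, here is a minimal sketch of the kind of `batchId`-based idempotence I'm considering (table path and function body are illustrative, not my actual job). The idea is to tag rows with the batch that wrote them and skip a batch that is already in the table; since Delta commits are atomic, a batch is either fully present or absent:

```python
from pyspark.sql import SparkSession, functions as F
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()
target_path = "/mnt/datalake/my_table"  # illustrative

def process_to_dl(df, batch_id):
    # On restart, a batch whose checkpoint commit was lost gets replayed.
    # If its batch_id is already in the table, the append succeeded the
    # first time (Delta appends are atomic), so skip the replay.
    if DeltaTable.isDeltaTable(spark, target_path):
        last_batch = (
            spark.read.format("delta").load(target_path)
            .agg(F.max("batch_id")).collect()[0][0]
        )
        if last_batch is not None and batch_id <= last_batch:
            return
    (df.withColumn("batch_id", F.lit(batch_id))
       .write.format("delta").mode("append").save(target_path))
```

Scanning `max(batch_id)` over the whole table every batch could get expensive; presumably that lookup could be kept in a small side table instead. Newer Delta releases (2.0+) also support idempotent writes via the `txnAppId`/`txnVersion` writer options, which do this bookkeeping inside the Delta commit itself.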

  • Is there any reason you cannot use `writeStream.format("delta").(...)`? The Delta sink will help you handle this automatically. – zsxwing Feb 28 '20 at 18:24
  • @zsxwing I do use `writeStream.format("delta").foreachBatch(process_to_dl).(...)`, but I append all the rows to the Delta table. I don't understand what you mean by "the Delta sink will handle it automatically"; can you clarify? – i61 Mar 03 '20 at 09:20
  • `format("delta")` and `foreachBatch` cannot be used at the same time. `foreachBatch` is like `format("foreachBatch")`. Use `format("delta")` should be enough unless you need to do complicated things in `process_to_dl`. – zsxwing Mar 05 '20 at 07:33

0 Answers