I'm using Spark Structured Streaming to append to a partitioned Iceberg table. I need to use `foreachBatch` (or `foreach`) because I'm on a custom Iceberg catalog implementation (the one from Google BigLake). The Spark docs say `foreachBatch` gives at-least-once semantics, meaning it can replay a batch with the same `batch_id` when recovering from a failure. I don't want duplicate records in the Iceberg table partitions. What's the best way to avoid this?
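For reference, my write is roughly the sketch below (PySpark; the source, table name, and checkpoint path are placeholders, not my real ones):

```python
from pyspark.sql import SparkSession, DataFrame

spark = SparkSession.builder.appName("iceberg-stream").getOrCreate()

def write_batch(batch_df: DataFrame, batch_id: int) -> None:
    # Plain append: if Spark replays this batch after a failure,
    # the same rows get appended again, hence the duplicates.
    batch_df.writeTo("my_catalog.db.events").append()

(spark.readStream
    .format("kafka")                      # placeholder source
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "events")
    .load()
    .writeStream
    .foreachBatch(write_batch)
    .option("checkpointLocation", "/tmp/checkpoints/events")
    .start())
```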
A couple of options I can think of:

- Use `MERGE INTO ... USING ... WHEN NOT MATCHED THEN INSERT *`, so rows are inserted only when their `id` doesn't already exist in the table (first sketch below).
- Store the `batch_id` in a persistent set and check every new batch against it (second sketch below). I don't see any example of that, and I'm not even sure it's the right way to do idempotent updates.
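For the first option, here's a minimal sketch of what I mean, inside the same `foreachBatch` function (the table name and the `id` key are placeholders, and this assumes the Iceberg Spark SQL extensions are enabled so `MERGE INTO` works):

```python
from pyspark.sql import DataFrame

def write_batch(batch_df: DataFrame, batch_id: int) -> None:
    # Expose the micro-batch as a temp view so SQL can reference it.
    batch_df.createOrReplaceTempView("batch_updates")
    # Insert only rows whose id isn't already in the target table;
    # a replayed batch then matches every row and inserts nothing.
    batch_df.sparkSession.sql("""
        MERGE INTO my_catalog.db.events t
        USING batch_updates s
        ON t.id = s.id
        WHEN NOT MATCHED THEN INSERT *
    """)
```

Using `batch_df.sparkSession` (rather than a global session) keeps the temp view resolvable in the session the micro-batch actually belongs to.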
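For the second option, the best concrete version I can picture is a small bookkeeping table keyed by `batch_id` (the `processed_batch_ids` table name is made up):

```python
from pyspark.sql import DataFrame

PROCESSED = "my_catalog.db.processed_batch_ids"  # hypothetical bookkeeping table

def write_batch(batch_df: DataFrame, batch_id: int) -> None:
    spark = batch_df.sparkSession
    # Skip batches that were already committed before a failure/restart.
    seen = spark.sql(
        f"SELECT 1 FROM {PROCESSED} WHERE batch_id = {batch_id} LIMIT 1"
    ).count() > 0
    if seen:
        return
    batch_df.writeTo("my_catalog.db.events").append()
    # This second write is not atomic with the append above, so a crash
    # between the two still lets a replay duplicate the batch, which is
    # part of why I'm unsure this is the right approach.
    spark.sql(f"INSERT INTO {PROCESSED} VALUES ({batch_id})")
```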