I'm using Spark Structured Streaming to append to a partitioned Iceberg table. I need to use `foreachBatch` (or `foreach`) because I'm on a custom Iceberg catalog implementation (the one from Google BigLake). The Spark docs say `foreachBatch` gives at-least-once semantics, meaning it can replay a batch with the same `batch_id` when recovering from a failure. I don't want duplicate records in the Iceberg table partitions. What's the best way to avoid this?
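For reference, my write is roughly the sketch below (PySpark; the source, table name, and checkpoint path are placeholders, not my real ones):

```python
from pyspark.sql import SparkSession, DataFrame

spark = SparkSession.builder.appName("iceberg-stream").getOrCreate()

def write_batch(batch_df: DataFrame, batch_id: int) -> None:
    # Plain append: if Spark replays this batch after a failure,
    # the same rows get appended again, hence the duplicates.
    batch_df.writeTo("my_catalog.db.events").append()

(spark.readStream
    .format("kafka")                      # placeholder source
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "events")
    .load()
    .writeStream
    .foreachBatch(write_batch)
    .option("checkpointLocation", "/tmp/checkpoints/events")
    .start())
```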
A couple of options I can think of:

- Use `MERGE INTO ... USING ... WHEN NOT MATCHED THEN INSERT *`, so rows are inserted only when their `id` doesn't already exist in the table (first sketch below).
- Store the `batch_id` in a persistent set and check every new batch against it (second sketch below). I don't see any example of that, and I'm not even sure it's the right way to do idempotent updates.
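For the first option, here's a minimal sketch of what I mean, inside the same `foreachBatch` function (the table name and the `id` key are placeholders, and this assumes the Iceberg Spark SQL extensions are enabled so `MERGE INTO` works):

```python
from pyspark.sql import DataFrame

def write_batch(batch_df: DataFrame, batch_id: int) -> None:
    # Expose the micro-batch as a temp view so SQL can reference it.
    batch_df.createOrReplaceTempView("batch_updates")
    # Insert only rows whose id isn't already in the target table;
    # a replayed batch then matches every row and inserts nothing.
    batch_df.sparkSession.sql("""
        MERGE INTO my_catalog.db.events t
        USING batch_updates s
        ON t.id = s.id
        WHEN NOT MATCHED THEN INSERT *
    """)
```

Using `batch_df.sparkSession` (rather than a global session) keeps the temp view resolvable in the session the micro-batch actually belongs to.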
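For the second option, the best concrete version I can picture is a small bookkeeping table keyed by `batch_id` (the `processed_batch_ids` table name is made up):

```python
from pyspark.sql import DataFrame

PROCESSED = "my_catalog.db.processed_batch_ids"  # hypothetical bookkeeping table

def write_batch(batch_df: DataFrame, batch_id: int) -> None:
    spark = batch_df.sparkSession
    # Skip batches that were already committed before a failure/restart.
    seen = spark.sql(
        f"SELECT 1 FROM {PROCESSED} WHERE batch_id = {batch_id} LIMIT 1"
    ).count() > 0
    if seen:
        return
    batch_df.writeTo("my_catalog.db.events").append()
    # This second write is not atomic with the append above, so a crash
    # between the two still lets a replay duplicate the batch, which is
    # part of why I'm unsure this is the right approach.
    spark.sql(f"INSERT INTO {PROCESSED} VALUES ({batch_id})")
```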