
I'm using Spark Structured Streaming to append to a partitioned Iceberg table. I need to use foreachBatch (or foreach) because I'm using a custom Iceberg catalog implementation (the one from Google BigLake). The Spark docs say foreachBatch provides at-least-once guarantees, meaning it can replay the batch with the same batch_id when recovering from a failure. I don't want duplicate records in the Iceberg table partitions. What is the best way to avoid this? A couple of options I can think of are:

  1. Use MERGE INTO ... WHEN NOT MATCHED THEN INSERT *, so a row is inserted only when a record with its id doesn't already exist (see the first sketch below).
  2. Store the batch_id in a persistent set and check every new batch against it? I don't see any example of that, and I'm not even sure it's the right way to do idempotent updates (see the second sketch below).
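
For option 1, here is a minimal PySpark sketch of what the MERGE INTO approach could look like inside foreachBatch. The table name `biglake.db.events`, the `id` column, the Kafka source, and the checkpoint path are all placeholders, not your actual configuration; the merge is only idempotent if `id` uniquely identifies a record.

```python
from pyspark.sql import SparkSession, DataFrame

spark = SparkSession.builder.appName("iceberg-stream").getOrCreate()

def merge_batch(batch_df: DataFrame, batch_id: int) -> None:
    # Deduplicate within the micro-batch first, so the same id can't be
    # inserted twice from one batch (Iceberg's MERGE also requires each
    # target row to match at most one source row).
    batch_df.dropDuplicates(["id"]).createOrReplaceTempView("updates")
    # Insert only rows whose id is not already in the table, so a
    # replayed batch becomes a no-op. Use the batch DataFrame's own
    # session (DataFrame.sparkSession needs Spark 3.3+) so the temp
    # view is visible to the SQL statement.
    batch_df.sparkSession.sql("""
        MERGE INTO biglake.db.events t
        USING updates s
        ON t.id = s.id
        WHEN NOT MATCHED THEN INSERT *
    """)

query = (spark.readStream
              .format("kafka")                       # placeholder source
              .option("kafka.bootstrap.servers", "host:9092")
              .option("subscribe", "events")
              .load()
              .writeStream
              .foreachBatch(merge_batch)
              .option("checkpointLocation", "gs://bucket/checkpoints/events")
              .start())
```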
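For option 2, you may not need an external persistent set: Iceberg lets a write stamp custom key/value pairs into the commit's snapshot summary via the `snapshot-property.<key>` write option, so the batch_id is recorded atomically with the data and can be checked against the `snapshots` metadata table on replay. A sketch under the same placeholder names; the key `streaming-batch-id` is my own choice, not a standard property.

```python
from pyspark.sql import DataFrame, functions as F

TABLE = "biglake.db.events"  # placeholder table name

def already_committed(spark, batch_id: int) -> bool:
    # Each Iceberg snapshot carries a summary map; look for a snapshot
    # stamped with this batch id by a previous (pre-failure) attempt.
    return (spark.read.table(f"{TABLE}.snapshots")
                 .where(F.col("summary")["streaming-batch-id"] == str(batch_id))
                 .limit(1)
                 .count() > 0)

def write_batch(batch_df: DataFrame, batch_id: int) -> None:
    spark = batch_df.sparkSession  # DataFrame.sparkSession needs Spark 3.3+
    if already_committed(spark, batch_id):
        return  # replayed batch: the data is already committed, skip it
    (batch_df.writeTo(TABLE)
             # "snapshot-property.<key>" adds <key> to the snapshot summary
             # in the same atomic commit as the appended data files.
             .option("snapshot-property.streaming-batch-id", str(batch_id))
             .append())
```

One caveat: expiring old snapshots also discards their summaries, so the check only sees batch ids for snapshots that still exist; that is normally fine, since a replay after failure can only involve the most recent batch.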
