
I have a pipeline like this:

Kafka -> bronze -> silver

The bronze and silver tables are Delta tables. I'm streaming from bronze to silver using regular Spark Structured Streaming.
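
For context, the bronze -> silver stream looks roughly like this (`to_silver` and the paths are simplified placeholders, not my real code):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def to_silver(df):
    # Placeholder for my actual transformation to the silver schema.
    return df

# Continuous bronze -> silver stream with a checkpoint.
(spark.readStream
    .format("delta")
    .load("/mnt/lake/bronze")
    .transform(to_silver)
    .writeStream
    .format("delta")
    .option("checkpointLocation", "/mnt/lake/_checkpoints/silver")
    .outputMode("append")
    .start("/mnt/lake/silver"))
```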

I changed the silver schema, so I want to reload bronze into silver using the new schema. Unfortunately, the reload is taking forever, and I'm wondering if I can load the data more quickly with a batch job and then turn the stream back on.
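
The batch load I have in mind would pin a bronze snapshot version, then overwrite silver from that snapshot with the new schema; a sketch using the same placeholders (`versionAsOf` and `overwriteSchema` are standard Delta options):

```python
from delta.tables import DeltaTable

# Record the bronze version the backfill reads, so the restarted stream
# can start exactly one commit later with no gap and no overlap.
bronze_version = (DeltaTable.forPath(spark, "/mnt/lake/bronze")
                  .history(1).select("version").first()[0])

# One-off backfill: read that bronze snapshot as a batch, apply the same
# transform, and overwrite silver with the new schema.
(spark.read
    .format("delta")
    .option("versionAsOf", bronze_version)
    .load("/mnt/lake/bronze")
    .transform(to_silver)
    .write
    .format("delta")
    .mode("overwrite")
    .option("overwriteSchema", "true")  # the silver schema changed
    .save("/mnt/lake/silver"))
```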

I am concerned that the checkpoint will tell the bronze -> silver stream to pick up where it left off, and it will write a bunch of duplicates that I will then need to remove. Is there a way I can advance the checkpoint past the batch load, or play other tricks?
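
The closest thing to "advancing the checkpoint" I can come up with is to abandon the old checkpoint entirely: restart the stream with a fresh checkpoint location and point the Delta source at the first commit after the backfill snapshot via `startingVersion`. A sketch, reusing `bronze_version` from above:

```python
# With a brand-new checkpoint, the old offsets are never consulted;
# startingVersion makes the stream begin at the first commit the
# backfill did not cover, so nothing is duplicated.
(spark.readStream
    .format("delta")
    .option("startingVersion", bronze_version + 1)
    .load("/mnt/lake/bronze")
    .transform(to_silver)
    .writeStream
    .format("delta")
    .option("checkpointLocation", "/mnt/lake/_checkpoints/silver_v2")  # new path
    .outputMode("append")
    .start("/mnt/lake/silver"))
```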

Will that be faster than just letting the stream run? I get the feeling the stream is spending a lot of resources committing micro-batch transactions.
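
If the answer is just to let the stream run, I'd at least want it to take bigger bites per commit. A sketch, assuming Spark 3.3+ for `availableNow` (`maxFilesPerTrigger` raises the Delta source's per-batch file cap from its default of 1000):

```python
# Drain the whole bronze backlog in large micro-batches, then stop;
# fewer, bigger commits means less per-transaction overhead.
(spark.readStream
    .format("delta")
    .option("maxFilesPerTrigger", 10000)
    .load("/mnt/lake/bronze")
    .transform(to_silver)
    .writeStream
    .format("delta")
    .option("checkpointLocation", "/mnt/lake/_checkpoints/silver")
    .trigger(availableNow=True)
    .outputMode("append")
    .start("/mnt/lake/silver"))
```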

Any suggestions greatly appreciated!!!
