I have a pipeline like this:
kafka->bronze->silver
The bronze and silver tables are Delta tables. I'm streaming from bronze to silver using regular Spark Structured Streaming.
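For context, the bronze->silver job looks roughly like this (the paths, the projection, and the checkpoint location are placeholders, not my real names):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Stream new rows out of the bronze Delta table.
bronze = (spark.readStream
          .format("delta")
          .load("/delta/bronze"))            # placeholder path

# Project to the silver schema and append into the silver Delta table.
query = (bronze
         .selectExpr("id", "payload", "ingest_ts")  # stand-in for the real silver projection
         .writeStream
         .format("delta")
         .option("checkpointLocation", "/checkpoints/bronze_to_silver")
         .outputMode("append")
         .start("/delta/silver"))            # placeholder path
```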
I changed the silver schema, so I want to reload silver from bronze using the new schema. Unfortunately, the reload is taking forever, and I'm wondering whether I can load the historical data more quickly with a batch job and then turn the stream back on.
I'm concerned that the checkpoint will tell the bronze->silver stream to pick up where it left off, so it would re-write everything the batch job already loaded and leave a bunch of duplicates I'd then have to remove. Is there a way to advance the checkpoint past the batch load, or play some other trick?
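Concretely, here is the kind of trick I have in mind: batch-copy bronze as of a pinned Delta version, then start a brand-new stream with a fresh checkpoint that only reads bronze commits after that version, via Delta's `versionAsOf` batch option and `startingVersion` streaming option. This is just a sketch with placeholder paths and the same placeholder projection as above:

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Pin the bronze version the backfill will cover, so the stream can resume after it.
backfill_version = (DeltaTable.forPath(spark, "/delta/bronze")
                    .history(1)
                    .select("version")
                    .collect()[0][0])

# One big batch write instead of thousands of micro-batch commits.
(spark.read
    .format("delta")
    .option("versionAsOf", backfill_version)
    .load("/delta/bronze")
    .selectExpr("id", "payload", "ingest_ts")
    .write
    .format("delta")
    .mode("overwrite")
    .option("overwriteSchema", "true")   # rewrite silver with the new schema
    .save("/delta/silver"))

# Restart streaming with a NEW checkpoint, reading only bronze commits
# made after the version the batch job already copied.
(spark.readStream
    .format("delta")
    .option("startingVersion", backfill_version + 1)
    .load("/delta/bronze")
    .selectExpr("id", "payload", "ingest_ts")
    .writeStream
    .format("delta")
    .option("checkpointLocation", "/checkpoints/bronze_to_silver_v2")  # fresh checkpoint
    .outputMode("append")
    .start("/delta/silver"))
```

Is that safe, or am I missing a gotcha (e.g. if bronze has updates/deletes rather than pure appends)?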
Will that actually be faster than just letting the stream run? I get the feeling the stream is spending most of its resources committing lots of small micro-batch transactions rather than moving data.
Any suggestions greatly appreciated!!!