I have a pipeline like this:
kafka->bronze->silver
The bronze and silver tables are Delta tables. I'm streaming from bronze to silver using regular Spark Structured Streaming.
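For context, the bronze->silver job looks roughly like this (the paths, the projection, and the checkpoint location are placeholders, not my real names):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Stream new rows out of the bronze Delta table.
bronze = (spark.readStream
          .format("delta")
          .load("/delta/bronze"))            # placeholder path

# Project to the silver schema and append into the silver Delta table.
query = (bronze
         .selectExpr("id", "payload", "ingest_ts")  # stand-in for the real silver projection
         .writeStream
         .format("delta")
         .option("checkpointLocation", "/checkpoints/bronze_to_silver")
         .outputMode("append")
         .start("/delta/silver"))            # placeholder path
```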
I changed the silver schema, so I want to reload silver from bronze using the new schema. Unfortunately, the reload is taking forever, and I'm wondering whether I can load the historical data more quickly with a batch job and then turn the stream back on.
I'm concerned that the checkpoint will tell the bronze->silver stream to pick up where it left off, so it would re-write everything the batch job already loaded and leave a bunch of duplicates I'd then have to remove. Is there a way to advance the checkpoint past the batch load, or play some other trick?
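Concretely, here is the kind of trick I have in mind: batch-copy bronze as of a pinned Delta version, then start a brand-new stream with a fresh checkpoint that only reads bronze commits after that version, via Delta's `versionAsOf` batch option and `startingVersion` streaming option. This is just a sketch with placeholder paths and the same placeholder projection as above:

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Pin the bronze version the backfill will cover, so the stream can resume after it.
backfill_version = (DeltaTable.forPath(spark, "/delta/bronze")
                    .history(1)
                    .select("version")
                    .collect()[0][0])

# One big batch write instead of thousands of micro-batch commits.
(spark.read
    .format("delta")
    .option("versionAsOf", backfill_version)
    .load("/delta/bronze")
    .selectExpr("id", "payload", "ingest_ts")
    .write
    .format("delta")
    .mode("overwrite")
    .option("overwriteSchema", "true")   # rewrite silver with the new schema
    .save("/delta/silver"))

# Restart streaming with a NEW checkpoint, reading only bronze commits
# made after the version the batch job already copied.
(spark.readStream
    .format("delta")
    .option("startingVersion", backfill_version + 1)
    .load("/delta/bronze")
    .selectExpr("id", "payload", "ingest_ts")
    .writeStream
    .format("delta")
    .option("checkpointLocation", "/checkpoints/bronze_to_silver_v2")  # fresh checkpoint
    .outputMode("append")
    .start("/delta/silver"))
```

Is that safe, or am I missing a gotcha (e.g. if bronze has updates/deletes rather than pure appends)?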
Will that actually be faster than just letting the stream run? I get the feeling the stream is spending most of its resources committing lots of small micro-batch transactions rather than moving data.
Any suggestions greatly appreciated!!!