I am designing a streaming pipeline that consumes events from a Kafka topic. A single topic can carry data from around 1,000 tables, with each message arriving as a JSON record. I have the following problems to solve.
- Reroute each message to a separate folder based on its table: this is currently done with Spark Structured Streaming, using `partitionBy` on the table name (see the first sketch after this list).
- Second, I want to parse each JSON record, attach the appropriate table schema to it, and create/append/update the corresponding Delta table. Since the records arrive as JSON strings, I need to resolve the schema dynamically and write to the right Delta table dynamically, and I have not found a good way to do this. How can it be done? (The second sketch after this list shows roughly what I have in mind.)
- Since I have to process so many tables, do I need to write that many separate streaming queries, or can this be handled with a single query? How should this be solved?
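For reference, here is a minimal sketch of the rerouting step I have today. The broker, topic, paths, and the assumption that the table name is in a `table` field of the JSON payload are all placeholders:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, get_json_object

spark = SparkSession.builder.appName("kafka-reroute").getOrCreate()

# Read the multiplexed topic; broker and topic names are placeholders.
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")
       .option("subscribe", "source-topic")
       .load())

# Extract the table name from the JSON string (field name is an assumption).
events = (raw
          .selectExpr("CAST(value AS STRING) AS json_str")
          .withColumn("table_name", get_json_object(col("json_str"), "$.table")))

# Land the raw records partitioned by table name (not yet Delta per table).
query = (events.writeStream
         .format("parquet")
         .option("checkpointLocation", "/chk/reroute")   # placeholder path
         .partitionBy("table_name")
         .start("/landing/events"))                      # placeholder path
```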
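And this is roughly the direction I am imagining for the second and third points, but I am not sure it is the right approach or that it scales to ~1,000 tables: a single stream with `foreachBatch`, parsing each table's slice of the micro-batch with its own schema and appending to its own Delta table. The `schemas` dictionary and the Delta paths below are made-up placeholders, and `events` is the DataFrame from the sketch above:

```python
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType

# Hypothetical per-table schemas; in reality ~1,000 entries loaded from some registry.
schemas = {
    "orders": StructType([StructField("id", StringType()),
                          StructField("amount", StringType())]),
    # ...
}

def write_batch(batch_df, batch_id):
    # Split the micro-batch by table and append each slice to its own Delta table.
    tables = [r["table_name"] for r in batch_df.select("table_name").distinct().collect()]
    for t in tables:
        schema = schemas.get(t)
        if schema is None:
            continue  # unknown table; skip or route to a dead-letter location
        parsed = (batch_df.filter(col("table_name") == t)
                  .select(from_json(col("json_str"), schema).alias("data"))
                  .select("data.*"))
        (parsed.write
         .format("delta")
         .mode("append")
         .save(f"/delta/{t}"))           # placeholder base path

# One streaming query for all tables instead of one query per table.
query = (events.writeStream
         .foreachBatch(write_batch)
         .option("checkpointLocation", "/chk/multiplex")  # placeholder path
         .start())
```

Is this `foreachBatch` fan-out a reasonable way to avoid 1,000 streaming queries, or is there a better pattern?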
Thanks