
I am designing a streaming pipeline where I need to consume events from a Kafka topic. A single Kafka topic can carry data from around 1000 tables, with each message arriving as a JSON record. I now have the following problems to solve.

  1. Route each message to a separate folder based on its table: this is done using Spark Structured Streaming with `partitionBy` on the table name (a rough sketch is shown after this list).
  2. Parse each JSON record, attach the appropriate table schema to it, and create/append/update the corresponding Delta table. Since the records arrive as JSON strings, the schema has to be applied dynamically and the write has to target the right Delta table dynamically. I have not been able to find a good solution for this. How can it be done? (My current idea is sketched after this list as well.)
  3. Since I have to process so many tables, do I need to write that many streaming queries? How can this be avoided?
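
For point 1, this is roughly what I am doing today (only a sketch; the broker address, topic name, the `table_name` field, and the output paths are placeholders for my actual setup):

```python
# Rough sketch for point 1: land the raw stream partitioned by table.
# Broker, topic, field and path names below are placeholders for my setup.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")
       .option("subscribe", "cdc_topic")
       .load())

# The Kafka value is a JSON string; the table name is assumed to be a
# top-level field called "table_name".
events = raw.select(
    F.col("value").cast("string").alias("json_str"),
    F.get_json_object(F.col("value").cast("string"), "$.table_name").alias("table_name"),
)

# One folder per table under the raw landing path.
(events.writeStream
 .format("parquet")
 .partitionBy("table_name")
 .option("path", "/mnt/raw/events")
 .option("checkpointLocation", "/mnt/raw/_checkpoints/events")
 .start())
```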
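
For points 2 and 3, the best idea I have so far is a single query with `foreachBatch`, splitting each micro-batch by table and applying a per-table schema before writing to Delta. This is only a sketch under assumptions: the `schemas` dict, the `table_name` field, and the Delta paths are placeholders, it reuses the `events` DataFrame from the Kafka-read sketch, and it assumes the Delta Lake package is available.

```python
# Sketch for points 2 and 3: a single streaming query, splitting each
# micro-batch by table inside foreachBatch. "schemas", "table_name" and the
# Delta paths are placeholders; `events` comes from the Kafka-read sketch.
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, LongType

# Placeholder: in practice this would hold the real schema of each of the
# ~1000 tables (loaded from a metastore, config files, etc.).
schemas = {
    "customers": StructType([
        StructField("id", LongType()),
        StructField("name", StringType()),
    ]),
}

def write_tables(batch_df, batch_id):
    batch_df.persist()
    tables = [r["table_name"] for r in
              batch_df.select("table_name").distinct().collect()]
    for t in tables:
        parsed = (batch_df
                  .filter(F.col("table_name") == t)
                  .select(F.from_json("json_str", schemas[t]).alias("data"))
                  .select("data.*"))
        # Append only; updates would need a Delta MERGE here instead.
        parsed.write.format("delta").mode("append").save(f"/mnt/delta/{t}")
    batch_df.unpersist()

(events.writeStream
 .foreachBatch(write_tables)
 .option("checkpointLocation", "/mnt/delta/_checkpoints/all_tables")
 .start())
```

This way there would be only one streaming query and one checkpoint instead of one per table, but I am not sure it is the right approach.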

Thanks

Rahul
  • Why is everything in a single topic? – OneCricketeer Aug 22 '21 at 13:06
  • There are some constraints. But forget about all the tables: even if we use 100 tables per topic, the problem remains the same. – Rahul Aug 22 '21 at 13:42
  • I assume you mean 100 topics? Then, no, the problem is different, because you would have a single schema for the entire topic and could combine that with something like Confluent Schema Registry to dynamically provision the DataFrame schemas. That solution involves using Avro, not JSON. – OneCricketeer Aug 22 '21 at 13:48
  • I mean to say a single topic with multiple tables. – Rahul Aug 22 '21 at 14:59
  • Right, I understood that from the question, and the number ultimately doesn't matter, because that design simply won't scale for any consumer interested in only a few specific tables. You should ideally have one topic per table, for the reasons I gave about the schema and, secondly, about the consumers. To answer the question directly, though, you'd simply need to write 1000 `df = source.filter` + `df.write` statements. And, as stated, I suggest you use Avro, which carries its schema as part of the data, rather than dynamically generating schemas from the JSON messages themselves. – OneCricketeer Aug 23 '21 at 05:02
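
For reference, a minimal sketch of the filter-per-table pattern the last comment describes (it reuses the hypothetical `events` DataFrame and `schemas` dict from the sketches in the question; the suggested Avro + Schema Registry variant would need extra tooling and is not shown):

```python
# One filtered stream and one write per table, as suggested in the comment.
# Field, path and schema names are placeholders; `events` and `schemas`
# come from the sketches in the question.
from pyspark.sql import functions as F

for t, schema in schemas.items():
    per_table = (events
                 .filter(F.col("table_name") == t)
                 .select(F.from_json("json_str", schema).alias("data"))
                 .select("data.*"))
    (per_table.writeStream
     .format("delta")
     .outputMode("append")
     .option("checkpointLocation", f"/mnt/delta/_checkpoints/{t}")
     .start(f"/mnt/delta/{t}"))
```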
