
I am new to Kafka. Currently I am working on the following requirement.

Use case: I am consuming messages from Kafka (the messages are produced into Kafka by an upstream team). The upstream team doesn't maintain schema versions and hasn't implemented a schema registry.

They have simply provided an API to which I pass the client ID and table name; calling the API gives me the schema of the messages, which I write to S3 as JSON and use for parsing the messages I consume from Kafka.
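A minimal sketch of that flow, assuming a GET endpoint that takes the client ID and table name as query parameters, plus a bucket name and key layout I made up purely for illustration:

```python
import json

import boto3
import requests

SCHEMA_API = "https://upstream.example.com/schema"  # hypothetical endpoint
S3_BUCKET = "my-schema-bucket"                      # hypothetical bucket

s3 = boto3.client("s3")

def fetch_and_store_schema(client_id: str, table_name: str) -> dict:
    """Call the upstream schema API and persist the response to S3 as JSON."""
    resp = requests.get(SCHEMA_API, params={"clientId": client_id, "table": table_name})
    resp.raise_for_status()
    schema = resp.json()

    # Store one schema file per client/table pair (assumed key layout).
    key = f"schemas/{client_id}/{table_name}.json"
    s3.put_object(Bucket=S3_BUCKET, Key=key, Body=json.dumps(schema, sort_keys=True))
    return schema
```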

Problem: Now let's say the upstream team adds new columns to a particular table. What would be an optimal way to detect that the schema has changed on the source side, so that I can call the API again, store the latest schema in S3, and then start consuming the messages?

What I have tried

Before consuming messages from Kafka, I run a script that fetches the response from the API, computes a hash of it, and compares it with the hash of the schema JSON file already in S3. If the hashes don't match, I overwrite the file in S3 with the latest schema.
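For reference, the comparison step looks roughly like this. It reuses the hypothetical bucket and key layout from the snippet above; schema_hash, load_stored_schema, and schema_changed are illustrative helper names, not part of the real setup.

```python
import hashlib
import json

import boto3

S3_BUCKET = "my-schema-bucket"  # hypothetical bucket, same layout as above

s3 = boto3.client("s3")

def schema_hash(schema: dict) -> str:
    """Stable digest of a schema dict; sorted keys keep the hash order-independent."""
    return hashlib.sha256(json.dumps(schema, sort_keys=True).encode()).hexdigest()

def load_stored_schema(client_id: str, table_name: str):
    """Read the previously stored schema JSON from S3, or None if it isn't there yet."""
    key = f"schemas/{client_id}/{table_name}.json"
    try:
        obj = s3.get_object(Bucket=S3_BUCKET, Key=key)
    except s3.exceptions.NoSuchKey:
        return None
    return json.loads(obj["Body"].read())

def schema_changed(latest: dict, client_id: str, table_name: str) -> bool:
    """True when there is no stored copy yet or the hashes differ."""
    stored = load_stored_schema(client_id, table_name)
    return stored is None or schema_hash(latest) != schema_hash(stored)
```

So for every client/table pair the script has to call the API once just to decide whether anything changed, which is where the ~3,600 calls below come from.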

Issue

I have to do this for 300 clients, and each client has 10-12 tables. This results in roughly 3,600 API calls at a single point in time. The API won't be able to handle that much load, and this logic doesn't look optimal to me.

I am trying to come up with logic that reduces the number of API calls while still detecting that schema evolution has happened on the source side.

Has anyone here faced this type of scenario before? What would be the best approach here?
