Auto-resolve schema by Confluent magic bytes using ABRiS for Spark

Question

Is there a way to automatically resolve the schema from the header of each message (the leading magic byte followed by the schema id for that message)?

As we know, Confluent Avro prepends a 5-byte header to each message: a magic byte (always 0) followed by a 4-byte schema id. So each message carries its own schema id. ABRiS adds this header when encoding a message, to support the Confluent format.
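To make the framing concrete, here is a minimal sketch in plain Python of splitting one Confluent-framed value into its schema id and Avro payload. It assumes the documented wire format (one magic byte equal to 0, then a 4-byte big-endian schema id); the function name `split_confluent_message` and the sample payload are made up for illustration and are not part of ABRiS or any Confluent client.

```python
import struct
from typing import Tuple

MAGIC_BYTE = 0   # Confluent wire format: the first byte is always 0
HEADER_SIZE = 5  # 1 magic byte + 4-byte big-endian schema id


def split_confluent_message(value: bytes) -> Tuple[int, bytes]:
    """Return (schema_id, avro_payload) for one Confluent-framed value.

    Hypothetical helper for illustration only.
    """
    if len(value) < HEADER_SIZE:
        raise ValueError("value is shorter than the 5-byte Confluent header")
    # ">bI" = big-endian, one signed byte, one unsigned 32-bit integer
    magic, schema_id = struct.unpack(">bI", value[:HEADER_SIZE])
    if magic != MAGIC_BYTE:
        raise ValueError(f"unexpected magic byte: {magic}")
    return schema_id, value[HEADER_SIZE:]


# Made-up example: magic byte 0, schema id 3, then placeholder payload bytes.
framed = b"\x00" + (3).to_bytes(4, "big") + b"payload"
schema_id, payload = split_confluent_message(framed)
```

The same five bytes are what any Spark-side extraction would have to read from the Kafka `value` column before handing the rest to an Avro decoder.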

However, when decoding Confluent-encoded messages, we must manually pass the schema configuration beforehand, which makes it hard to support cases where we can receive messages with different schemas (think of RecordNameStrategy or even schema evolution).

I could implement a solution that manually parses the header of each message and dynamically constructs the schema configuration for each message (or for each group of messages, to avoid creating millions of config objects).
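The grouping idea could look roughly like the following sketch, written in plain Python rather than Spark so it stays self-contained. It assumes the Confluent header layout (magic byte 0, then a 4-byte big-endian schema id). `make_decoder` is a hypothetical callback standing in for whatever builds a per-schema configuration (for example, a schema registry lookup); it is not an ABRiS API.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, Iterable, List


def group_by_schema_id(values: Iterable[bytes]) -> Dict[int, List[bytes]]:
    """Bucket Confluent-framed values by the schema id in their header."""
    groups: Dict[int, List[bytes]] = defaultdict(list)
    for value in values:
        if len(value) < 5 or value[0] != 0:
            raise ValueError("not a Confluent-framed value")
        schema_id = int.from_bytes(value[1:5], "big")
        groups[schema_id].append(value)
    return dict(groups)


def decode_batch(
    values: Iterable[bytes],
    make_decoder: Callable[[int], Callable[[bytes], Any]],
) -> List[Any]:
    """Decode a batch, building one decoder per distinct schema id.

    make_decoder is hypothetical: it maps a schema id to a function that
    decodes the Avro payload (the bytes after the 5-byte header).
    """
    decoded: List[Any] = []
    for schema_id, group in group_by_schema_id(values).items():
        decoder = make_decoder(schema_id)  # created once per id, not per message
        decoded.extend(decoder(value[5:]) for value in group)
    return decoded
```

With this shape, a batch containing millions of messages but only four schema ids builds four decoders. Note that the output is ordered by group, not by the original message order; a real job would need to carry offsets or keys along if ordering matters.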

Is there something like this out of the box that I am missing?

As an example use case, suppose in a micro-batch we read from a topic using RecordNameStrategy and in the same batch we receive different messages with different schemas and also slightly different versions for the same schema subject:

message value	schema subject	schema version	schema id embedded in the message magic byte
`*confluent-avro*`	my.record.type1	v1	1
`*confluent-avro*`	my.record.type1	v2	2
`*confluent-avro*`	my.record.type2	v1	3
`*confluent-avro*`	my.record.type2	v2	4

Thank you very much.

You can extract the ID manually using pure Spark functions, and with that, use your own UDF function to download the schema for that ID, but what's the exact use case? Getting the latest version for any subject should suffice, assuming the data is fully backwards compatible — OneCricketeer, Jun 20 '22 at 13:57
See for example https://github.com/AbsaOSS/ABRiS#confluent-avro-format — YFl, Jul 27 '22 at 10:16
Here it is documented https://docs.confluent.io/platform/current/schema-registry/serdes-develop/index.html#wire-format — YFl, Jul 27 '22 at 10:19
@Tal, in Confluent serialization the schema id is embedded in the Kafka message. So the Avro-serialized value would look something like 0, then the id, then your value. See the links in my previous comments. — YFl, Jul 27 '22 at 10:22
I want to download the schema version according to the schema id I received from Kafka, but I still can't find how to get the schema id. Thank you! — Tal, Jul 27 '22 at 13:49
