
Is there a way to automatically resolve the schema for each message from its leading header (the magic byte followed by the schema id for that message)?

As we know, the Confluent Avro wire format prepends a magic byte and the schema id to each message, so each message carries its own schema id. ABRiS adds this header (the magic byte plus the schema id) when encoding messages, to support the Confluent format.

However, when decoding Confluent-encoded messages, we must manually pass the schema configuration beforehand, which makes it hard to support cases where we receive messages with different schemas (think of the record name strategy or even schema evolution).

I could implement a solution that manually parses the header of each message and dynamically constructs the schema configuration for each message (or for each group of messages, to avoid creating millions of config objects).
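
For illustration, the manual approach could look roughly like the sketch below. It is plain Python rather than Spark or ABRiS code, and it assumes the Confluent wire format (one magic byte equal to 0, followed by a 4-byte big-endian schema id). The names `split_confluent_message` and `group_by_schema_id` are hypothetical helpers, not part of any library.

```python
import struct
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

MAGIC_BYTE = 0    # Confluent wire format: the first byte is always 0
HEADER_SIZE = 5   # 1 magic byte + 4-byte big-endian schema id


def split_confluent_message(payload: bytes) -> Tuple[int, bytes]:
    """Return (schema_id, avro_body) for one Confluent-framed message."""
    if len(payload) < HEADER_SIZE:
        raise ValueError("payload is shorter than the 5-byte Confluent header")
    # ">BI" = big-endian, one unsigned byte, one unsigned 32-bit integer
    magic, schema_id = struct.unpack(">BI", payload[:HEADER_SIZE])
    if magic != MAGIC_BYTE:
        raise ValueError(f"unexpected magic byte: {magic}")
    return schema_id, payload[HEADER_SIZE:]


def group_by_schema_id(payloads: Iterable[bytes]) -> Dict[int, List[bytes]]:
    """Group Avro bodies by schema id, so one reader config can serve each group."""
    groups: Dict[int, List[bytes]] = defaultdict(list)
    for payload in payloads:
        schema_id, body = split_confluent_message(payload)
        groups[schema_id].append(body)
    return dict(groups)
```

With the messages grouped this way, only one schema configuration per distinct schema id would be needed, rather than one per message.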

Is there something like this out of the box that I am missing?


As an example use case, suppose in a micro-batch we read from a topic using RecordNameStrategy and in the same batch we receive different messages with different schemas and also slightly different versions for the same schema subject:

| message value | schema subject | schema version | schema id embedded in the message magic byte |
|---|---|---|---|
| ***confluent-avro*** | my.record.type1 | v1 | 1 |
| ***confluent-avro*** | my.record.type1 | v2 | 2 |
| ***confluent-avro*** | my.record.type2 | v1 | 3 |
| ***confluent-avro*** | my.record.type2 | v2 | 4 |

Thank you very much.

YFl
  • You can extract the ID manually using pure Spark functions, and with that, use your own UDF function to download the schema for that ID, but what's the exact use case? Getting the latest version for any subject should suffice, assuming the data is fully backwards compatible – OneCricketeer Jun 20 '22 at 13:57
  • I still didn't get how I could receive the schema ID? – Tal Jul 27 '22 at 09:09
  • See for example https://github.com/AbsaOSS/ABRiS#confluent-avro-format – YFl Jul 27 '22 at 10:16
  • Here it is documented https://docs.confluent.io/platform/current/schema-registry/serdes-develop/index.html#wire-format – YFl Jul 27 '22 at 10:19
  • @Tal, in Confluent serialization it is embedded in the Kafka message. So the Avro-serialized value would look something like [0][schema id][your value]: the magic byte 0, then the schema id, then the serialized value. See the links in my previous comments. – YFl Jul 27 '22 at 10:22
  • I want to download the schema version according to the schema id I received from Kafka, but I still can't find out how to get the schema id. Thank you! – Tal Jul 27 '22 at 13:49
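
As a rough sketch of the lookup discussed in these comments: once the schema id has been read from the message header, the schema itself could be fetched from the Schema Registry REST endpoint `GET /schemas/ids/{id}`, whose JSON response carries the schema as a string in its `schema` field. The registry address and the helper names below are assumptions made for illustration; this is plain Python, not ABRiS or Spark code.

```python
import json
from functools import lru_cache
from urllib.request import urlopen

# Assumption: the registry address is a placeholder, adjust to your setup.
REGISTRY_URL = "http://localhost:8081"


def schema_url(schema_id, base_url=REGISTRY_URL):
    """Build the Schema Registry URL for looking up a schema by its id."""
    return f"{base_url.rstrip('/')}/schemas/ids/{schema_id}"


def parse_schema_response(body):
    """Extract the Avro schema from the registry's JSON response.

    The response looks like {"schema": "<schema as a JSON string>"},
    so the inner string is decoded a second time.
    """
    return json.loads(json.loads(body)["schema"])


@lru_cache(maxsize=None)
def fetch_schema_response(schema_id, base_url=REGISTRY_URL):
    """Download the raw response once per (schema id, registry) pair."""
    with urlopen(schema_url(schema_id, base_url)) as resp:
        return resp.read().decode("utf-8")
```

Something like `parse_schema_response(fetch_schema_response(3))` would then return the schema for id 3, and the cache keeps repeated ids in a batch from hitting the registry again.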
