
I've found this question: Apache Kafka with Avro and Schema Repo - where in the message does the schema Id go?

It's clear to me that the schema ID is part of every message produced to Kafka. I can't find it stated explicitly, but I assume that not only the schema ID but also the schema version is encoded in the message?

If not, I wonder why. Consumers need to know not only the schema ID but also the exact version in order to deserialise a message. I know a bit about the schema-registry compatibility settings that can ensure consumers with an older schema can still read messages produced with a newer one, but why would anyone even rely on that if the exact version could be part of the message together with the ID?

welcomeboredom

1 Answer


The registry server maintains a global ID. That ID, combined with the subject name (by default, the topic name suffixed with -key or -value), maps to a version. In the API, you can see that mapping at the /subjects/:name/versions/:version endpoint.
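To make "where in the message" concrete: assuming the standard Confluent wire format, each serialized value is prefixed with a single magic byte (0) and the 4-byte big-endian global ID, followed by the raw Avro payload; no version is stored. A minimal sketch of pulling that ID out of a record's value bytes (class and method names are mine):

```java
import java.nio.ByteBuffer;

public class WireFormat {
    // Extracts the global schema ID from a record's value bytes,
    // assuming the Confluent wire format:
    //   byte 0     -> magic byte (always 0 in the current format)
    //   bytes 1-4  -> 4-byte big-endian global schema ID
    //   bytes 5+   -> Avro-encoded payload
    public static int schemaId(byte[] value) {
        ByteBuffer buffer = ByteBuffer.wrap(value);
        byte magic = buffer.get();
        if (magic != 0) {
            throw new IllegalArgumentException("Unknown magic byte: " + magic);
        }
        return buffer.getInt(); // only the global ID; no version field exists
    }
}
```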

The consumer client doesn't require a version to perform deserialization; it uses /schemas/ids/:id directly. A consumer can optionally create or download a specific version of a schema, after which the Avro spec defines what happens under its evolution rules, using the writer (server-side) and reader (client-side) schemas.
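As a sketch of what happens after the client fetches the writer schema by ID: Avro's own resolution rules reconcile the writer schema (the one returned by /schemas/ids/:id) with whatever reader schema the consumer expects. The `decode` helper below is illustrative, not a registry client API:

```java
import java.io.IOException;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.DecoderFactory;

public class Resolution {
    // Deserializes an Avro payload written with writerSchema into the shape
    // the consumer expects (readerSchema). Avro's schema-resolution rules
    // handle added/removed fields, defaults, type promotions, etc.
    public static GenericRecord decode(byte[] avroPayload,
                                       Schema writerSchema,
                                       Schema readerSchema) throws IOException {
        GenericDatumReader<GenericRecord> reader =
                new GenericDatumReader<>(writerSchema, readerSchema);
        BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(avroPayload, null);
        return reader.read(null, decoder);
    }
}
```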

And a producer, by default, will always send its schema to the registry, where it becomes the "latest" version unless the same schema was registered before (it is MD5-hashed for comparison). This hits the POST /subjects/:name/versions endpoint to upload a new version (after it is compared against existing ones server-side).
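A minimal sketch of that producer side, assuming Confluent's KafkaAvroSerializer and placeholder broker/registry addresses; `auto.register.schemas` defaults to true, which is what triggers the POST described above:

```java
import java.util.Properties;
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.producer.KafkaProducer;

public class ProducerSetup {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");          // assumed broker address
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "io.confluent.kafka.serializers.KafkaAvroSerializer");
        props.put("schema.registry.url", "http://localhost:8081"); // assumed registry address
        // true is the default: the serializer registers the schema under
        // <topic>-value and embeds only the returned global ID in each record.
        props.put("auto.register.schemas", "true");

        KafkaProducer<String, GenericRecord> producer = new KafkaProducer<>(props);
        producer.close();
    }
}
```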

OneCricketeer
  • Thank you, I didn't notice that the ID attached to each message is global. I had assumed the ID == subject. – welcomeboredom Aug 15 '23 at 18:04
  • Maybe I can ask you to explain what the purpose of schema backwards/forward compatibility is, then? I've seen mentions of 'if we need to rewind the consumer', but even in a situation like that, every message comes with its exact schema (subject + version). The consumer could just use multiple schemas to correctly deserialise all topic content and not hope that things are backwards compatible (for example). – welcomeboredom Aug 15 '23 at 18:08
  • An example for backwards compatibility is writing the data to HDFS/S3. In those cases, there is no ID to look up; the actual schema is placed into the files. While you can provide a "newer schema" to read all historical data, there is no straightforward way to use only the "most recent version" without downloading it from the Registry first, or otherwise supplying some "compatible schema" manually, as can be done with Apache Hive. Kafka consumers work similarly when using `SpecificRecord` Avro subclasses. – OneCricketeer Aug 15 '23 at 22:31
  • And "multiple versions" only works for `GenericRecord` types, however there is no way to conditionally check the version with numeric checks, only by (string) field-name existence. For forward compatibility, it gives more change control to consumers, in that they can change the schema to what they expect without relying on (or being broken by) producer changes. Imagine these are separate teams that are interested in different fields of the payloads – OneCricketeer Aug 15 '23 at 22:32