25

I want to use Avro to serialize the data for my Kafka messages and would like to use it with an Avro schema repository so I don't have to include the schema with every message.

Using Avro with Kafka seems like a popular thing to do, and lots of blogs / Stack Overflow questions / user groups etc. reference sending the schema id with the message, but I cannot find an actual example of where it should go.

I think it should go in the Kafka message header somewhere, but I cannot find an obvious place. If it were in the Avro message itself, you would have to decode it against a schema to get the message contents and reveal the schema you need to decode against, which has obvious problems.

I am using the C# client but an example in any language would be great. The message class has these fields:

public MessageMetadata Meta { get; set; }
public byte MagicNumber { get; set; }
public byte Attribute { get; set; }
public byte[] Key { get; set; }
public byte[] Value { get; set; }

but none of these seems correct. The MessageMetadata class only has Offset and PartitionId.

So, where should the Avro Schema Id go?

jheppinstall
  • 2,338
  • 4
  • 23
  • 27
  • Perhaps the most popular format for sending Avro messages to Kafka is [Confluent Wire Format](https://docs.confluent.io/platform/current/schema-registry/fundamentals/serdes-develop/index.html#wire-format). It's implemented in Confluent's `KafkaAvroSerializer` / `KafkaAvroDeserializer`, whose behaviour is described in @serejja's answer. – Ilya Serbis Apr 13 '23 at 22:06

1 Answer

39

The schema id is actually encoded in the Avro message itself. Take a look at this to see how encoders/decoders are implemented.

In general what's happening when you send an Avro message to Kafka:

  1. The encoder gets the schema from the object to be encoded.
  2. The encoder asks the schema registry for an id for this schema. If the schema is already registered you'll get the existing id; if not, the registry registers the schema and returns a new id.
  3. The object gets encoded as [magic byte][schema id][actual message], where the magic byte is just a 0x0 byte used to distinguish this kind of message, the schema id is a 4-byte integer value, and the rest is the actual encoded message.
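The three steps above boil down to prepending a 5-byte header to the Avro payload. A minimal Python sketch of just the framing (the schema id and payload bytes here are made up for illustration; a real producer would use Confluent's serializers):

```python
import struct

MAGIC_BYTE = 0x00  # marks a schema-registry-framed message

def frame_message(schema_id: int, avro_payload: bytes) -> bytes:
    # 1 magic byte + 4-byte big-endian schema id, then the Avro-encoded bytes
    return struct.pack(">bI", MAGIC_BYTE, schema_id) + avro_payload

framed = frame_message(42, b"avro-bytes")
# framed[0] is 0x0 and framed[1:5] holds 42 as a big-endian integer
```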

When you decode the message back here's what happens:

  1. The decoder reads the first byte and makes sure it is 0x0.
  2. The decoder reads the next 4 bytes and converts them to an integer value; this is the schema id.
  3. Now that the decoder has the schema id, it can ask the schema registry for the actual schema for that id. Voila!
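The decoding side mirrors the framing exactly. A sketch, again assuming the 1-byte-magic / 4-byte-id layout described above (the registry lookup in step 3 is left out):

```python
import struct

def unframe_message(framed: bytes) -> tuple[int, bytes]:
    # Steps 1 and 2: verify the magic byte, pull out the 4-byte schema id
    magic, schema_id = struct.unpack(">bI", framed[:5])
    if magic != 0x00:
        raise ValueError(f"unexpected magic byte: {magic}")
    # Step 3 would look schema_id up in the registry and decode the rest
    return schema_id, framed[5:]

schema_id, payload = unframe_message(b"\x00\x00\x00\x00\x2a" + b"avro-bytes")
# schema_id is 42; payload is the still-encoded Avro message
```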

If your key is Avro encoded then your key will be in the format described above. The same applies to the value. This way your key and value can both be Avro values and use different schemas.

Edit to answer the question in comment:

The actual schema is stored in the schema repository (that is the whole point of a schema repository, actually: to store schemas :)). The Avro Object Container Files format has nothing to do with the format described above. KafkaAvroEncoder/Decoder use a slightly different message format (though the messages themselves are encoded exactly the same way, of course).

The main difference between these formats is that Object Container Files carry the actual schema and may contain multiple messages corresponding to that schema, whereas the format described above carries only the schema id and exactly one message corresponding to that schema.
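The two formats are easy to tell apart by their leading bytes, which is a quick way to see the difference concretely. A hypothetical sniffing helper (the function names are mine; the Object Container File magic is the 4 bytes "Obj" followed by 0x01, per the Avro spec):

```python
def looks_like_container_file(data: bytes) -> bool:
    # Avro Object Container Files begin with the 4-byte magic b"Obj\x01"
    return data[:4] == b"Obj\x01"

def looks_like_registry_framing(data: bytes) -> bool:
    # The format described above begins with a single 0x0 magic byte
    # followed by a 4-byte schema id
    return len(data) >= 5 and data[0] == 0x00
```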

Passing object-container-file-encoded messages around would probably be hard to follow/maintain because one Kafka message would then contain multiple Avro messages. Or you could ensure that one Kafka message contains only one Avro message, but that would mean carrying the schema with each message.

Avro schemas can be quite large (I've seen schemas of 600 KB and more), and carrying the schema with each message would be really costly and wasteful, so that is where the schema repository kicks in: the schema is fetched only once, gets cached locally, and all further lookups are just map lookups, which are fast.
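That caching behaviour can be sketched as a plain map in front of the registry call. Everything here is illustrative: `fetch_from_registry` stands in for the real HTTP lookup a client would make.

```python
class SchemaCache:
    """Hypothetical sketch: hit the registry once per schema id,
    then serve every later lookup from a local dict."""

    def __init__(self, fetch_from_registry):
        self._fetch = fetch_from_registry  # stand-in for the registry HTTP call
        self._cache = {}

    def get(self, schema_id):
        if schema_id not in self._cache:
            self._cache[schema_id] = self._fetch(schema_id)
        return self._cache[schema_id]

registry_calls = []

def fake_fetch(schema_id):
    registry_calls.append(schema_id)
    return f"schema-{schema_id}"

cache = SchemaCache(fake_fetch)
cache.get(7)
cache.get(7)
# the registry was contacted only once; the second get is a dict lookup
```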

serejja
  • 22,901
  • 6
  • 64
  • 72
  • Hi serejja, do you know anywhere where the encoding scheme is stated? The spec at https://avro.apache.org/docs/1.7.7/spec.html talks about Object Container Files containing the full schema but I don't think this is the same as you describe. – jheppinstall Jul 03 '15 at 22:25
  • 1
    Thanks @serejja, I guess my question was more like how did the Confluent guys decide to use [magic byte][schema id][actual message] as the message format? did they define it, or is it specified somewhere else? – jheppinstall Jul 06 '15 at 08:49
  • 3
    Hi @serejja, have you encountered a different lib (which is more popular) to handle this issue? I have done a quick review of: https://github.com/linkedin/camus/tree/master/camus-kafka-coders/src/main/java/com/linkedin/camus/etl/kafka/coders and it seems like an interesting source. – user2550587 Oct 28 '15 at 13:15
  • Yes, I'm aware of this library; however, as far as I know it is not possible to integrate Camus with the Confluent Schema Registry – serejja Oct 28 '15 at 13:38
  • 1
    Thanks @serejja for clarification. Though while testing the schema registry, I found a strange behavior. If the same message is sent to two different topics, the schemas are registered separately for both the topics. I was expecting the schemas to be same across multiple topics. – Bhushan Mar 28 '17 at 10:23
  • @serejja Can we extract the schema id from the message – Don Sam Dec 15 '22 at 02:55
  • @serejja Can we extract the schema id from the message – loneStar Feb 07 '23 at 19:29
  • @DonSam Sure, just take the bytes from second to fifth and convert them to an int. – Ilya Serbis Apr 13 '23 at 17:11