
I'm using Python to read messages coming from various topics. Some topics have their messages encoded in plain JSON, while others use Avro binary serialization with Confluent Schema Registry.

When I receive a message I need to know whether it has to be decoded. At the moment I'm only relying on the fact that binary-encoded messages start with a magic byte whose value is zero:

from confluent_kafka import Consumer

consumer = Consumer(config)
consumer.subscribe(...)
msg = consumer.poll()
# check that msg is not None, msg.error() is None, etc.
if msg.value()[0] == 0:
    ...  # it is Avro binary encoded
else:
    ...  # it is JSON

I was wondering if there's a better way to do that?

Phantômaxx
0x26res

2 Answers


You can query the schema registry through its REST API first and build a local cache of the topics that are registered there. Then, when you're trying to decode a message from a particular topic, simply compare the topic against that cache. If it's there, you know the message has to be decoded.

Of course, this only works if all the topics that are Avro encoded are using Schema Registry. If you ever receive an Avro-encoded message that is not registered with Schema Registry, then it won't work.
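A minimal sketch of that cache, assuming the registry follows Confluent's REST API (`GET /subjects`) and producers use the default TopicNameStrategy, which registers subjects as `<topic>-key` / `<topic>-value`; the registry URL is a placeholder:

```python
import json
from urllib.request import urlopen

REGISTRY_URL = "http://localhost:8081"  # placeholder

def fetch_registered_topics(registry_url=REGISTRY_URL):
    """Fetch all subjects once and derive the set of Avro topics."""
    with urlopen(f"{registry_url}/subjects") as resp:
        subjects = json.load(resp)
    return topics_from_subjects(subjects)

def topics_from_subjects(subjects):
    """Strip the -key/-value suffix added by TopicNameStrategy."""
    return {s.rsplit("-", 1)[0] for s in subjects if s.endswith(("-key", "-value"))}

def needs_avro_decoding(topic, registered_topics):
    """A topic present in the cache is assumed to carry Avro payloads."""
    return topic in registered_topics
```

You'd call `fetch_registered_topics()` once at startup (and perhaps refresh it periodically), then do a cheap set lookup per message.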

mjuarez

You could take the first 5 bytes of your message, then

magic_byte = message_bytes[0]   # should be 0
schema_id = message_bytes[1:5]  # 4-byte big-endian schema id

Then, perform a lookup against your registry with GET /schemas/ids/{schema_id}, and cache the ID and schema (if needed) when you get a 200 response code.

Otherwise, the message is either JSON, or the producer sent its data to a different registry (if there is more than one in your environment). Note: in that case the data could still be Avro.
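A sketch of that check, assuming the standard Confluent wire format (one magic byte followed by a 4-byte big-endian schema id); the registry URL and the cache are illustrative:

```python
import json
import struct
from urllib.request import urlopen
from urllib.error import HTTPError

MAGIC_BYTE = 0

def parse_wire_header(message_bytes):
    """Return the schema id if the payload looks like Confluent-framed Avro,
    otherwise None (JSON text never starts with a zero byte)."""
    if len(message_bytes) < 5 or message_bytes[0] != MAGIC_BYTE:
        return None
    (schema_id,) = struct.unpack(">I", message_bytes[1:5])
    return schema_id

def schema_exists(schema_id, registry_url="http://localhost:8081", _cache={}):
    """Look the id up via GET /schemas/ids/{id}, caching positive answers."""
    if schema_id in _cache:
        return True
    try:
        with urlopen(f"{registry_url}/schemas/ids/{schema_id}") as resp:
            _cache[schema_id] = json.load(resp)["schema"]
        return True
    except HTTPError:
        return False
```

If `parse_wire_header` returns an id and `schema_exists` confirms it, decode as Avro; if it returns None, treat the payload as JSON.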

OneCricketeer