Extract all schemaIDs of all messages in a kafka topic without the need to consume all messages

Question

I'm wondering if there's a way (Kafka API/tool) to return a list of the schemaIds used by messages in a topic, from within Kafka and/or Schema Registry.

I have a quick solution that consumes all messages and extracts the IDs outside of Kafka. However, it is time- and resource-consuming.

I think the Schema Registry REST API doesn't expose that, but you can use a workaround to handle it. – Yassine CHABLI Jan 06 '23 at 18:26

score 0 · Answer 1 · answered Jan 06 '23 at 18:37


First Solution:

First, you can request all schemas with:

/schemas

The response is an array of objects, each of which contains a subject field that typically represents your topic name.

{
  "subject": "your target topic",
  "version": "version number",
  "id": "id",
  "schema": {"type": "string"} // the schema that you are looking for
}
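As a rough illustration of how such a response could be reduced to the IDs for one subject, here is a minimal Python sketch. It assumes the response has already been parsed into a list of dicts shaped like the example above; the function name and the sample subjects and IDs are hypothetical, and no HTTP call is made.

```python
from typing import Iterable, List


def schema_ids_for_subject(schemas: Iterable[dict], subject: str) -> List[int]:
    """Return the sorted, de-duplicated schema IDs registered under `subject`."""
    return sorted({entry["id"] for entry in schemas if entry.get("subject") == subject})


# Hypothetical, already-parsed response body (not real registry data).
sample_response = [
    {"subject": "orders-value", "version": 1, "id": 10, "schema": '{"type": "string"}'},
    {"subject": "orders-value", "version": 2, "id": 14, "schema": '{"type": "long"}'},
    {"subject": "payments-value", "version": 1, "id": 11, "schema": '{"type": "string"}'},
]

print(schema_ids_for_subject(sample_response, "orders-value"))  # [10, 14]
```

Note that this only lists what is registered under the subject, which may differ from the IDs actually present in the topic's records.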

Second Solution:

/subjects/${your topic name }/versions/

The response is an array of version IDs, like:

[1,2,3,..]

Then fetch the schema for each version with:

/subjects/${your topic name}/versions/${version} // 1, 2, 3, etc.
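The two requests above could be combined roughly as follows. This is a sketch, not a definitive client: the base URL, the function names, and the injectable `get_json` parameter are assumptions made for illustration, and it relies on each version response carrying an `id` field. With the default TopicNameStrategy the subject is usually `<topic>-key` or `<topic>-value` rather than the bare topic name.

```python
import json
from typing import Callable, List
from urllib.parse import quote
from urllib.request import urlopen


def http_get_json(url: str):
    """Fetch a URL and decode its JSON body (no auth or retries; sketch only)."""
    with urlopen(url) as response:
        return json.load(response)


def schema_ids_for_subject_versions(
    base_url: str,
    subject: str,
    get_json: Callable[[str], object] = http_get_json,
) -> List[int]:
    """List the versions of `subject`, then collect the schema ID of each version."""
    versions_url = f"{base_url.rstrip('/')}/subjects/{quote(subject, safe='')}/versions"
    versions = get_json(versions_url)  # e.g. [1, 2, 3]
    ids = {get_json(f"{versions_url}/{version}")["id"] for version in versions}
    return sorted(ids)


# Example call against a hypothetical local registry (not executed here):
# schema_ids_for_subject_versions("http://localhost:8081", "my-topic-value")
```

Passing `get_json` as a parameter keeps the lookup logic testable without a running registry.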

See the Schema Registry REST API documentation.

answered Jan 06 '23 at 18:37

Yassine CHABLI


`GET /schemas` is not a documented route. It seems you meant `GET /schemas/ids/{id}`, which returns an object, not an array – OneCricketeer Jan 06 '23 at 21:37
Yes, but if you fetch it, you will get all the schemas. I don't know why it's not exposed. – Yassine CHABLI Jan 07 '23 at 17:49
Thanks. The issue I'm facing is that in SR, the schema subject is not the topic name, nor does it follow the TopicNameStrategy, so there's no direct mapping between topic name and subject name. Therefore, querying SR with the known APIs doesn't seem to work in my environment. – Jin Ma Jan 08 '23 at 05:07
@jinma If you can extract IDs from the data in the topic, then you can use the schemas route mentioned here for each ID. That, however, doesn't reverse into getting all subject-versions using that ID. – OneCricketeer Jan 08 '23 at 15:22

score 0 · Accepted Answer · answered Jan 06 '23 at 21:40


Is there a way (Kafka API/tool) to return a list of schemaIds used by messages in a topic from within Kafka

Yes.

kafka-avro-console-consumer ... --property print.schema.ids=true

https://github.com/confluentinc/schema-registry/pull/901

Without the need to consume

No; you need to deserialize at least the first 5 bytes of each record.
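For reference, Confluent serializers frame each record value as one magic byte (0) followed by a 4-byte big-endian schema ID, then the encoded payload. A minimal Python sketch of reading that header might look like the following; it assumes you already have the raw value bytes from a consumer configured with a byte-array deserializer, and the function names are hypothetical.

```python
import struct
from typing import Iterable, Optional, Set

MAGIC_BYTE = 0  # first byte of the Confluent wire format


def extract_schema_id(value: Optional[bytes]) -> Optional[int]:
    """Return the schema ID from the 5-byte header, or None if the header is absent."""
    if value is None or len(value) < 5:
        return None  # tombstone or too short to carry a header
    magic, schema_id = struct.unpack(">BI", value[:5])
    if magic != MAGIC_BYTE:
        return None  # not written by a Schema Registry serializer
    return schema_id


def unique_schema_ids(values: Iterable[Optional[bytes]]) -> Set[int]:
    """Collect the distinct schema IDs seen across raw record values."""
    ids = (extract_schema_id(value) for value in values)
    return {schema_id for schema_id in ids if schema_id is not None}


print(extract_schema_id(b"\x00\x00\x00\x00\x2a" + b"avro-payload"))  # 42
```

Keeping the IDs in a set means memory grows with the number of distinct IDs rather than with the number of records, although the consumer still has to fetch every record to see its header.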

The other answer shows what is in the registry, which is not necessarily what currently exists in the topic.

answered Jan 06 '23 at 21:40

OneCricketeer


Thanks. That's what I did in my code: poll records and extract the first 5 bytes of each record. I'm wondering if there's a more efficient way to achieve the same. – Jin Ma Jan 08 '23 at 05:10
Thanks, answer accepted. It seems that what I have done so far is the best solution. I'm polling the Avro-format messages with a 1s poll timeout and extracting only the first 5 bytes. I'd assume this won't have a big impact on Kafka, nor cause OOM on my application side when there is a large amount of data in the topic. – Jin Ma Jan 09 '23 at 15:15
If you're simply printing the ID, then no. But if you have millions of unique IDs and are storing them in a list, then you might end up with OOM – OneCricketeer Jan 09 '23 at 15:17
