Apologies if this is already covered in Kafka's documentation or a guide; if so, I'd be grateful for a pointer. I've found plenty of documents and articles covering the basics of using Avro with Kafka and the Schema Registry, but I've struggled to find strategies or patterns for organizing schemas that are used in multiple places.
Consider the following scenario: you're building data processing pipelines using Kafka, Kafka Streams, and KSQL. Along the way, you want reusable logic and data structures, so you define some schemas that will be used in multiple topics. For example, your pipeline processes a lot of records about people, so you want to create a Person schema like the following to use in multiple topics and in other schemas:
{
  "type": "record",
  "name": "Person",
  "fields": [
    {
      "name": "first_name",
      "type": "string"
    },
    {
      "name": "last_name",
      "type": "string"
    }
  ]
}
You want to use this schema in several topics, such as PeopleWithAccounts and PeopleWhoBoughtItemX. You also want to use it inside other schemas, like:
{
  "type": "record",
  "name": "Order",
  "fields": [
    {
      "name": "itemId",
      "type": "int"
    },
    {
      "name": "purchaser",
      "type": "Person"
    }
  ]
}
In this scenario, it would be great to be able to define the Person schema independently of any topic, while still having schemas that use topics as their subjects. Based on the Schema Registry Naming Strategies documentation, it looks like a client can be configured to use either the topic or the record name as the subject, and that the setting applies to every topic and schema the client handles. In this scenario, though, it would be nice to choose the strategy per schema. The documentation also points out that KSQL, non-Java Kafka clients, and other tools only work with TopicNameStrategy, which suggests that this strategy has to be used for messages in any topic those tools or clients will read.
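To make the naming question concrete, here is a small Python sketch of how I understand the three documented strategies to derive a subject name. The `subject_name` helper is hypothetical and only mimics the naming rules; it is not part of any Schema Registry client API.

```python
def subject_name(strategy, topic, record_fullname, is_key=False):
    """Hypothetical helper: mimic how each naming strategy builds a subject."""
    if strategy == "TopicNameStrategy":
        # Subject depends only on the topic and on key vs. value.
        return f"{topic}-{'key' if is_key else 'value'}"
    if strategy == "RecordNameStrategy":
        # Subject depends only on the fully qualified record name.
        return record_fullname
    if strategy == "TopicRecordNameStrategy":
        # Subject combines the topic and the record name.
        return f"{topic}-{record_fullname}"
    raise ValueError(f"unknown strategy: {strategy}")


# The same Person record used as the value of two topics.
for topic in ("PeopleWithAccounts", "PeopleWhoBoughtItemX"):
    for strategy in ("TopicNameStrategy", "RecordNameStrategy"):
        print(strategy, topic, "->", subject_name(strategy, topic, "Person"))
```

Under TopicNameStrategy the Person schema ends up registered once per topic, under two different subjects; under RecordNameStrategy both topics share a single Person subject.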
All this leads me to think that the only reasonable solution for a "shared schema" is to define the shared portions (e.g. the Person type) in every topic that uses them. Does this sound like a reasonable conclusion? Are there tools available that ease defining a "shared schema" and including it in other schemas?
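To illustrate what I mean by "defining the shared portions in every topic", here is a rough Python sketch of a build step that keeps Person in one place and inlines it into each schema before registration. The `inline_shared` function is a hypothetical helper of my own, not an existing tool. It ignores namespaces, and it expands only the first reference to each shared type, since Avro allows a named type to be defined only once per schema.

```python
import json

PERSON = {
    "type": "record",
    "name": "Person",
    "fields": [
        {"name": "first_name", "type": "string"},
        {"name": "last_name", "type": "string"},
    ],
}

ORDER = {
    "type": "record",
    "name": "Order",
    "fields": [
        {"name": "itemId", "type": "int"},
        {"name": "purchaser", "type": "Person"},
    ],
}


def inline_shared(schema, shared, seen=None):
    """Hypothetical helper: return a copy of `schema` in which the first
    reference to each shared named type is replaced by its definition."""
    seen = set() if seen is None else seen
    if isinstance(schema, str):
        if schema in shared and schema not in seen:
            seen.add(schema)
            return inline_shared(shared[schema], shared, seen)
        return schema  # primitive, or a type that is already defined
    if isinstance(schema, list):  # union
        return [inline_shared(s, shared, seen) for s in schema]
    out = dict(schema)  # shallow copy so the inputs are not mutated
    if out.get("type") == "record":
        seen.add(out["name"])
        out["fields"] = [
            {**f, "type": inline_shared(f["type"], shared, seen)}
            for f in out["fields"]
        ]
    elif out.get("type") == "array":
        out["items"] = inline_shared(out["items"], shared, seen)
    elif out.get("type") == "map":
        out["values"] = inline_shared(out["values"], shared, seen)
    return out


resolved_order = inline_shared(ORDER, {"Person": PERSON})
print(json.dumps(resolved_order, indent=2))
```

The resolved Order schema is self-contained, so it could be registered under a topic-based subject without the registry needing to know about Person separately. The cost is that Person is duplicated in every subject that uses it.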
tl;dr: Are there any patterns/strategies/best practices for organizing Avro schemas that will be used by multiple topics, either as the top-level schema of those topics or as fields within other schemas?