Apologies if this is already covered in Kafka's documentation or a guide; I'd be thankful if somebody could point me to it. I've found a lot of documents and articles that cover the basics of using Avro with Kafka and the Schema Registry, but I've struggled to find strategies or patterns for organizing schemas that are used in multiple places.

Consider the following scenario: you're building data processing pipelines using Kafka, Kafka Streams, and KSQL. Along the way, you find yourself wanting reusable logic and data structures, so you create some data structures that will be used in multiple topics. For example, your pipeline processes a lot of records about people, so you want to create a Person schema like the following to use in multiple topics and in other schemas:

{
  "type": "record",
  "name": "Person",
  "fields": [
    {
      "name": "first_name",
      "type": "string"
    },
    {
      "name": "last_name",
      "type": "string"
    }
  ]
}

You want to use this schema in several topics, such as PeopleWithAccounts and PeopleWhoBoughtItemX. You also want to use it inside other schemas, like:

{
  "type": "record",
  "name": "Order",
  "fields": [
    {
      "name": "itemId",
      "type": "int"
    },
    {
      "name": "purchaser",
      "type": "Person"
    }
  ]
}

In this scenario, it would be great to be able to define the Person schema independently of any topic, while still having other schemas that use topics as their subjects. Based on the Schema Registry Naming Strategies documentation, it looks like clients can be configured to use either topics or records as the subject, and that the choice applies to every topic and schema the client handles. In this scenario, it would be nice to be able to set that configuration on a per-schema basis instead. The documentation also points out that KSQL, non-Java Kafka clients, and other tools only work with the TopicNameStrategy, which suggests that this strategy has to be used for messages in any topic those tools or clients will touch.
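
To make the naming strategies concrete, here is a minimal Python sketch of the subject each one derives, based on my understanding of the three strategies named in the Schema Registry documentation. The helper `subject_name` is hypothetical (it is not part of any client library), and the records here have no namespace, so the fully qualified record name is just `Person`.

```python
def subject_name(strategy: str, topic: str, record_fullname: str, is_key: bool = False) -> str:
    """Return the subject a serializer would register under for a given strategy.

    Hypothetical helper for illustration only; real clients pick the strategy
    through serializer configuration rather than a function call like this.
    """
    if strategy == "TopicNameStrategy":
        # Subject is tied to the topic: "<topic>-key" or "<topic>-value".
        return f"{topic}-{'key' if is_key else 'value'}"
    if strategy == "RecordNameStrategy":
        # Subject is the fully qualified record name, independent of the topic.
        return record_fullname
    if strategy == "TopicRecordNameStrategy":
        # Subject combines the topic and the fully qualified record name.
        return f"{topic}-{record_fullname}"
    raise ValueError(f"unknown strategy: {strategy}")


for strategy in ("TopicNameStrategy", "RecordNameStrategy", "TopicRecordNameStrategy"):
    print(strategy, "->", subject_name(strategy, "PeopleWithAccounts", "Person"))
```

Under TopicNameStrategy, the same Person record written to PeopleWithAccounts and PeopleWhoBoughtItemX ends up under two separate subjects, which is what makes sharing one definition awkward.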

All this leads me to think that the only reasonable solution for a "shared schema" is to define the shared portions (e.g. the Person type) in the schema of every topic that uses them. Does this sound like a reasonable conclusion? Are there tools available that make it easier to define a "shared schema" and include it in other schemas?
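
For illustration, a minimal Python sketch of that "define it in every topic" approach: keep Person in one place and inline it into each topic's schema before registering. The function `inline_named_types` is a hypothetical helper using only the standard library, not an existing tool. It assumes the Avro rule that a named type is defined in full at its first use and referred to by name afterwards.

```python
import copy
import json

PERSON = {
    "type": "record",
    "name": "Person",
    "fields": [
        {"name": "first_name", "type": "string"},
        {"name": "last_name", "type": "string"},
    ],
}

ORDER = {
    "type": "record",
    "name": "Order",
    "fields": [
        {"name": "itemId", "type": "int"},
        {"name": "purchaser", "type": "Person"},
    ],
}


def inline_named_types(node, shared, seen=None):
    """Return a copy of `node` with the first reference to each shared type
    replaced by its full definition; later references stay as plain names."""
    if seen is None:
        seen = set()
    if isinstance(node, str):
        if node in shared and node not in seen:
            seen.add(node)
            return inline_named_types(copy.deepcopy(shared[node]), shared, seen)
        return node
    if isinstance(node, list):  # a union: process each branch in order
        return [inline_named_types(branch, shared, seen) for branch in node]
    if isinstance(node, dict):
        out = dict(node)
        kind = out.get("type")
        if "name" in out and kind in ("record", "enum", "fixed"):
            seen.add(out["name"])  # this name is now defined in the output
        if kind == "record":
            out["fields"] = [
                {**field, "type": inline_named_types(field["type"], shared, seen)}
                for field in out.get("fields", [])
            ]
        elif kind == "array":
            out["items"] = inline_named_types(out["items"], shared, seen)
        elif kind == "map":
            out["values"] = inline_named_types(out["values"], shared, seen)
        elif kind is not None:
            out["type"] = inline_named_types(kind, shared, seen)
        return out
    return node


order_for_topic = inline_named_types(ORDER, {"Person": PERSON})
print(json.dumps(order_for_topic, indent=2))
```

The output is a self-contained Order schema that can be registered under a topic subject, at the cost of Person being duplicated in every subject that embeds it.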

tl;dr: Are there any patterns/strategies/best practices for organizing Avro schemas that will be used by multiple topics, either as the top-level schema of those topics or as fields within other schemas?

OneCricketeer
Bill Bushey
    Personally, I use Avro IDL, which [supports `import` statements](https://avro.apache.org/docs/1.8.2/idl.html#imports) – OneCricketeer Oct 04 '19 at 19:24
    I've also used Avro IDL, which allows you to define all the schemas in one place, but effectively still results in multiple, separate top-level schemas in which Person is embedded in each. While this has worked, I've been keeping an eye out for a better solution. Thank you for asking this question. I'm curious to see how others have tackled this. – blachniet Oct 05 '19 at 23:03
    Thank you for the pointers to Avro IDL. I don't know how I didn't come across that before; it definitely looks like a useful tool for this need. – Bill Bushey Oct 14 '19 at 21:57

1 Answer

If you want schemas to inherit from other schemas, you can use Schema References in Confluent Cloud to accomplish that. There's also a blog post that goes into more detail (although the use-case is a little different: using Avro references to create multiple types of events in a single topic).
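
As a rough sketch of what that looks like in practice, the Python below builds the JSON bodies you would send to the Schema Registry REST endpoint `POST /subjects/<subject>/versions`, first for Person and then for Order with a reference to it. The subject names (`Person`, `orders-value`), the version number, and the helper `registration_body` are assumptions made for illustration; the code only builds the payloads and does not call a registry.

```python
import json

PERSON_SCHEMA = {
    "type": "record",
    "name": "Person",
    "fields": [
        {"name": "first_name", "type": "string"},
        {"name": "last_name", "type": "string"},
    ],
}

# Order refers to Person by name only; the registry resolves it via the reference.
ORDER_SCHEMA = {
    "type": "record",
    "name": "Order",
    "fields": [
        {"name": "itemId", "type": "int"},
        {"name": "purchaser", "type": "Person"},
    ],
}


def registration_body(schema, references=()):
    """Build the request body for registering `schema` (hypothetical helper).

    `references` is an iterable of (name, subject, version) tuples, where
    `name` is the type name used inside the schema and `subject`/`version`
    identify the already registered schema that defines it.
    """
    body = {"schemaType": "AVRO", "schema": json.dumps(schema)}
    if references:
        body["references"] = [
            {"name": name, "subject": subject, "version": version}
            for name, subject, version in references
        ]
    return body


# Assumed flow: register Person under its own subject first...
person_body = registration_body(PERSON_SCHEMA)
# ...then register Order under a topic subject, pointing at Person version 1.
order_body = registration_body(ORDER_SCHEMA, [("Person", "Person", 1)])

print(json.dumps(order_body, indent=2))
```

With this layout, Person lives in a single subject and each topic-level schema points at it, rather than carrying its own embedded copy.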