I'm attempting to index messages into Elasticsearch using only SMTs (Single Message Transforms) from Kafka's Connect API.

So far I've had luck with simply using the topic and timestamp router functionality. However, now I'd like to create separate indices based on a certain field in the message.

Suppose the messages are formatted as such:

{"productId": 1, "category": "boat", "price": 135000}
{"productId": 1, "category": "helicopter", "price": 300000}
{"productId": 1, "category": "car", "price": 25000}

Would it somehow be possible to index these to the following indices based on product category?

  • product-boat
  • product-helicopter
  • product-car

Or would I have to create a separate topic for every single category (knowing that there could be hundreds or thousands of them)?

Am I overlooking a transform that could do this, or is this simply not possible, meaning a custom component will have to be built?

2 Answers

There's nothing out of the box with Kafka Connect that will do this. You have a few options:

  1. The Elasticsearch sink connector routes messages to a target index based on the record's topic, so you could write a custom SMT that inspects each message and rewrites its topic accordingly (see the SMT sketch after this list)
  2. Use a stream processor, such as Kafka Streams or KSQL, to pre-process the messages so that they're already on different topics by the time the Elasticsearch sink connector consumes them.
    • With KSQL you would need to hard-code each category (CREATE STREAM product-boat AS SELECT * FROM messages WHERE category='boat'; etc.)
    • Kafka Streams now has Dynamic Routing (KIP-303), which is a more flexible way of doing it (see the Streams sketch below)
  3. Hand-code a bespoke Elasticsearch sink connector with the routing logic built in, directing messages to indices based on their contents. This feels like the worst of the three approaches, IMO.
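
For option 1, here's a minimal sketch of such an SMT, assuming the record value is a Connect Struct with a string "category" field. The class name and package are hypothetical, and error handling is omitted:

package com.example;

import java.util.Map;

import org.apache.kafka.common.config.ConfigDef;
import org.apache.kafka.connect.connector.ConnectRecord;
import org.apache.kafka.connect.data.Struct;
import org.apache.kafka.connect.transforms.Transformation;

// Sketch: rewrites the record's topic to "product-<category>", which the
// Elasticsearch sink connector then uses as the index name.
// Assumes the value is a Struct containing a string field "category".
public class CategoryRouter<R extends ConnectRecord<R>> implements Transformation<R> {

    @Override
    public R apply(R record) {
        Struct value = (Struct) record.value();
        String category = value.getString("category");
        // newRecord() copies the record, substituting the new topic name
        return record.newRecord(
                "product-" + category,
                record.kafkaPartition(),
                record.keySchema(), record.key(),
                record.valueSchema(), record.value(),
                record.timestamp());
    }

    @Override
    public ConfigDef config() {
        return new ConfigDef();  // no configuration options in this sketch
    }

    @Override
    public void configure(Map<String, ?> configs) {
    }

    @Override
    public void close() {
    }
}

You'd package that as a JAR, drop it on the Connect worker's plugin path, and reference it from the connector config (e.g. "transforms.route.type": "com.example.CategoryRouter").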
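
For option 2 with Kafka Streams, KIP-303 (Kafka 2.0+) lets to() take a TopicNameExtractor, so the target topic is computed per record. Here's a sketch with a hypothetical Product value type; note that the dynamically named topics must already exist, since Kafka Streams won't create them for you:

import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.KStream;

public class CategoryRoutingTopology {

    // Hypothetical value type; in practice this comes from your serde.
    public static class Product {
        public int productId;
        public String category;
        public int price;
    }

    public static Topology build() {
        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, Product> products = builder.stream("products");

        // KIP-303: to() accepts a TopicNameExtractor, so the target topic
        // is derived from each record's category field.
        products.to((key, value, recordContext) -> "product-" + value.category);

        return builder.build();
    }
}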
Robin Moffatt

If you are using Confluent Platform, you can do this kind of routing based on a field value in the message.

To do that you have to use the ExtractTopic SMT from Confluent. More details regarding that SMT can be found at https://docs.confluent.io/current/connect/transforms/extracttopic.html#extracttopic

A Kafka sink connector processes messages that are represented as SinkRecords. Each SinkRecord contains several fields: topic, partition, value, key, etc. Those fields are set by Kafka Connect, and using transformations you can change their values. The ExtractTopic SMT changes the value of the topic based on the value or key of the message.

The transformation configuration will look something like this:

{
...
    "transforms": "ExtractTopic",
    "transforms.ExtractTopic.type": "io.confluent.connect.transforms.ExtractTopic$Value",
    "transforms.ExtractTopic.field": "name",  <-- name of field, that value will be used as index name
...
}
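
With the messages from the question the field would be "category", which on its own would yield indices named boat, helicopter, and so on. To also get the product- prefix, one option (a sketch, untested) is to chain ExtractTopic with the stock RegexRouter SMT, which rewrites the topic with a regular expression; transforms are applied in the order they are listed:

{
...
    "transforms": "ExtractTopic,AddPrefix",
    "transforms.ExtractTopic.type": "io.confluent.connect.transforms.ExtractTopic$Value",
    "transforms.ExtractTopic.field": "category",
    "transforms.AddPrefix.type": "org.apache.kafka.connect.transforms.RegexRouter",
    "transforms.AddPrefix.regex": "(.*)",
    "transforms.AddPrefix.replacement": "product-$1"
...
}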

One limitation is that you have to create the indices in advance.

I assume you are using the Elasticsearch Sink Connector. The Elasticsearch connector has the ability to create indices, but it does so when it opens writers for a particular partition (ElasticsearchSinkTask::open). In your use case the indices can't all be created at that moment, because the values of the messages that will arrive are not yet known.

Maybe it isn't the purest approach, because ExtractTopic is rather meant for source connectors, but in your case it might work.

Bartosz Wardziński