I'm attempting to index messages into Elasticsearch using only SMTs (Single Message Transforms) from Kafka's Connect API.

So far I've had luck with simply using the topic and timestamp router functionality. However, now I'd like to create separate indices based on a certain field in the message.

Suppose the messages are formatted as such:

{"productId": 1, "category": "boat", "price": 135000}
{"productId": 1, "category": "helicopter", "price": 300000}
{"productId": 1, "category": "car", "price": 25000}

Would it somehow be possible to index these to the following indices based on product category?

  • product-boat
  • product-helicopter
  • product-car

Or would I have to create a separate topic for every single category (knowing that there could be hundreds or thousands of them)?

Am I overlooking a transform that could do this, or is this simply not possible, meaning a custom component will have to be built?

2 Answers

There's nothing out of the box with Kafka Connect that will do this. You have a few options:

  1. The Elasticsearch sink connector routes messages to a target index based on the record's topic, so you could write a custom SMT that inspects each message and rewrites its topic accordingly (see the SMT sketch after this list)
  2. Use a stream processor, such as Kafka Streams or KSQL, to pre-process the messages so that they're already on different topics by the time the Elasticsearch sink connector consumes them.
    • With KSQL you would need to hard-code each category (CREATE STREAM product-boat AS SELECT * FROM messages WHERE category='boat'; etc.)
    • Kafka Streams now has Dynamic Routing (KIP-303), which is a more flexible way of doing it (see the Streams sketch below)
  3. Hand-code a bespoke Elasticsearch sink connector with the routing logic built in, directing messages to indices based on their contents. This feels like the worst of the three approaches, IMO.
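
For option 1, here's a minimal sketch of such an SMT, assuming the record value is a Connect Struct with a string "category" field. The class name and package are hypothetical, and error handling is omitted:

package com.example;

import java.util.Map;

import org.apache.kafka.common.config.ConfigDef;
import org.apache.kafka.connect.connector.ConnectRecord;
import org.apache.kafka.connect.data.Struct;
import org.apache.kafka.connect.transforms.Transformation;

// Sketch: rewrites the record's topic to "product-<category>", which the
// Elasticsearch sink connector then uses as the index name.
// Assumes the value is a Struct containing a string field "category".
public class CategoryRouter<R extends ConnectRecord<R>> implements Transformation<R> {

    @Override
    public R apply(R record) {
        Struct value = (Struct) record.value();
        String category = value.getString("category");
        // newRecord() copies the record, substituting the new topic name
        return record.newRecord(
                "product-" + category,
                record.kafkaPartition(),
                record.keySchema(), record.key(),
                record.valueSchema(), record.value(),
                record.timestamp());
    }

    @Override
    public ConfigDef config() {
        return new ConfigDef();  // no configuration options in this sketch
    }

    @Override
    public void configure(Map<String, ?> configs) {
    }

    @Override
    public void close() {
    }
}

You'd package that as a JAR, drop it on the Connect worker's plugin path, and reference it from the connector config (e.g. "transforms.route.type": "com.example.CategoryRouter").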
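
For option 2 with Kafka Streams, KIP-303 (Kafka 2.0+) lets to() take a TopicNameExtractor, so the target topic is computed per record. Here's a sketch with a hypothetical Product value type; note that the dynamically named topics must already exist, since Kafka Streams won't create them for you:

import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.KStream;

public class CategoryRoutingTopology {

    // Hypothetical value type; in practice this comes from your serde.
    public static class Product {
        public int productId;
        public String category;
        public int price;
    }

    public static Topology build() {
        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, Product> products = builder.stream("products");

        // KIP-303: to() accepts a TopicNameExtractor, so the target topic
        // is derived from each record's category field.
        products.to((key, value, recordContext) -> "product-" + value.category);

        return builder.build();
    }
}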
Robin Moffatt

If you are using Confluent Platform, you can do this kind of routing based on a field value in the message.

To do that you have to use the ExtractTopic SMT from Confluent. More details regarding that SMT can be found at https://docs.confluent.io/current/connect/transforms/extracttopic.html#extracttopic

A Kafka sink connector processes messages that are represented as SinkRecords. Each SinkRecord contains several fields: topic, partition, value, key, etc. Those fields are set by Kafka Connect, and using transformations you can change their values. The ExtractTopic SMT changes the value of the topic based on the value or key of the message.

The transformation configuration will look something like this:

{
...
    "transforms": "ExtractTopic",
    "transforms.ExtractTopic.type": "io.confluent.connect.transforms.ExtractTopic$Value",
    "transforms.ExtractTopic.field": "name",  <-- name of field, that value will be used as index name
...
}
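
With the messages from the question the field would be "category", which on its own would yield indices named boat, helicopter, and so on. To also get the product- prefix, one option (a sketch, untested) is to chain ExtractTopic with the stock RegexRouter SMT, which rewrites the topic with a regular expression; transforms are applied in the order they are listed:

{
...
    "transforms": "ExtractTopic,AddPrefix",
    "transforms.ExtractTopic.type": "io.confluent.connect.transforms.ExtractTopic$Value",
    "transforms.ExtractTopic.field": "category",
    "transforms.AddPrefix.type": "org.apache.kafka.connect.transforms.RegexRouter",
    "transforms.AddPrefix.regex": "(.*)",
    "transforms.AddPrefix.replacement": "product-$1"
...
}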

One limitation is that you have to create the indices in advance.

I assume you are using the Elasticsearch Sink Connector. The Elasticsearch connector has the ability to create indices, but it does so when it opens writers for a particular partition (ElasticsearchSinkTask::open). In your use case the indices can't all be created at that moment, because the values of the messages that will arrive are not yet known.

Maybe it isn't the purest approach, because ExtractTopic is rather meant for source connectors, but in your case it might work.

Bartosz Wardziński