
I have a setup where each Kafka message contains a "sender" field. All these messages are sent to a single topic.

Is there a way to segregate these messages on the consumer side? I would like a sender-specific consumer that reads only the messages pertaining to that sender.

Should I be using Kafka Streams to achieve this? I am new to Kafka Streams, so any advice or guidance would be helpful.

import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.Serde;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.ForeachAction;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KStreamBuilder;
import org.apache.kafka.streams.kstream.KeyValueMapper;
import org.json.JSONException;
import org.json.JSONObject;

public class KafkaStreams3 {

    public static void main(String[] args) throws JSONException {

        // Streams configuration
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "kafkastreams1");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");

        final Serde<String> stringSerde = Serdes.String();

        // Plain producer used to write the re-keyed records to sender-specific topics
        Properties kafkaProperties = new Properties();
        kafkaProperties.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        kafkaProperties.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        kafkaProperties.put("bootstrap.servers", "localhost:9092");

        final KafkaProducer<String, String> producer = new KafkaProducer<>(kafkaProperties);

        KStreamBuilder builder = new KStreamBuilder();

        KStream<String, String> source = builder.stream(stringSerde, stringSerde, "topic1");

        // Re-key each record by the "sender" field extracted from the JSON value
        KStream<String, String> s1 = source.map(new KeyValueMapper<String, String, KeyValue<String, String>>() {
            @Override
            public KeyValue<String, String> apply(String dummy, String record) {
                try {
                    JSONObject jsonObject = new JSONObject(record);
                    return new KeyValue<>(jsonObject.get("sender").toString(), record);
                } catch (JSONException e) {
                    e.printStackTrace();
                    return new KeyValue<>(record, record);
                }
            }
        });

        s1.print();

        // Forward each re-keyed record to a topic named after its sender
        s1.foreach(new ForeachAction<String, String>() {
            @Override
            public void apply(String key, String value) {
                ProducerRecord<String, String> data1 = new ProducerRecord<>(key, key, value);
                producer.send(data1);
            }
        });

        final KafkaStreams streams = new KafkaStreams(builder, props);
        streams.start();

        Runtime.getRuntime().addShutdownHook(new Thread(new Runnable() {
            @Override
            public void run() {
                streams.close();
                producer.close();
            }
        }));
    }
}

1 Answer


I believe the simplest way to achieve this is to use your "sender" field as the key and to have a single topic partitioned by "sender". This gives you locality and ordering per "sender", so you get a stronger ordering guarantee per "sender", and you can connect clients that consume from specific partitions.
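For illustration, here is a minimal sketch of that first approach. The topic name "topic1", the sender value "sender-42", and the hard-coded partition number are assumptions for the example; with the default partitioner the partition is derived from a hash of the record key, so all messages of one sender land in the same partition.

import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.StringSerializer;

public class SenderPartitionSketch {

    public static void main(String[] args) {
        // Keying each record by "sender" lets the default partitioner
        // route all messages of one sender to the same partition.
        Properties producerProps = new Properties();
        producerProps.put("bootstrap.servers", "localhost:9092");
        producerProps.put("key.serializer", StringSerializer.class.getName());
        producerProps.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {
            producer.send(new ProducerRecord<>("topic1", "sender-42",
                    "{\"sender\":\"sender-42\",\"payload\":\"hello\"}"));
        }

        // A consumer can then pin itself to the partition that holds a given sender.
        Properties consumerProps = new Properties();
        consumerProps.put("bootstrap.servers", "localhost:9092");
        consumerProps.put("group.id", "sender-42-reader");
        consumerProps.put("key.deserializer", StringDeserializer.class.getName());
        consumerProps.put("value.deserializer", StringDeserializer.class.getName());

        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps);
        int partitionForSender = 3; // whichever partition "sender-42" hashes to
        consumer.assign(Collections.singletonList(new TopicPartition("topic1", partitionForSender)));
    }
}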

Another possibility is to stream the messages from the initial topic to other topics, grouping them by key, so you end up with one topic per "sender".
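A rough sketch of that second option, using the same old-style KStreamBuilder DSL as the code in the question; the sender names "senderA"/"senderB", the "unknown" fallback key, and the "topic-" prefix are assumptions for illustration:

    // re-key every record by its "sender" field, then fan out to one topic per known sender
    KStreamBuilder fanOutBuilder = new KStreamBuilder();
    KStream<String, String> all = fanOutBuilder.stream(Serdes.String(), Serdes.String(), "topic1");

    KStream<String, String> keyedBySender = all.map((key, value) -> {
        try {
            return new KeyValue<String, String>(new JSONObject(value).getString("sender"), value);
        } catch (JSONException e) {
            return new KeyValue<String, String>("unknown", value);  // fallback key for malformed records
        }
    });

    for (String sender : Arrays.asList("senderA", "senderB")) {
        keyedBySender.filter((key, value) -> sender.equals(key))
                     .to(Serdes.String(), Serdes.String(), "topic-" + sender);
    }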

Here is a fragment of code for a producer, followed by a stream, using JSON serializers and deserializers.

Producer:

private Properties kafkaClientProperties() {
    Properties properties = new Properties();

    final Serializer<JsonNode> jsonSerializer = new JsonSerializer();

    properties.put("bootstrap.servers", config.getHost());
    properties.put("client.id", clientId);
    properties.put("key.serializer", StringSerializer.class);
    properties.put("value.serializer", jsonSerializer.getClass());

    return properties;
} 

public Future<RecordMetadata> send(String topic, String key, Object instance) {
    ObjectMapper objectMapper = new ObjectMapper();
    JsonNode jsonNode = objectMapper.convertValue(instance, JsonNode.class);
    return kafkaProducer.send(new ProducerRecord<>(topic, key,
            jsonNode));
}
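For context, a minimal sketch of how this producer and the send method above might be wired together; the kafkaProducer field, the topic name "stockquote-topic", and the payload map are assumptions:

    // assumed wiring: build the producer from the properties above, then send one record
    KafkaProducer<String, JsonNode> kafkaProducer = new KafkaProducer<>(kafkaClientProperties());

    // any POJO or Map is converted to a JsonNode inside send()
    Map<String, Object> quote = new HashMap<>();
    quote.put("symbol", "AAPL");
    quote.put("price", 153.27);
    Future<RecordMetadata> ack = send("stockquote-topic", "AAPL", quote);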

The stream:

log.info("loading kafka stream configuration");
    final Serializer<JsonNode> jsonSerializer = new JsonSerializer();
    final Deserializer<JsonNode> jsonDeserializer = new JsonDeserializer();
    final Serde<JsonNode> jsonSerde = Serdes.serdeFrom(jsonSerializer, jsonDeserializer);

    KStreamBuilder kStreamBuilder = new KStreamBuilder();
    Properties props = new Properties();
    props.put(StreamsConfig.APPLICATION_ID_CONFIG, config.getStreamEnrichProduce().getId());
    props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, hosts);

    //stream from topic...
    KStream<String, JsonNode> stockQuoteRawStream = kStreamBuilder.stream(Serdes.String(), jsonSerde , config.getStockQuote().getTopic());

    Map<String, Map> exchanges = stockExchangeMaps.getExchanges();
    ObjectMapper objectMapper = new ObjectMapper();
    kafkaProducer.configure(config.getStreamEnrichProduce().getTopic());
    // - enrich stockquote with stockdetails before producing to new topic
    stockQuoteRawStream.foreach((key, jsonNode) -> {
        StockQuote stockQuote = null;
        StockDetail stockDetail;
        try {
            stockQuote = objectMapper.treeToValue(jsonNode, StockQuote.class);
        } catch (JsonProcessingException e) {
            e.printStackTrace();
        }
        JsonNode exchangeNode = jsonNode.get("exchange");
        // get stockDetail that matches current quote being processed
        Map<String, StockDetail> stockDetailMap = exchanges.get(exchangeNode.toString().replace("\"", ""));
        stockDetail = stockDetailMap.get(key);
        stockQuote.setStockDetail(stockDetail);
        kafkaProducer.send(config.getStreamEnrichProduce().getTopic(), null, stockQuote);
    });

    return new KafkaStreams(kStreamBuilder, props);
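The returned KafkaStreams instance then just needs to be started by the caller, for example (the enclosing method name buildEnrichStream is assumed):

    // assumed usage of the fragment above
    KafkaStreams streams = buildEnrichStream();
    streams.start();
    Runtime.getRuntime().addShutdownHook(new Thread(streams::close));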
  • Thank you. I am currently trying the second approach, where I aggregate messages into another topic based on the sender. But I get the feeling it is an overhead, since I will have to send the messages to sender-specific topics which the consumer will then have to read from. Rather, why don't I partition per sender, or write to separate topics directly? Is there any advantage that the stream gives us? – user2170956 Jul 05 '17 at 04:24
  • Hi, well, the raw topic (the one with all messages) gives you a source of truth where you have all the messages in case you need to audit something (if you believe some messages were lost for a customer, you can always replay from the initial topic and check). Another advantage is that grouping the raw messages before sending them to Kafka has a bigger chance of out-of-memory errors and consequently of losing messages that might not be persisted anywhere else. Finally, streaming is very optimized, so I think you'll have a hard time making your client-side separation more efficient than a streaming split... – groo Jul 05 '17 at 08:36
  • I have this sample code, but I am stuck on what to do after I perform a groupByKey(). This gives me a KGroupedStream. How do I read the grouped data from it? – user2170956 Jul 05 '17 at 10:53
  • If I understood your initial question correctly, what you want is for each "sender" to read its full log of messages and not only the latest one, correct? If that is the case, you don't want the KGroupedStream; you just want to stream your messages from your map to different topics based on the sender value. You can simply produce to a new topic, or stream to a new topic, based on the "sender" you get from the JSONObject. KGroupedStream and KTables are used to get the latest entry for a specific key: https://www.confluent.io/blog/introducing-kafka-streams-stream-processing-made-simple/ – groo Jul 05 '17 at 13:16
  • https://www.confluent.io/blog/distributed-real-time-joins-and-aggregations-on-user-activity-events-using-kafka-streams/ https://github.com/apache/kafka/tree/trunk/streams/examples/src/main/java/org/apache/kafka/streams/examples – groo Jul 05 '17 at 13:16
  • Thanks @Marcos Maia. I understood my mistake. I also followed some similar topics such as this - https://stackoverflow.com/questions/41796207/dynamically-connecting-a-kafka-input-stream-to-multiple-output-streams#comment70784299_41796207 – user2170956 Jul 06 '17 at 06:32
  • I have modified my code above. It seems to work fine, but can you confirm whether I have done it correctly? – user2170956 Jul 06 '17 at 06:33
  • Looking at your implementation, I would consider storing the initial stream already as JSON, using a JSON Serde, unless there is a specific requirement not to. This would avoid the first step where you apply a map to create a stream with JSON, and hence it would perform better. Other than that it looks OK. I will add a small example of a client using a JSON Serde serializer and deserializer to my post above. – groo Jul 06 '17 at 10:42