
Use case: I have messages with a messageId, and multiple messages can have the same messageId. These messages flow through a streaming pipeline (like Kafka) partitioned by messageId, so I am making sure all messages with the same messageId go to the same partition.

So I need to write a job that buffers messages for some time (let's say 1 minute) and, after that time, combines all messages with the same messageId into a single large message.

I am thinking this can be done using Spark Datasets and Spark SQL (or something else?), but I could not find any example or documentation on how to hold messages for some time for a given messageId and then aggregate them.
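
To make the desired output concrete, here is a rough non-streaming sketch of the combine step I have in mind (Message and the payload-concatenation merge rule are just placeholders for my real types):

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Placeholder for my real message type.
class Message {
    final String messageId;
    final String payload;
    Message(String messageId, String payload) {
        this.messageId = messageId;
        this.payload = payload;
    }
    String getMessageId() { return messageId; }
}

class Combiner {
    // Combine all buffered messages sharing a messageId into one large message.
    static Map<String, Message> combineByMessageId(List<Message> buffered) {
        Map<String, Message> out = new HashMap<>();
        for (Message m : buffered) {
            out.merge(m.getMessageId(), m,
                // placeholder merge rule: concatenate payloads
                (a, b) -> new Message(a.messageId, a.payload + b.payload));
        }
        return out;
    }
}

The open question is where this buffering and combining should live inside Spark.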

Amit Kumar
  • What kind of aggregation are you thinking of? Do you want an aggregate value (like a sum), or do you want a message of messages? – kellanburket Feb 10 '18 at 15:46
  • A message of messages: assume you have 10 messages with the same message id; my result should be 1 big message containing all 10 inside it. Hope that clears it up. – Amit Kumar Feb 10 '18 at 16:55

1 Answer


I think what you're looking for is Spark Streaming. Spark has a Kafka Connector that can link into a Spark Streaming Context.

Here's a really basic example that creates an RDD for all messages in a given set of topics over a 1-minute interval, then groups them by a message id field (the type produced by your value deserializer would have to expose such a getMessageId method, of course).

import java.util.*;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka010.ConsumerStrategies;
import org.apache.spark.streaming.kafka010.KafkaUtils;
import org.apache.spark.streaming.kafka010.LocationStrategies;

SparkConf conf = new SparkConf().setAppName(appName);
// A 1-minute batch interval: each RDD covers one minute of messages.
JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.minutes(1));

Map<String, Object> params = new HashMap<String, Object>() {{
    put("bootstrap.servers", kafkaServers);
    put("key.deserializer", kafkaKeyDeserializer);
    put("value.deserializer", kafkaValueDeserializer);
    put("group.id", consumerGroupId); // the Kafka 0.10 consumer requires a group id
}};

List<String> topics = new ArrayList<String>() {{
    // Add Topics
}};

// Message stands in for whatever type your value deserializer produces;
// it must expose getMessageId().
JavaInputDStream<ConsumerRecord<String, Message>> stream =
    KafkaUtils.createDirectStream(ssc,
        LocationStrategies.PreferConsistent(),
        ConsumerStrategies.<String, Message>Subscribe(topics, params)
    );

stream.foreachRDD(rdd ->
    rdd.groupBy(record -> record.value().getMessageId())
       // groupBy is lazy; an action is needed to run the grouping
       .foreach(group -> {
           // group._1() is the message id, group._2() the grouped records:
           // combine them into your single large message here
       }));

ssc.start();
ssc.awaitTermination();

There are several other ways to group the messages within the streaming API; look at the documentation for more examples.
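
For instance, a sketch of one alternative (my addition, assuming the same hypothetical Message value type as above): key the stream with mapToPair and group with groupByKey, so the grouping is declared on the DStream itself rather than inside foreachRDD.

import org.apache.spark.streaming.api.java.JavaPairDStream;
import scala.Tuple2;

// Key each record by its message id, then group within each 1-minute batch.
JavaPairDStream<String, Iterable<ConsumerRecord<String, Message>>> grouped =
    stream.mapToPair(record -> new Tuple2<>(record.value().getMessageId(), record))
          .groupByKey();

grouped.foreachRDD(rdd ->
    rdd.foreach(entry -> {
        // entry._1() is the message id; entry._2() holds all of that id's
        // records for the batch: combine them into one large message here.
    }));

Note that with either approach the 1-minute buffering comes from the batch interval; if you need a window longer than the batch, look at window/reduceByKeyAndWindow in the docs.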

kellanburket