
Use case: I have messages with a messageId, and multiple messages can have the same messageId. These messages flow through a streaming pipeline (like Kafka) partitioned by messageId, so I am making sure all messages with the same messageId go to the same partition.

So I need to write a job that buffers messages for some time (let's say 1 minute) and, after that time, combines all messages with the same messageId into a single large message.

I am thinking this can be done using Spark Datasets and Spark SQL (or something else?), but I could not find any example or documentation on how to hold messages for some time for a given messageId and then aggregate them.
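
To make the desired output concrete, here is a rough non-streaming sketch of the combine step I have in mind (Message and the payload-concatenation merge rule are just placeholders for my real types):

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Placeholder for my real message type.
class Message {
    final String messageId;
    final String payload;
    Message(String messageId, String payload) {
        this.messageId = messageId;
        this.payload = payload;
    }
    String getMessageId() { return messageId; }
}

class Combiner {
    // Combine all buffered messages sharing a messageId into one large message.
    static Map<String, Message> combineByMessageId(List<Message> buffered) {
        Map<String, Message> out = new HashMap<>();
        for (Message m : buffered) {
            out.merge(m.getMessageId(), m,
                // placeholder merge rule: concatenate payloads
                (a, b) -> new Message(a.messageId, a.payload + b.payload));
        }
        return out;
    }
}

The open question is where this buffering and combining should live inside Spark.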

Amit Kumar
  • What kind of aggregation are you thinking of? Do you want an aggregate value (like a sum), or do you want a message of messages? – kellanburket Feb 10 '18 at 15:46
  • A message of messages: assume you have 10 messages with the same message id; my result should be 1 big message containing all 10 inside it. Hope that clears it up. – Amit Kumar Feb 10 '18 at 16:55

1 Answer


I think what you're looking for is Spark Streaming. Spark has a Kafka Connector that can link into a Spark Streaming Context.

Here's a really basic example that creates an RDD for all messages in a given set of topics over a 1-minute interval, then groups them by a message id field (the type produced by your value deserializer would have to expose such a getMessageId method, of course).

import java.util.*;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka010.ConsumerStrategies;
import org.apache.spark.streaming.kafka010.KafkaUtils;
import org.apache.spark.streaming.kafka010.LocationStrategies;

SparkConf conf = new SparkConf().setAppName(appName);
// A 1-minute batch interval: each RDD covers one minute of messages.
JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.minutes(1));

Map<String, Object> params = new HashMap<String, Object>() {{
    put("bootstrap.servers", kafkaServers);
    put("key.deserializer", kafkaKeyDeserializer);
    put("value.deserializer", kafkaValueDeserializer);
    put("group.id", consumerGroupId); // the Kafka 0.10 consumer requires a group id
}};

List<String> topics = new ArrayList<String>() {{
    // Add Topics
}};

// Message stands in for whatever type your value deserializer produces;
// it must expose getMessageId().
JavaInputDStream<ConsumerRecord<String, Message>> stream =
    KafkaUtils.createDirectStream(ssc,
        LocationStrategies.PreferConsistent(),
        ConsumerStrategies.<String, Message>Subscribe(topics, params)
    );

stream.foreachRDD(rdd ->
    rdd.groupBy(record -> record.value().getMessageId())
       // groupBy is lazy; an action is needed to run the grouping
       .foreach(group -> {
           // group._1() is the message id, group._2() the grouped records:
           // combine them into your single large message here
       }));

ssc.start();
ssc.awaitTermination();

There are several other ways to group the messages within the streaming API; look at the documentation for more examples.
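
For instance, a sketch of one alternative (my addition, assuming the same hypothetical Message value type as above): key the stream with mapToPair and group with groupByKey, so the grouping is declared on the DStream itself rather than inside foreachRDD.

import org.apache.spark.streaming.api.java.JavaPairDStream;
import scala.Tuple2;

// Key each record by its message id, then group within each 1-minute batch.
JavaPairDStream<String, Iterable<ConsumerRecord<String, Message>>> grouped =
    stream.mapToPair(record -> new Tuple2<>(record.value().getMessageId(), record))
          .groupByKey();

grouped.foreachRDD(rdd ->
    rdd.foreach(entry -> {
        // entry._1() is the message id; entry._2() holds all of that id's
        // records for the batch: combine them into one large message here.
    }));

Note that with either approach the 1-minute buffering comes from the batch interval; if you need a window longer than the batch, look at window/reduceByKeyAndWindow in the docs.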

kellanburket