Use case: I have messages with a messageId, and multiple messages can share the same messageId. These messages flow through a streaming pipeline (e.g., Kafka) that is partitioned by messageId, so I am guaranteed that all messages with the same messageId land in the same partition.
I need to write a job that buffers messages for some amount of time (say, 1 minute) and, after that time, combines all buffered messages with the same messageId into a single large message.
I am thinking this can be done with Spark Datasets and Spark SQL (or something else?), but I could not find any example or documentation on how to hold messages for a given messageId for some time and then run an aggregation over them.
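For reference, here is a minimal sketch of one way to do this with Spark Structured Streaming: group by a 1-minute event-time window plus messageId, and use `collect_list` to gather all payloads in that window into one combined record. The topic name `messages`, the broker address, the assumption that the Kafka key is the messageId, and the `"|"` delimiter are all placeholders; this is untested against a real cluster.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object MessageAggregator extends App {
  val spark = SparkSession.builder.appName("MessageAggregator").getOrCreate()
  import spark.implicits._

  // Read from Kafka; topic and broker address are assumptions.
  val raw = spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "messages")
    .load()

  // Assume the Kafka record key holds the messageId and the value the payload.
  val messages = raw.selectExpr(
    "CAST(key AS STRING) AS messageId",
    "CAST(value AS STRING) AS payload",
    "timestamp"
  )

  // "Buffer for 1 minute": group by a 1-minute event-time window and messageId,
  // then concatenate every payload seen in that window into one large message.
  // The watermark lets Spark finalize and emit a window once it closes.
  val combined = messages
    .withWatermark("timestamp", "1 minute")
    .groupBy(window($"timestamp", "1 minute"), $"messageId")
    .agg(concat_ws("|", collect_list($"payload")).as("combinedPayload"))

  // In append mode, one combined record per (window, messageId) is emitted
  // after the watermark passes the end of the window.
  val query = combined.writeStream
    .outputMode("append")
    .format("console")
    .start()

  query.awaitTermination()
}
```

Note that because the stream is already partitioned by messageId, the shuffle introduced by `groupBy` moves relatively little data; also, in append mode results appear only after the watermark expires, so expect output slightly later than the 1-minute window itself.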