I've been trying to use reactive-kafka, and I am having a problem with conditional processing, to which I didn't find a satisfying answer.
Basically I'm trying to consume one kafka topic which contains a huge number of messages (around 10 billion messages a day), and only process a few of those messages (a few thousands a day) based on some property of the message, then push the processed version of my message to another topic, and I am struggling to do that properly.
My first attempt was something like:
// This is pseudo-code.
Consumer
  .committableSource(consumerSettings, Subscriptions.topics("source-topic"))
  .filter(msg => isProcessable(msg))
  .map { msg =>
    // Wrap the processed record together with its offset so it can be
    // committed after the producer has written it.
    ProducerMessage.Message(
      new ProducerRecord("target-topic", process(msg)),
      msg.committableOffset)
  }
  .via(Producer.flow(producerSettings))
  .map(_.message.passThrough.commitScaladsl())
  .runWith(Sink.ignore)
The problem with this approach is that I only commit offsets for the messages I am actually able to process, which is obviously not acceptable: if I have to stop and restart my program, I have to re-read a bunch of useless messages, and since there are so many of them, I can't afford to do it that way.
I then tried to use the GraphDSL by doing something along the lines of:
in ~> broadcast ~> isProcessable ~> process ~> producer ~> merge ~> commit
~> broadcast ~> isNotProcessable ~> merge
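In code, the diagram above corresponds to something like the following sketch (not runnable as-is: `Msg` stands for the committable message type, and `isProcessable` / `processAndProduce` are placeholders for the filtering, processing and producing stages described above):

```scala
// Sketch only, illustrating the second attempt.
val flow = GraphDSL.create() { implicit b =>
  import GraphDSL.Implicits._

  val broadcast = b.add(Broadcast[Msg](2))
  val merge     = b.add(Merge[Msg](2))

  // Branch 1: processable messages are transformed and produced.
  broadcast.out(0).filter(isProcessable)    ~> processAndProduce ~> merge.in(0)
  // Branch 2: everything else skips straight to the merge, so its offsets
  // reach the commit stage while branch 1 is still producing.
  broadcast.out(1).filterNot(isProcessable) ~> merge.in(1)

  FlowShape(broadcast.in, merge.out) // committing happens downstream of this
}
```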
This solution is not good either, because messages that I can't process go through the second branch of the graph and get committed before the processable messages have actually been pushed to their destination. That is arguably worse than the first attempt, because it does not even guarantee at-least-once delivery.
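To make the hazard concrete, here is a small plain-Scala simulation (no Kafka involved; the message type, offsets, and crash scenario are made up for illustration) of what happens when the unprocessable branch commits ahead of the processing branch:

```scala
// Simulated partition: offsets 0..4 arrive, only offset 2 is processable.
case class Msg(offset: Long, processable: Boolean)

val incoming = List(
  Msg(0L, processable = false),
  Msg(1L, processable = false),
  Msg(2L, processable = true), // still in flight in branch 1 when we "crash"
  Msg(3L, processable = false),
  Msg(4L, processable = false)
)

// Fast branch: unprocessable messages are committed immediately.
val committedFast = incoming.filterNot(_.processable).map(_.offset)

// Kafka tracks a single committed position per partition, so the highest
// committed offset wins regardless of which branch committed it.
val resumeFrom = committedFast.max + 1

// After a restart, every processable message below the resume point is lost.
val lost = incoming.filter(m => m.processable && m.offset < resumeFrom)

println(s"resume from offset $resumeFrom, lost offsets: ${lost.map(_.offset)}")
```

Here the fast branch commits offsets 0, 1, 3 and 4 before offset 2 has been produced, so a restart resumes past offset 2 and the one message that mattered is silently skipped.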
Does anybody have an idea how I could solve this problem?