I need to parse complex messages on Kafka using multiple transformers. Each transformer parses a part of the message and fills some attributes on it. In the end, the fully parsed message is stored in the database by a Kafka consumer. Currently, I'm doing this:
streamsBuilder.stream(Topic.A, someConsumer)
    // filters messages that have unparsed parts of type X
    .filter(filterX)
    // transformer that edits the message and produces new Topic.E messages
    .transform(ParseXandProduceE::new)
    .to(Topic.A, someProducer);

streamsBuilder.stream(Topic.A, someConsumer)
    // filters messages that have unparsed parts of type Y
    .filter(filterY)
    // transformer that edits the message and produces new Topic.F messages
    .transform(ParseYandProduceF::new)
    .to(Topic.A, someProducer);
A Transformer looks like this:
class ParseXandProduceE implements Transformer<String, Message, KeyValue<String, Message>> {
    private ProcessorContext context;

    @Override
    public void init(ProcessorContext context) {
        this.context = context;
    }

    @Override
    public KeyValue<String, Message> transform(String key, Message message) {
        message.x = parse(message.rawX);
        context.forward(newKey, message.x, To.child(Topic.E));
        return KeyValue.pair(key, message);
    }
}
However, this is cumbersome: the same messages flow through these streams multiple times.
Additionally, there is a consumer that stores messages of Topic.A in the database. Messages are currently stored multiple times: before each transformation and after each transformation. Each message must be stored exactly once.
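For illustration, one way the persisting consumer could avoid duplicate writes is to store a message only once it is complete. This is a minimal plain-Java sketch of such a check; the `Message` shape here is hypothetical, mirroring only the fields used in the question:

```java
// Hypothetical message shape, mirroring the fields used in the question;
// not the real Message class.
class Message {
    String rawX, rawY; // raw, unparsed parts (null if the part is absent)
    String x, y;       // attributes filled in by the transformers
}

class PersistenceFilter {
    // A message is ready for the database only when every raw part
    // it carries has been parsed into its corresponding attribute.
    static boolean isFullyParsed(Message m) {
        return (m.rawX == null || m.x != null)
            && (m.rawY == null || m.y != null);
    }
}
```

The consumer would then skip any record for which `isFullyParsed` returns false, so partially parsed intermediates never reach the database.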
The following could work, but seems unfavorable, since each filter+transform block could otherwise have been put cleanly into its own separate class:
streamsBuilder.stream(Topic.A, someConsumer)
    // transformer that filters and edits the message and produces new Topic.E + Topic.F messages
    .transform(someTransformer)
    .to(Topic.B, someProducer);
and make the persistence consumer listen to Topic.B.
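To make the trade-off concrete, here is a minimal plain-Java sketch of what `someTransformer` would do in a single pass, independent of the Kafka Streams API. The class names are hypothetical, and `parse` is a stand-in for the real per-part parsing logic:

```java
// Hypothetical message shape, mirroring the fields used above.
class Message {
    String rawX, rawY; // raw, unparsed parts (null if the part is absent)
    String x, y;       // attributes filled by parsing
}

// One transformer pass that handles every part type, so each message
// flows through the stream once instead of once per part.
class CombinedParser {
    // Stand-in for the real parsing logic.
    static String parse(String raw) {
        return raw.toUpperCase();
    }

    // Fills every attribute that still has an unparsed raw counterpart.
    // Returns true if anything changed, i.e. the transformer should
    // forward derived records (Topic.E / Topic.F) downstream.
    static boolean parseAll(Message m) {
        boolean changed = false;
        if (m.x == null && m.rawX != null) {
            m.x = parse(m.rawX);
            changed = true;
        }
        if (m.y == null && m.rawY != null) {
            m.y = parse(m.rawY);
            changed = true;
        }
        return changed;
    }
}
```

Inside the actual Transformer, `transform()` would call `parseAll(message)` and use `context.forward(...)` to emit the Topic.E / Topic.F records for whichever parts were parsed, before returning the fully parsed message for Topic.B.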
Is the latter proposed solution the way to go, or is there some other way to achieve the same result? Maybe with a complete Topology configuration of Sources and Sinks? If so, what would that look like for this scenario?