I'm having some issues with my Kafka Streams implementation in production. I've implemented a function that takes a `KTable` and a `KStream` and yields another `KTable` with aggregated results based on the join of these two inputs. The idea is to iterate over a list of ids carried by each `KStream` record, join each id against the `KTable`, aggregate the matched `KTable` events into a list, and sink to a topic an event containing the original `KStream` record and the list of joined `KTable` events (a 1 to N join).
Context
This is how my component interacts with its context: `MovementEvent` contains a list of `transaction_ids` that should match the `transaction_id` of a `TransactionEvent`, and the joiner should match them and generate a new event (`SinkedEvent`) with the original `MovementEvent` and a list of the matched `TransactionEvent`s.
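To make the relationship concrete, the event shapes look roughly like this, using the class names from the implementation below (a simplified sketch; the real classes are generated builder-style types and carry more fields than shown here):

import java.util.List;

// Simplified sketch of the domain events (field names beyond the ids are assumptions).
record Transaction(String transactionId) {}
record Movement(String movementId, List<String> transactionIds) {}
// Expected output: exactly one SinkedEvent per Movement, holding the original Movement
// and every Transaction whose transactionId appears in movement.transactionIds().
record SinkedEvent(Movement movement, List<Transaction> transactions) {}

So a Movement referencing 4 transaction_ids should end up as a single SinkedEvent with 4 entries in its transaction list.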
For reference, the Movement topic has 12 million records, while the Transaction topic has 21 million.
Implementation
import java.util.function.BiFunction;

import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;

// Movement, Transaction, Pair, TransactionKey, SinkedEventKey and SinkedEvent are
// generated builder-style model classes (their imports are omitted here).
public class SinkEventProcessor implements BiFunction<
        KTable<TransactionKey, Transaction>,
        KStream<String, Movement>,
        KTable<SinkedEventKey, SinkedEvent>> {

    @Override
    public KTable<SinkedEventKey, SinkedEvent> apply(final KTable<TransactionKey, Transaction> transactionTable,
                                                     final KStream<String, Movement> movementStream) {
        return movementStream
                // [A] re-key each Movement by every transactionId it references
                .flatMap((movementKey, movement) -> movement
                        .getTransactionIds()
                        .stream()
                        .distinct()
                        .map(transactionId -> new KeyValue<>(
                                TransactionKey.newBuilder()
                                        .setTransactionId(transactionId)
                                        .build(),
                                movement))
                        .toList())
                // [B] join each (TransactionKey, Movement) record against the transaction table
                .join(transactionTable, (movement, transaction) -> Pair.newBuilder()
                        .setMovement(movement)
                        .setTransaction(transaction)
                        .build())
                // [C] group the joined pairs back by movementId
                .groupBy((transactionKey, pair) -> SinkedEventKey.newBuilder()
                        .setMovementId(pair.getMovement().getMovementId())
                        .build())
                // [D] collect all transactions of a movement into a single SinkedEvent
                .aggregate(SinkedEvent::new, (key, pair, collectable) ->
                        collectable.setMovement(pair.getMovement())
                                .addTransaction(pair.getTransaction()));
    }
}
[A] I started the implementation by iterating the Movement `KStream`, extracting each transactionId and creating a `TransactionKey` to use as the new key for the following operation, to facilitate the join with each `transactionId` present in the `Movement` entity. This operation returns a `KStream<TransactionKey, Movement>`.
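To make the fan-out concrete, a single input record would come out of this step roughly as follows (the ids are made up for illustration):

// A Movement m1 with transactionIds [t1, t2, t3] is emitted as three records,
// re-keyed by transaction id and all carrying the same Movement value:
//   (TransactionKey{transactionId = "t1"}, m1)
//   (TransactionKey{transactionId = "t2"}, m1)
//   (TransactionKey{transactionId = "t3"}, m1)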
[B] Joins the formerly transformed `KStream` with the transaction `KTable` and puts each pair of joined values into an intermediate `Pair`. Returns a `KStream<TransactionKey, Pair>`.
[C] Groups the pairs by `movementId` and constructs the new key (`SinkedEventKey`) for the sink operation.
[D] Aggregates into the result object (`SinkedEvent`) by adding each `transaction` to the list. This operation will also sink to the topic as a `KTable<SinkedEventKey, SinkedEvent>`.
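The returned `KTable` is what ends up in the output topic; written out by hand, that would be roughly equivalent to the sketch below (not my actual code; the topic name and serdes are placeholders):

// sinkedEventTable is the KTable<SinkedEventKey, SinkedEvent> returned by apply();
// its changelog is converted back to a stream and written to the output topic.
sinkedEventTable
        .toStream()
        .to("sinked-events-topic", Produced.with(sinkedEventKeySerde, sinkedEventSerde));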
Problem
The problem starts when we begin processing the stream: the sink operation of the processor generates more records than it should. For instance, for a Movement with 4 `transaction_ids`, the output topic ends up looking like this:
partition | offset | count of [TransactionEvent] | expected count |
---|---|---|---|
0 | 1 | 1 | 4 |
0 | 2 | 2 | 4 |
0 | 3 | 4 | 4 |
0 | 4 | 4 | 4 |
And the same happens for other records (e.g. a Movement with 13 `transaction_ids` will yield 13 messages). So, for some reason that I can't comprehend, the `aggregate` operation is sinking a record on every update, instead of waiting, collecting into the list, and sinking only once.
I've tried to reproduce it in a development cluster, with exactly the same settings, to no avail. Everything seems to work properly when I try to reproduce it (a Movement with 8 transactions produces only 1 record), but whenever I bring it to production it doesn't work as intended. I'm not sure what I'm missing; any help?