
I’m having some issues with my Kafka Streams implementation in production. I’ve implemented a function that takes a KTable and a KStream and yields another KTable with aggregated results based on the join of these two inputs. The idea is to iterate over a list inside each KStream record, join each element with the KTable, and aggregate the matched KTable events into a list, sinking to a topic a single event that contains the original KStream record and the list of matched KTable events (a 1-to-N join).

Context

[Diagram: wanted topology]

This is how my component interacts with its context. MovementEvent contains a list of transaction_ids, each of which should match the transaction_id of a TransactionEvent; the joiner should match them and generate a new event (SinkedEvent) containing the original MovementEvent and a list of the matched TransactionEvents.

For reference, the Movement topic has 12 million records, while the Transaction topic has 21 million.

Implementation
public class SinkEventProcessor implements BiFunction<
        KTable<TransactionKey, Transaction>,
        KStream<String, Movement>,
        KTable<SinkedEventKey, SinkedEvent>> {

    @Override
    public KTable<SinkedEventKey, SinkedEvent> apply(final KTable<TransactionKey, Transaction> transactionTable,
                                                     final KStream<String, Movement> movementStream) {
        return movementStream
                // [A]
                .flatMap((movementKey, movement) -> movement
                        .getTransactionIds()
                        .stream()
                        .distinct()
                        .map(transactionId -> new KeyValue<>(
                                TransactionKey.newBuilder()
                                        .setTransactionId(transactionId)
                                        .build(),
                                movement))
                        .toList())
                // [B]
                .join(transactionTable, (movement, transaction) -> Pair.newBuilder()
                        .setMovement(movement)
                        .setTransaction(transaction)
                        .build())
                // [C]
                .groupBy((transactionKey, pair) -> SinkedEventKey.newBuilder()
                        .setMovementId(pair.getMovement().getMovementId())
                        .build())
                // [D]
                .aggregate(SinkedEvent::new, (key, pair, collectable) ->
                        collectable.setMovement(pair.getMovement())
                                .addTransaction(pair.getTransaction()));
    }
}

[A] I start by iterating the Movement KStream, extracting each transactionId and creating a TransactionKey to use as the new key for the following operation, which facilitates the join against the KTable. This operation returns a KStream<TransactionKey, Movement>.
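To make [A] concrete, here is a minimal sketch of the fan-out step using plain collections instead of Kafka Streams. Movement here is a simplified stand-in for the generated class, and fanOut mirrors what the flatMap does: one record per distinct transaction id, so the record count multiplies before the join.

```java
import java.util.List;
import java.util.Map;

public class FanOutSketch {
    // Hypothetical stand-in for the generated Movement type.
    record Movement(String movementId, List<String> transactionIds) {}

    // Mirrors step [A]: one Movement fans out into one (transactionId, movement)
    // entry per *distinct* transaction id, duplicates collapsed by distinct().
    static List<Map.Entry<String, Movement>> fanOut(Movement movement) {
        return movement.transactionIds().stream()
                .distinct()
                .map(txId -> Map.entry(txId, movement))
                .toList();
    }

    public static void main(String[] args) {
        Movement m = new Movement("mov-1", List.of("tx-1", "tx-2", "tx-2", "tx-3"));
        System.out.println(fanOut(m).size()); // prints 3: the duplicate tx-2 collapses
    }
}
```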

[B] Joins the formerly transformed KStream and adds each value to an intermediate pair. Returns a `KStream<TransactionKey, Pair>`.

[C] Groups the pairs by movementId and constructs the new key (SinkedEventKey) for the sink operation.

[D] Aggregates into the result object (SinkedEvent) by adding each transaction to the list. This operation also sinks to the topic as a KTable<SinkedEventKey, SinkedEvent>.
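For reference, the fold in [D] can be sketched with plain collections (Pair and SinkedEvent are simplified stand-ins for the generated classes). Note that each joined pair is a separate state-store update for the key, which is where an intermediate result could in principle be forwarded downstream:

```java
import java.util.ArrayList;
import java.util.List;

public class AggregateSketch {
    // Hypothetical stand-ins for the generated Pair/SinkedEvent types.
    record Pair(String movementId, String transactionId) {}

    static class SinkedEvent {
        String movementId;
        final List<String> transactionIds = new ArrayList<>();
        SinkedEvent add(Pair p) {
            this.movementId = p.movementId();
            this.transactionIds.add(p.transactionId());
            return this;
        }
    }

    // Mirrors step [D]: fold all joined pairs for one movement into one SinkedEvent.
    // Every iteration is one aggregator invocation, i.e. one state update per pair.
    static SinkedEvent fold(List<Pair> pairs) {
        SinkedEvent state = new SinkedEvent();
        for (Pair p : pairs) {
            state = state.add(p);
        }
        return state;
    }

    public static void main(String[] args) {
        List<Pair> pairs = List.of(
                new Pair("mov-1", "tx-1"), new Pair("mov-1", "tx-2"),
                new Pair("mov-1", "tx-3"), new Pair("mov-1", "tx-4"));
        SinkedEvent result = fold(pairs);
        // prints "4 transactions after 4 updates"
        System.out.println(result.transactionIds.size() + " transactions after "
                + pairs.size() + " updates");
    }
}
```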

Problem

The problem appears once we start processing the stream: the sink operation of the processor generates more records than it should. For instance, for a Movement with 4 transaction_ids, the output topic ends up looking like this:

partition | offset | count of [TransactionEvent] | expected count
0         | 1      | 1                           | 4
0         | 2      | 2                           | 4
0         | 3      | 4                           | 4
0         | 4      | 4                           | 4

And the same happens for other records (e.g. a Movement with 13 transaction_ids yields 13 messages). So for some reason I can't comprehend, the aggregate operation sinks a record on each update, instead of collecting into the list and sinking only once.

I've tried to reproduce it in a development cluster with exactly the same settings, to no avail. Everything seems to work properly when I try to reproduce it there (a Movement with 8 transactions produces only 1 record), but whenever I bring it to production it doesn't work as intended. I'm not sure what I'm missing; any help?
