I'm having some issues with my Kafka Streams implementation in production. I've implemented a function that takes a `KTable` and a `KStream` and yields another `KTable` with aggregated results based on the join of these two inputs. The idea is to iterate over a list of ids carried by each `KStream` record, join each id against the `KTable`, aggregate the matched `KTable` events into a list, and sink to a topic an event containing the original `KStream` record and the list of joined `KTable` events (a 1 to N join).
Context
This is how my component interacts with its context: `MovementEvent` contains a list of `transaction_ids` that should match the `transaction_id` of a `TransactionEvent`, and the joiner should match them and generate a new event (`SinkedEvent`) with the original `MovementEvent` and a list of the matched `TransactionEvent`s.
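To make the relationship concrete, the event shapes look roughly like this, using the class names from the implementation below (a simplified sketch; the real classes are generated builder-style types and carry more fields than shown here):

import java.util.List;

// Simplified sketch of the domain events (field names beyond the ids are assumptions).
record Transaction(String transactionId) {}
record Movement(String movementId, List<String> transactionIds) {}
// Expected output: exactly one SinkedEvent per Movement, holding the original Movement
// and every Transaction whose transactionId appears in movement.transactionIds().
record SinkedEvent(Movement movement, List<Transaction> transactions) {}

So a Movement referencing 4 transaction_ids should end up as a single SinkedEvent with 4 entries in its transaction list.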
For reference, the Movement topic has 12 million records, while the Transaction topic has 21 million.
Implementation
import java.util.function.BiFunction;

import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;

// Movement, Transaction, Pair, TransactionKey, SinkedEventKey and SinkedEvent are
// generated builder-style model classes (their imports are omitted here).
public class SinkEventProcessor implements BiFunction<
        KTable<TransactionKey, Transaction>,
        KStream<String, Movement>,
        KTable<SinkedEventKey, SinkedEvent>> {

    @Override
    public KTable<SinkedEventKey, SinkedEvent> apply(final KTable<TransactionKey, Transaction> transactionTable,
                                                     final KStream<String, Movement> movementStream) {
        return movementStream
                // [A] re-key each Movement by every transactionId it references
                .flatMap((movementKey, movement) -> movement
                        .getTransactionIds()
                        .stream()
                        .distinct()
                        .map(transactionId -> new KeyValue<>(
                                TransactionKey.newBuilder()
                                        .setTransactionId(transactionId)
                                        .build(),
                                movement))
                        .toList())
                // [B] join each (TransactionKey, Movement) record against the transaction table
                .join(transactionTable, (movement, transaction) -> Pair.newBuilder()
                        .setMovement(movement)
                        .setTransaction(transaction)
                        .build())
                // [C] group the joined pairs back by movementId
                .groupBy((transactionKey, pair) -> SinkedEventKey.newBuilder()
                        .setMovementId(pair.getMovement().getMovementId())
                        .build())
                // [D] collect all transactions of a movement into a single SinkedEvent
                .aggregate(SinkedEvent::new, (key, pair, collectable) ->
                        collectable.setMovement(pair.getMovement())
                                .addTransaction(pair.getTransaction()));
    }
}
[A] I started the implementation by iterating the Movement `KStream`, extracting each transactionId and creating a `TransactionKey` to use as the new key for the following operation, to facilitate the join with each `transactionId` present in the `Movement` entity. This operation returns a `KStream<TransactionKey, Movement>`.
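To make the fan-out concrete, a single input record would come out of this step roughly as follows (the ids are made up for illustration):

// A Movement m1 with transactionIds [t1, t2, t3] is emitted as three records,
// re-keyed by transaction id and all carrying the same Movement value:
//   (TransactionKey{transactionId = "t1"}, m1)
//   (TransactionKey{transactionId = "t2"}, m1)
//   (TransactionKey{transactionId = "t3"}, m1)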
[B] Joins the formerly transformed `KStream` with the transaction `KTable` and puts each pair of joined values into an intermediate `Pair`. Returns a `KStream<TransactionKey, Pair>`.
[C] Groups the pairs by `movementId` and constructs the new key (`SinkedEventKey`) for the sink operation.
[D] Aggregates into the result object (`SinkedEvent`) by adding each `transaction` to the list. This operation will also sink to the topic as a `KTable<SinkedEventKey, SinkedEvent>`.
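The returned `KTable` is what ends up in the output topic; written out by hand, that would be roughly equivalent to the sketch below (not my actual code; the topic name and serdes are placeholders):

// sinkedEventTable is the KTable<SinkedEventKey, SinkedEvent> returned by apply();
// its changelog is converted back to a stream and written to the output topic.
sinkedEventTable
        .toStream()
        .to("sinked-events-topic", Produced.with(sinkedEventKeySerde, sinkedEventSerde));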
Problem
The problem starts when we begin processing the stream: the sink operation of the processor generates more records than it should. For instance, for a Movement with 4 `transaction_ids`, the output topic ends up looking like this:
partition | offset | count of [TransactionEvent] | expected count |
---|---|---|---|
0 | 1 | 1 | 4 |
0 | 2 | 2 | 4 |
0 | 3 | 4 | 4 |
0 | 4 | 4 | 4 |
And the same happens for other records (e.g. a Movement with 13 `transaction_ids` will yield 13 messages). So, for some reason that I can't comprehend, the `aggregate` operation is sinking a record on every update, instead of waiting, collecting into the list, and sinking only once.
I've tried to reproduce it in a development cluster, with exactly the same settings, to no avail. Everything seems to work properly when I try to reproduce it (a Movement with 8 transactions produces only 1 record), but whenever I bring it to production it doesn't work as intended. I'm not sure what I'm missing; any help?