We've defined a basic subscriber that skips over failed messages (i.e., messages that, for some business-logic reason, we are not going to handle) by throwing an exception and relying on Akka Streams' stream supervision to resume the Flow:
    someLagomService
      .someTopic()
      .subscribe
      .withGroupId("lagom-service")
      .atLeastOnce(
        Flow[Int]
          .mapAsync(1)(el => {
            // Exception may occur here or can map to Done
          })
          // Resume on any failure: the failing element is dropped and the stream continues
          .withAttributes(ActorAttributes.supervisionStrategy({
            case t =>
              Supervision.Resume
          }))
      )
This seems to work fine for basic use cases under very little load, but we have noticed very strange behavior with larger numbers of messages (e.g., very frequent re-processing of messages).
Digging into the code, we saw that the documentation for Lagom's broker.Subscriber.atLeastOnce states:
"The flow may pull more elements from upstream but it must emit exactly one Done message for each message that it receives. It must also emit them in the same order that the messages were received. This means that the flow must not filter or collect a subset of the messages, instead it must split the messages into separate streams and map those that would have been dropped to Done."
Additionally, in the implementation of Lagom's KafkaSubscriberActor, we see that the private atLeastOnce method essentially unzips the message payload and offset and then zips them back up after our user flow maps messages to Done.
These two observations seem to imply that by using stream supervision to skip elements, we can end up in a situation where the committable offsets no longer zip up evenly with the Dones that are to be produced per Kafka message.
Example: if we stream 1, 2, 3, 4 and map 1, 2, and 4 to Done but throw an exception on 3, do we end up with 3 Dones and 4 committable offsets?
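To make the suspected misalignment concrete, here is a minimal model of the example using plain Scala collections instead of Akka Streams. It is only an illustration of positional zipping, not Lagom's actual implementation: the offset values are made up, and each Done is tagged with the payload that produced it so the pairing is visible.

```scala
object ZipMisalignment {
  // Hypothetical committable offsets for the four Kafka messages.
  val offsets: List[Long] = List(100L, 101L, 102L, 103L)
  val payloads: List[Int] = List(1, 2, 3, 4)

  // Models the user flow under Supervision.Resume: element 3 throws,
  // is dropped, and therefore never produces a Done.
  // Each surviving entry stands for "the Done produced by this payload".
  val dones: List[Int] = payloads.filter(_ != 3)

  // Positional zip: the i-th Done is paired with the i-th offset,
  // regardless of which message actually produced that Done.
  val paired: List[(Long, Int)] = offsets.zip(dones)
}
```

In this model, the Done produced by message 4 is paired with offset 102 (the offset of the skipped message 3), and offset 103 never gets a partner at all.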
- Is this correct / expected? Does this mean we should AVOID using stream supervisors here?
- What sorts of behavior can the uneven zipping cause?
- What is the recommended approach for error handling when consuming messages from Kafka via the Lagom message broker API? Is the right thing to do to map / recover failures to Done?
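For concreteness, the "recover failures to Done" alternative from the last question might look roughly like the sketch below. This is only our guess at the approach, not something we have confirmed is recommended. `handle` is a hypothetical stand-in for our business logic, and the snippet assumes Akka Streams is on the classpath (as it is in a Lagom project).

```scala
import akka.{Done, NotUsed}
import akka.stream.scaladsl.Flow
import scala.concurrent.{ExecutionContext, Future}
import scala.util.control.NonFatal

object RecoverToDone {
  // Hypothetical business logic: fails for element 3, succeeds otherwise.
  def handle(el: Int)(implicit ec: ExecutionContext): Future[Done] =
    Future {
      if (el == 3) throw new IllegalStateException(s"cannot handle $el")
      Done
    }

  // Turns a failed Future into Done, so a skipped message still yields
  // exactly one Done instead of being dropped by a supervision strategy.
  def handleOrSkip(el: Int)(implicit ec: ExecutionContext): Future[Done] =
    handle(el).recover { case NonFatal(_) => Done }

  // mapAsync(1) preserves ordering, so Dones are emitted in message order.
  def userFlow(implicit ec: ExecutionContext): Flow[Int, Done, NotUsed] =
    Flow[Int].mapAsync(1)(el => handleOrSkip(el))
}
```

With this shape, `userFlow` would be passed to `atLeastOnce` without any supervision attributes, since no exception escapes the Future returned to `mapAsync`.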
Using Lagom 1.4.10