
We've defined a basic subscriber that skips over failed messages (i.e. messages we are not going to handle, for some business-logic reason) by throwing an exception and relying on Akka Streams' stream supervision to resume the Flow:

import akka.Done
import akka.stream.{ActorAttributes, Supervision}
import akka.stream.scaladsl.Flow

import scala.concurrent.Future

someLagomService
  .someTopic()
  .subscribe
  .withGroupId("lagom-service")
  .atLeastOnce(
    Flow[Int]
      .mapAsync(1)(el => {
        // business logic goes here: it may throw an exception, or complete with Done
        Future.successful(Done)
      })
      .withAttributes(ActorAttributes.supervisionStrategy {
        case _ => Supervision.Resume
      })
  )

This seems to work fine for basic use cases under very little load, but we have noticed very strange behaviour with larger numbers of messages (e.g. very frequent re-processing of messages).

Digging into the code, we saw that Lagom's broker.Subscriber.atLeastOnce documentation states:

The flow may pull more elements from upstream but it must emit exactly one Done message for each message that it receives. It must also emit them in the same order that the messages were received. This means that the flow must not filter or collect a subset of the messages, instead it must split the messages into separate streams and map those that would have been dropped to Done.
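We read that as requiring something like the following sketch, where shouldHandle and process are placeholders rather than our real logic: instead of filtering out unwanted messages, they are mapped straight to Done so the flow still emits exactly one Done per message, in order.

import akka.{Done, NotUsed}
import akka.stream.scaladsl.Flow

// Hypothetical sketch: shouldHandle / process stand in for the real business logic.
def skippingFlow(shouldHandle: Int => Boolean, process: Int => Done): Flow[Int, Done, NotUsed] =
  Flow[Int].map { el =>
    if (shouldHandle(el)) process(el) // normal handling, ends in Done
    else Done                         // a "skipped" message is still acknowledged with a Done
  }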

Additionally, in the implementation of Lagom's KafkaSubscriberActor, we see that the private atLeastOnce essentially unzips the message payload and committable offset, runs the payloads through our user flow, and then rezips the resulting Dones back up with the offsets.

These two tidbits above seem to imply that by using stream supervisors and skipping elements, we can end up in a situation where the committable offsets no longer zip up evenly with the Dones that are to be produced per Kafka message.

Example: if we stream 1, 2, 3, 4 and map 1, 2, and 4 to Done but throw an exception on 3, don't we end up with 3 Dones and 4 committable offsets?
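To illustrate the mismatch we suspect, here is a rough sketch with plain Scala collections (not Lagom's actual code):

// Messages 1, 2, 3, 4 are pulled, but the user flow drops message 3:
val committableOffsets = List(1, 2, 3, 4)                      // one offset per message
val emittedDones       = List("Done(1)", "Done(2)", "Done(4)") // only 3 Dones

// Re-zipping pairs them positionally, so every Done after the drop is matched
// with the offset of an earlier message:
committableOffsets.zip(emittedDones)
// => List((1, "Done(1)"), (2, "Done(2)"), (3, "Done(4)"))
// Offset 3 is committed on the strength of message 4's Done, offset 4 is never
// committed, and every later commit stays one message behind.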

  • Is this correct / expected? Does this mean we should AVOID using stream supervisors here?
  • What sorts of behavior can the uneven zipping cause?
  • What is the recommended approach for error handling when it comes to consuming messages off of Kafka via the Lagom message broker API? Is the right thing to do to map / recover failures to Done?

Using Lagom 1.4.10

simonl

1 Answer


Is this correct / expected? Does this mean we should AVOID using stream supervisors here?

The official API documentation says:

If the Kafka Lagom message broker module is being used, then by default the stream is automatically restarted when a failure occurs.

So there is no need to add your own supervisionStrategy to manage error handling; the stream is restarted by default, and you should not have to think about "skipped" Done messages.


What sorts of behavior can the uneven zipping cause?

This is exactly why the documentation says:

This means that the flow must not filter or collect a subset of the messages

It can commit the wrong, lower offset, because the remaining Dones get zipped with offsets of earlier messages. On restart, you may then see already-processed messages replayed from that committed lower offset.


What is the recommended approach for error handling when it comes to consuming messages off of Kafka via the Lagom message broker API? Is the right thing to do to map / recover failures to Done?

Lagom takes care of exception handling by dropping the message that caused the error and restarting the stream. Mapping / recovering failures to Done won't change this.

If you need access to these messages later on, you could consider using Try {} for example, i.e. not throwing an exception, and collecting the messages that caused errors by sending them to a different topic. This gives you a chance to monitor the number of errors and to replay the failed messages when the conditions are right, i.e. the bug is fixed.
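A rough sketch of that idea, where handleMessage and publishToDeadLetter are placeholders for your own business logic and your own dead-letter publisher:

import akka.{Done, NotUsed}
import akka.stream.scaladsl.Flow

import scala.concurrent.{ExecutionContext, Future}
import scala.util.control.NonFatal

// Every element ends in exactly one Done, even on failure, so offsets and Dones
// stay aligned; failed messages go to a dead-letter topic for later replay.
def consumerFlow(
    handleMessage: Int => Future[Done],
    publishToDeadLetter: (Int, Throwable) => Future[Done]
)(implicit ec: ExecutionContext): Flow[Int, Done, NotUsed] =
  Flow[Int].mapAsync(1) { el =>
    handleMessage(el).recoverWith {
      case NonFatal(e) => publishToDeadLetter(el, e) // record the failure instead of dropping it
    }
  }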

Ivan Stanislavciuc
  • Thanks for the confirmation re: under-commits! Re: Lagom handling the exceptions and dropping the message, what we've observed is that while it does restart the stream handling the message, it never sends a commit to Kafka, so post-restart it retries the poison message over and over again. Definitely agree that it would be a good strategy to eventually "dead letter" messages to different topics like you've alluded to, but our temporary workaround was to `recover` and map errors to `Done` to unclog the pipes. – simonl Mar 12 '19 at 16:01
  • Thanks for the explanation. I'd expect that the message is completely dropped. – Ivan Stanislavciuc Mar 12 '19 at 16:06