
[Attention] This question is specific to the Lagom framework!

In my current project, I have observed that messages are being cut from the stream between the Source and the Kafka topic publisher when the upstream is fast and the downstream apparently cannot handle all messages in time. As far as I can tell, the cutting is related to the behavior of the PubSubRef.subscriber() method: https://github.com/lagom/lagom/blob/master/pubsub/javadsl/src/main/scala/com/lightbend/lagom/javadsl/pubsub/PubSubRef.scala#L85

The full method definition:

def subscriber(): Source[T, NotUsed] = {
  scaladsl.Source.actorRef[T](bufferSize, OverflowStrategy.dropHead)
    .mapMaterializedValue { ref =>
      mediator ! Subscribe(topic.name, ref)
      NotUsed
    }.asJava
}

OverflowStrategy.dropHead is used here. Can it be changed to use a back-pressure strategy instead?
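
For clarity, this is what I mean by a back-pressure strategy, sketched with plain Akka Streams rather than the Lagom API (as far as I understand, Source.actorRef does not accept OverflowStrategy.backpressure, so a queue-backed source would be needed; materializer, someObject and LOG are placeholders):

import akka.stream.OverflowStrategy;
import akka.stream.javadsl.Sink;
import akka.stream.javadsl.Source;
import akka.stream.javadsl.SourceQueueWithComplete;

// Hypothetical sketch, not part of PubSubRef: a queue-backed source that
// back-pressures the producer instead of dropping elements. offer() returns a
// CompletionStage that completes once the element has been accepted downstream.
SourceQueueWithComplete<CustomObject> queue =
    Source.<CustomObject>queue(1000, OverflowStrategy.backpressure())
        .to(Sink.foreach(obj -> LOG.trace("Consumed {}", obj)))
        .run(materializer);

queue.offer(someObject)
    .thenAccept(result -> LOG.trace("Offer result: {}", result));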

UPD#1: The use case is pretty simple: when a query command is published into the command topic, the service picks it up, queries objects from a DB table, and pushes the resulting list into the result Kafka topic. Code snippet:

objectsResultTopic = pubSub.refFor(TopicId.of(CustomObject.class, OBJECTS_RESULT_TOPIC));

objectQueryTopic().subscribe().atLeastOnce(
    Flow.fromSinkAndSource(
        Flow.fromFunction(this::deserializeCommandAndQueryObjects)
            .mapAsync(concurrency, objects -> objects)
            .flatMapMerge(concurrency, objects -> objects)
            .alsoTo(Sink.foreach(object -> LOG.trace("Sending object {}", object)))
            .to(objectsResultTopic.publisher()),
        Source.repeat(Done.getInstance())
    )
);

When the stream of objects generated by the deserializeCommandAndQueryObjects function contains more elements than the default buffer-size = 1000, it starts cutting elements (in our case ~2.5k objects).
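
To make this easier to reproduce, here is a minimal standalone sketch (made-up names, not our project code) that shows the same dropping behaviour with a plain Source.actorRef and a slow consumer:

import akka.actor.ActorRef;
import akka.actor.ActorSystem;
import akka.stream.ActorMaterializer;
import akka.stream.Materializer;
import akka.stream.OverflowStrategy;
import akka.stream.javadsl.Sink;
import akka.stream.javadsl.Source;

ActorSystem system = ActorSystem.create("drop-head-demo");
Materializer materializer = ActorMaterializer.create(system);

// Buffer of 10 elements with dropHead: the oldest buffered element is
// discarded whenever a new one arrives and the buffer is full.
ActorRef ref = Source.<Integer>actorRef(10, OverflowStrategy.dropHead())
    .to(Sink.foreach(i -> {
        Thread.sleep(100); // slow consumer
        System.out.println(i);
    }))
    .run(materializer);

// Fast producer: most of these 2500 elements never reach the sink.
for (int i = 0; i < 2500; i++) {
    ref.tell(i, ActorRef.noSender());
}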

UPD#2: The source of the objects data is:

// returns CompletionStage<Source<CustomObject, ?>>
jdbcSession.withConnection(
  connection -> Source.from(runQuery(connection, rowConverter))
)

And there is a subscription to objectsResultTopic that publishes the objects to the Kafka topic:

TopicProducer.singleStreamWithOffset(
    offset -> objectsResultTopic.subscriber().map(gm -> {
        JsonNode node = mapper.convertValue(gm, JsonNode.class);
        return Pair.create(node, offset);
    }));
Arghavan
  • What is the source of the data that is published into `objectQueryTopic`? And what is subscribing to `objectsResultTopic`? To reiterate, the API you use here for `objectsResultTopic` _does not_ use Kafka. – Tim Moore Jul 04 '17 at 09:53
  • I may have confused you with the mess of code snippets, but the main idea is to get a PubSubRef `objectsResultTopic` with a subscription to the result Kafka topic = OBJECTS_RESULT_TOPIC, and to send objects loaded from the DB source through the flow into objectsResultTopic.publisher() so they are published to the result topic. – HarshRomash Jul 04 '17 at 10:31
  • This may be getting a bit complex for a StackOverflow Q&A, yes :) I think the bottom line is that the `PubSubRef` is not the best tool for this job. It sounds to me like you're trying to read from a Kafka topic, transform the data, then write the results to another topic. Is that correct? – Tim Moore Jul 06 '17 at 02:11

2 Answers


It sounds like Lagom's distributed publish-subscribe feature may not be the best tool for the job you have.

Your question mentions Kafka, but this feature does not make use of Kafka. Instead, it works by directly broadcasting messages to all subscribers in the cluster. This is an "at most once" messaging transport that may indeed lose messages, and is intended for consumers who care more about keeping up with recent messages than processing every single one. The overflow strategy is not customizable, and you would not want to use back-pressure in these use cases, as it would mean that one slow consumer could slow down delivery to all of the other subscribers.

There are a few other options that you have:

  1. If you do want to use Kafka, you should use Lagom's message broker API. This supports "at least once" delivery semantics, and can be used to ensure that each consumer processes every message (at the cost of possibly increasing latency). A short sketch of this option follows after this list.

    In this case, Kafka acts as a giant durable buffer, so it's even better than back-pressure: the producer and consumer can proceed at different paces, and (when used with partitioning) you can add consumers in order to scale out and process messages more quickly when needed.

    The message broker API can be used when producers and consumers are all in the same service, but it is particularly suitable for communication between services.

  2. If the messages you are sending are persistent entity events, and the consumers are part of the same service, then a persistent read-side processor might be a good option.

    This also provides "at least once" delivery, and if the only effects of processing messages are database updates, then the built-in support for Cassandra read-side databases and relational read-side databases provide "effectively once" semantics, where the database updates are run transactionally to ensure that failures that occur during event processing cannot result in partial updates.

  3. If the messages you are sending are persistent entity events, the consumers are part of the same service, but you want to process the events as a stream, you can access a raw stream of events.

  4. If your use case does not fit into one of the use cases that Lagom supports explicitly, you can use lower-level Akka APIs, including distributed publish-subscribe, to implement something more tailored to your needs.
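
For reference, option 1 looks roughly like this in the javadsl (a sketch only: the ObjectsService name, the "objects-result" topic id, and handleObject are placeholders, not taken from your code):

import com.lightbend.lagom.javadsl.api.Descriptor;
import com.lightbend.lagom.javadsl.api.Service;
import com.lightbend.lagom.javadsl.api.broker.Topic;
import static com.lightbend.lagom.javadsl.api.Service.named;
import static com.lightbend.lagom.javadsl.api.Service.topic;

// Declare a Kafka-backed topic in the service descriptor.
public interface ObjectsService extends Service {

    Topic<CustomObject> objectsResultTopic();

    @Override
    default Descriptor descriptor() {
        return named("objects")
            .withTopics(topic("objects-result", this::objectsResultTopic));
    }
}

The topic itself would be implemented with a TopicProducer (much like the snippet in your question), and a consuming service subscribes with at-least-once semantics:

objectsService.objectsResultTopic()
    .subscribe()
    .atLeastOnce(Flow.fromFunction(this::handleObject));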

The best choice will depend on the specifics of your use case: the source of the messages and the types of consumers you want. If you update your question with more details and add a comment to this answer, I can edit the answer with more specific suggestions.

Tim Moore
  • Hi Tim! Thank you for your response, I've updated the question and added the real use case with code snippet which is affected by this problem. – HarshRomash Jul 04 '17 at 07:58

If anyone is interested, we finally solved the problem by using the Akka Streams Kafka producer API, like this:

ProducerSettings<String, CustomObject> producerSettings =
    ProducerSettings.create(system, new StringSerializer(), new CustomObjectSerializer());

objectQueryTopic().subscribe().atLeastOnce(
    Flow.fromSinkAndSource(
        Flow.fromFunction(this::deserializeCommandAndQueryObjects)
            .mapAsync(concurrency, objects -> objects)
            .flatMapMerge(concurrency, objects -> objects)
            .alsoTo(Sink.foreach(object -> LOG.trace("Sending event {}", object)))
            .map(object -> new ProducerRecord<String, CustomObject>(OBJECTS_RESULT_TOPIC, object))
            .to(Producer.plainSink(producerSettings)),
        Source.repeat(Done.getInstance())));

It works without the intermediate buffering; it just pushes the elements straight into the Kafka topic.
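
The snippet above leaves out the Kafka connection settings; with akka-stream-kafka they would typically be supplied on the ProducerSettings, for example (the broker address is just a placeholder):

ProducerSettings<String, CustomObject> producerSettings =
    ProducerSettings.create(system, new StringSerializer(), new CustomObjectSerializer())
        .withBootstrapServers("localhost:9092"); // placeholder broker address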

  • Oh great! I posted my last comment before reading that you solved the problem. This looks like a good solution to me. – Tim Moore Jul 06 '17 at 02:16
  • @VRomaN How did this solution work out for you? I tried doing something similar, but it seems like in cases where upstream has no demand, downstream will still send Done downstream. Flow.fromSinkAndSourceCoupled has the same behaviour. I tested this only with a vanilla akka-streams implementation - not with the Lagom topic.subscribe.atLeastOnce coupling. – ISJ May 10 '18 at 12:12