
I have to write a simple service (a Record ingester service) that consumes messages from Apache Pulsar and stores them in Elasticsearch, and for that I am using com.sksamuel.pulsar4s.akka.

Messages on Pulsar are produced by another service, the Record pump service. Both services are to be deployed separately in production.

Here is my source:

private val source = committableSource(consumerFn)

The above code works fine and is able to consume messages from Pulsar and write them to ES. However, I am not sure whether we should be using MessageId.earliest when creating the source:

private val source = committableSource(consumerFn, Some(MessageId.earliest))

While testing, I found pros and cons both with and without MessageId.earliest, but in my opinion neither is suitable for production.

1. Without using MessageId.earliest:

a. This adds a constraint that the Record ingester service has to be up before we start the Record pump service.

b. If my Record ingester service goes down (due to an issue or maintenance), the messages produced on Pulsar by the Record pump service will not get consumed after the ingester service is back up. In other words, messages produced while the ingester service is down are never consumed.

So, it seems that only messages put on Pulsar AFTER the consumer has subscribed to the topic get consumed.

But I don't think that's acceptable in production, for the reasons mentioned in points a and b.

2. With MessageId.earliest: Points a and b above are solved, but:

Any time my Record ingester service comes back up (after downtime or maintenance), it starts consuming all messages from the very beginning. Records with the same id get overwritten on the ES side, so this does no harm, but I still don't think it's the right approach: there will be millions of messages on that topic, and every time it will re-consume messages that have already been consumed, which is wasteful.

To me this is also unacceptable in production.

Can anyone please help me figure out what configuration to use to solve both cases? I tried various configurations, such as subscriptionInitialPosition = Some(SubscriptionInitialPosition.Earliest), but no luck.
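For reference, the consumer configuration I tried with the initial position set looked roughly like this (same consumer as in the complete code below):

private val consumerFn = () =>
  pulsarClient.consumer(
    ConsumerConfig(
      subscriptionName = Subscription.generate,
      topics = Seq(statementTopic),
      subscriptionType = Some(SubscriptionType.Shared),
      subscriptionInitialPosition = Some(SubscriptionInitialPosition.Earliest)
    )
  )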

Complete code:

//consumer
private val consumerFn = () =>
  pulsarClient.consumer(
    ConsumerConfig(
      subscriptionName = Subscription.generate,
      topics = Seq(statementTopic),
      subscriptionType = Some(SubscriptionType.Shared)
    )
  )

//create source
private val source = committableSource(consumerFn)

//create intermediate flow
private val intermediateFlow = Flow[CommittableMessage[Array[Byte]]].map { committableSourceMessage =>
  val message       = committableSourceMessage.message
  val obj: MyObject = MyObject.parseFrom(message.value)
  WriteMessage.createIndexMessage(obj.id, JsonUtil.toJson(obj))
}.via(
  ElasticsearchFlow.create(
    indexName = "myindex",
    typeName = "_doc",
    settings = ElasticsearchWriteSettings.Default,
    StringMessageWriter
  )
)

source.via(intermediateFlow).run()
user1270392

1 Answer


What you would want is some form of compaction. See the Pulsar docs for details. You can make consumption compaction-aware with:

ConsumerConfig(
  // other consumer config options as before
  readCompacted = Some(true)
)

There's a discussion in the Pulsar docs about the mechanics of compaction. Note that enabling compaction requires that writes to the topic be keyed, which may or may not have happened in the past.
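If the Record pump service isn't already setting a key on each message, a minimal sketch of keyed production using the plain Pulsar Java client could look like the following (the service URL and topic name are placeholders, and obj.id is assumed to be the String identity you want to compact on):

import org.apache.pulsar.client.api.{PulsarClient, Schema}

val client = PulsarClient.builder()
  .serviceUrl("pulsar://localhost:6650") // placeholder service URL
  .build()

val producer = client.newProducer(Schema.BYTES)
  .topic("statement-topic")              // placeholder: the topic the ingester consumes
  .create()

// Compaction keeps only the latest message per key, so the key must be
// the identity you want to compact on (here, the object's id).
def publish(obj: MyObject): Unit =
  producer.newMessage()
    .key(obj.id)
    .value(obj.toByteArray)              // assumes a protobuf-generated toByteArray
    .send()

Compaction itself also has to be triggered (manually with pulsar-admin topics compact, or automatically via namespace compaction thresholds); again, see the Pulsar docs.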

Compaction can be approximated in Akka in a variety of ways, depending on how many distinct keys there are to compact on, how often they're superseded by later messages, etc. The basic idea would be to have a statefulMapConcat which keeps a Map[String, T] in its state and some means of flushing the buffer.

A simple implementation would be:

import scala.concurrent.duration._

Flow[CommittableMessage[Array[Byte]]].map { csm =>
  Option(MyObject.parseFrom(csm.message.value))
}
// inject a None once a minute so the buffer still gets flushed when the topic is idle
.keepAlive(1.minute, () => None)
.statefulMapConcat { () =>
  var map: Map[String, MyObject] = Map.empty
  var count: Int = 0

  { objOpt: Option[MyObject] =>
    objOpt.map { obj =>
      map = map.updated(obj.id, obj)   // keep only the latest object per id
      count += 1
      if (count == 1000) {             // flush every 1000 incoming messages
        val toEmit = map.values.toList
        count = 0
        map = Map.empty
        toEmit
      } else Nil
    }.getOrElse {
      // None (from keepAlive): flush whatever has accumulated
      val toEmit = map.values.toList
      count = 0
      map = Map.empty
      toEmit
    }
  }
}

A more involved answer would be to create an actor corresponding to each object (cluster sharding may be of use here, especially if there are likely to be a lot of objects) and have the ingest from Pulsar send the incoming messages to the relevant actor, which then schedules a write of the latest message received to Elasticsearch.
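As a rough sketch of that approach (Akka Typed, without sharding or persistence; writeToEs and the flush interval are placeholders), each per-object actor just remembers the latest object it has seen and periodically flushes it:

import scala.concurrent.duration._
import akka.actor.typed.Behavior
import akka.actor.typed.scaladsl.Behaviors

object ObjectWriter {
  sealed trait Command
  final case class Latest(obj: MyObject) extends Command // newest version of this object
  private case object Flush extends Command              // periodic timer tick

  // writeToEs stands in for however a single document gets indexed
  def apply(writeToEs: MyObject => Unit): Behavior[Command] =
    Behaviors.withTimers { timers =>
      timers.startTimerWithFixedDelay(Flush, 30.seconds)

      def running(pending: Option[MyObject]): Behavior[Command] =
        Behaviors.receiveMessage {
          case Latest(obj) => running(Some(obj)) // supersede whatever was pending
          case Flush =>
            pending.foreach(writeToEs)           // write only the latest version seen
            running(None)
        }

      running(None)
    }
}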

One thing to be careful about with this is not committing offsets until you're sure the message (or a successor which supersedes it) has been written to Elasticsearch. If doing the actor-per-object approach, Akka Persistence may be of use: the basic strategy would be to commit the offset once the actor has acknowledged receipt (which occurs after persisting an event, e.g. to Cassandra).
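A sketch of how that ordering could look for the stream in the question, assuming Alpakka's ElasticsearchFlow.createWithPassThrough and that pulsar4s's CommittableMessage.ack returns a Future (check the exact signatures for your versions):

source
  .map { committable =>
    val obj = MyObject.parseFrom(committable.message.value)
    WriteMessage
      .createIndexMessage(obj.id, JsonUtil.toJson(obj))
      .withPassThrough(committable)   // carry the committable through the ES flow
  }
  .via(
    ElasticsearchFlow.createWithPassThrough(
      indexName = "myindex",
      typeName = "_doc",
      settings = ElasticsearchWriteSettings.Default,
      StringMessageWriter
    )
  )
  .mapAsync(1) { result =>
    require(result.success)           // fail the stream rather than ack a failed write
    result.message.passThrough.ack()  // acknowledge to Pulsar only after the write succeeded
  }
  .run()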

Levi Ramsey