I have to write a simple service (the Record ingester service) that consumes messages from Apache Pulsar and stores them in Elasticsearch. For that I am using com.sksamuel.pulsar4s.akka.
The messages on Pulsar are produced by another service, the Record pump service. The two services are to be deployed separately in production.
Here is my source:
private val source = committableSource(consumerFn)
The above code works fine: it is able to consume messages from Pulsar and write them to ES. However, I am not sure whether we should be using MessageId.earliest when creating the source:
private val source = committableSource(consumerFn, Some(MessageId.earliest))
While testing, I found pros and cons for both approaches, that is, with and without MessageId.earliest, but in my opinion neither of them is suitable for production.
1. Without using MessageId.earliest:
a. This adds the constraint that the Record ingester service has to be up before we start the Record pump service.
b. If my Record ingester service goes down (due to an issue or for maintenance), the messages produced on Pulsar by the Record pump service in the meantime are not consumed once the ingester service is back up. In other words, messages produced while the ingester service is down never get consumed.
So I think the logic is that only messages put on Pulsar AFTER the consumer has subscribed to the topic will be consumed.
But I don't think this is acceptable in production, for the reasons given in points a and b.
2. With MessageId.earliest: points a and b above are solved, but:
Whenever my Record ingester service comes back up (after downtime or maintenance), it starts consuming all messages from the very beginning. Records with the same id get overwritten on the ES side, so it does no real harm, but I still don't think this is the right way: there would be millions of messages on that topic, and every restart would re-consume messages that have already been consumed, which is a waste.
This, too, seems unacceptable to me in production.
Can anyone please help me find a configuration that solves both cases? I tried various configurations, such as subscriptionInitialPosition = Some(SubscriptionInitialPosition.Earliest), but had no luck.
Complete code:
// consumer
private val consumerFn = () =>
  pulsarClient.consumer(
    ConsumerConfig(
      subscriptionName = Subscription.generate,
      topics = Seq(statementTopic),
      subscriptionType = Some(SubscriptionType.Shared)
    )
  )

// create source
private val source = committableSource(consumerFn)

// create intermediate flow
private val intermediateFlow = Flow[CommittableMessage[Array[Byte]]].map {
  committableSourceMessage =>
    val message = committableSourceMessage.message
    val obj: MyObject = MyObject.parseFrom(message.value)
    WriteMessage.createIndexMessage(obj.id, JsonUtil.toJson(obj))
}.via(
  ElasticsearchFlow.create(
    indexName = "myindex",
    typeName = "_doc",
    settings = ElasticsearchWriteSettings.Default,
    StringMessageWriter
  )
)

source.via(intermediateFlow).run()
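
One configuration that may be worth testing, given the consumer above: as far as I understand pulsar4s, Subscription.generate returns a fresh random subscription name on every call, so each restart would look like a brand-new subscription to Pulsar. Pulsar tracks the consumption position per subscription name, which would explain both behaviours described above. The following is only a sketch of a consumer config with a fixed subscription name; the subscription name "record-ingester", the topic URL and the wrapper object are hypothetical placeholders, not values from my actual setup.

```scala
import com.sksamuel.pulsar4s.{ConsumerConfig, Subscription, Topic}
import org.apache.pulsar.client.api.{SubscriptionInitialPosition, SubscriptionType}

object IngesterConsumerConfig {
  // Hypothetical topic, for illustration only.
  val statementTopic: Topic = Topic("persistent://public/default/statements")

  val consumerConfig: ConsumerConfig = ConsumerConfig(
    // Fixed (hypothetical) name: every restart re-attaches to the same
    // durable subscription instead of creating a new one.
    subscriptionName = Subscription("record-ingester"),
    topics = Seq(statementTopic),
    subscriptionType = Some(SubscriptionType.Shared),
    // Only takes effect when the subscription is created for the first time;
    // afterwards the position stored for the subscription is used.
    subscriptionInitialPosition = Some(SubscriptionInitialPosition.Earliest)
  )

  // Intended use, mirroring the code above (not compiled here):
  //   private val consumerFn = () => pulsarClient.consumer(consumerConfig)
  //   private val source = committableSource(consumerFn) // no MessageId.earliest seek
}
```

With such a setup the source would be created without MessageId.earliest, since seeking to the earliest message on every start is what rewinds the subscription. Note also that a subscription's position only advances for acknowledged messages, so the stream would presumably have to acknowledge each message once the ES write has succeeded; that part is not shown in this sketch.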