Apache Pulsar message delivery semantics

Question

I went through the Apache Pulsar documentation on message delivery semantics. The delivery semantics it mentions (at-least-once, at-most-once, and effectively-once) are described for Pulsar Functions. If we don't use Pulsar Functions, which delivery semantics are available?

miguno · Accepted Answer · 2022-07-15T17:31:40.393

TL;DR: Today, neither Pulsar Functions, Pulsar+Spark (you will see duplicates), nor Pulsar+Flink (you will see duplicates) support effectively-once semantics, aka exactly-once semantics. Only in certain edge cases can you manually implement such semantics with a DIY setup. What Pulsar does support today are (1) at-most-once semantics = you may lose data and (2) at-least-once semantics = you will not lose data but may see duplicates.
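To make the difference between (1) and (2) concrete, here is a minimal in-memory sketch. It does not use the Pulsar client; `ToyBroker` and `consume` are hypothetical names, and the only assumption modeled is that a message stays pending until it is acknowledged. The consumer crashes once while handling one message and then restarts.

```python
class ToyBroker:
    """In-memory stand-in for a subscription: unacked messages are redelivered."""

    def __init__(self, messages):
        self.pending = list(messages)  # head of the list is delivered until acked

    def receive(self):
        return self.pending[0] if self.pending else None

    def ack(self, msg):
        self.pending.remove(msg)


def consume(broker, ack_first, crash_on):
    """Consume everything, crashing once while handling `crash_on`, then restarting."""
    processed = []
    crashed = False
    while (msg := broker.receive()) is not None:
        try:
            if ack_first:
                broker.ack(msg)  # at-most-once: acknowledge before processing
            if msg == crash_on and not crashed:
                crashed = True
                if not ack_first:
                    processed.append(msg)  # side effect happened, ack never sent
                raise RuntimeError("consumer crashed")
            processed.append(msg)
            if not ack_first:
                broker.ack(msg)  # at-least-once: acknowledge after processing
        except RuntimeError:
            continue  # simulated restart; anything unacked is delivered again
    return processed


at_most_once = consume(ToyBroker(["a", "b", "c"]), ack_first=True, crash_on="b")
at_least_once = consume(ToyBroker(["a", "b", "c"]), ack_first=False, crash_on="b")
print(at_most_once)   # "b" is missing: it was acked, then lost in the crash
print(at_least_once)  # "b" appears twice: it was processed, then redelivered
```

Acknowledging first trades duplicates for possible loss; acknowledging last does the opposite. Neither ordering alone yields exactly-once.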

Regarding (3) effectively-once support: I can certainly imagine that you have been confused. Despite claims in the Pulsar documentation to support effectively-once semantics, and several (misleading, unfortunately) blog articles on the subject (example), Pulsar in fact does not support this. What Pulsar does support is an idempotent producer and deduplication of messages. This functionality is indeed required but -- and this is the important aspect -- not sufficient for exactly-once semantics. The current functionality only works when producing one message and to only one partition. For example, you cannot atomically produce multiple messages to one partition with Pulsar today, let alone multiple partitions. It also means that interaction with state (e.g., for aggregating data like counting, performing joins between data streams) is not exactly-once.
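A toy model may help show why deduplication is necessary but not sufficient. This is a sketch, not Pulsar code: `ToyPartition`, `transfer`, and the payload strings are made up, and the deduplication rule (drop a message whose sequence ID is not higher than the last one accepted from that producer) is a simplification. Retrying a single message is safe, but a write that spans two partitions can be observed half-done after a crash.

```python
class ToyPartition:
    """Toy log that drops messages it has already accepted from a producer
    (a simplified model of broker-side deduplication)."""

    def __init__(self):
        self.log = []
        self.last_seq = {}  # producer name -> highest sequence ID accepted

    def send(self, producer, seq, payload):
        if seq <= self.last_seq.get(producer, -1):
            return False  # duplicate of an earlier send: dropped
        self.last_seq[producer] = seq
        self.log.append(payload)
        return True


p0, p1 = ToyPartition(), ToyPartition()

# Case 1: retrying a single message is safe, the duplicate is dropped.
p0.send("producer-a", 0, "debit account X")
p0.send("producer-a", 0, "debit account X")  # retry after a lost ack
single_message_log = list(p0.log)


# Case 2: a two-message write is not atomic. A crash between the sends
# leaves one partition updated and the other not.
def transfer(crash_between):
    p0.send("producer-a", 1, "debit account Y")
    if crash_between:
        raise RuntimeError("producer crashed")
    p1.send("producer-a", 0, "credit account Z")


try:
    transfer(crash_between=True)
except RuntimeError:
    pass

visible_after_crash = (list(p0.log), list(p1.log))
print(single_message_log)
print(visible_after_crash)  # the debit is visible, the credit is not
```

Closing the gap in case 2 is what transactions are for: either both writes become visible or neither does.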

What's missing, and when will Pulsar support exactly-once semantics? To guarantee exactly-once semantics, Pulsar must first add support for transactions. This is indeed a planned feature, originally targeted for Pulsar 2.6.0 (released in June 2020), but as of today there is still a lot of work left to be done. I am not aware of an updated ETA, I'm afraid.

Where to learn more: A good Pulsar-specific source to understand this in more detail is the Dec 2019 presentation Apache Pulsar: Transactions Preview by Pulsar committers that summarizes the current lack of exactly-once support and explains why support for transactions in Pulsar is required to achieve it.

Another good source to understand this tricky subject in general is this 3-part article series on how exactly-once semantics are provided by Apache Kafka (blog series part1, part2, part3), which is a technology similar to Apache Pulsar. The series explains why idempotent producers are just one piece of the puzzle, and why transactions are needed (which utilize the former), and how this was designed and implemented in Apache Kafka, and released back in 2017. That's why you benefit from exactly-once semantics when processing data in Kafka with e.g. Kafka Streams (included in Kafka) or with Kafka and Apache Flink. If you look at Pulsar's plans and roadmap in 2020 to introduce exactly-once support, you can clearly see the very close parallels to Kafka's approach. As a user, the notable difference is that Kafka released all the functionality in one go (which also explains why it took the Kafka community several years to design, build, and test the feature), rather than piece-by-piece, which has made it much clearer to understand what is actually supported vs. what is not.

score 3 · Answer 2 · answered May 01 '20 at 09:49

Pulsar provides at-least-once semantics. It can also deduplicate writes to its log (termed idempotent production), and effectively-once consumption can be synthesized using an external data store (as with other messaging systems). For self-sufficient effectively/exactly-once processing, for example to do stream processing, you'd need to use Kafka or Flink.
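As a rough sketch of what "synthesized using an external data store" can look like, the example below uses an in-memory SQLite database as the external store. The table names, the `handle` function, and the message IDs are assumptions for illustration, not part of any Pulsar API. The key point is that the message's effect and the record of its ID are committed in one transaction, so a redelivered message is detected and skipped.

```python
import sqlite3


def make_store():
    """Create the hypothetical external store: processed IDs plus a running total."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE processed (msg_id TEXT PRIMARY KEY)")
    db.execute("CREATE TABLE totals (name TEXT PRIMARY KEY, value INTEGER)")
    db.execute("INSERT INTO totals VALUES ('sum', 0)")
    db.commit()
    return db


def handle(db, msg_id, amount):
    """Apply the message's effect and record its ID in a single transaction."""
    try:
        with db:  # commits on success, rolls back if an exception is raised
            db.execute("INSERT INTO processed VALUES (?)", (msg_id,))
            db.execute(
                "UPDATE totals SET value = value + ? WHERE name = 'sum'",
                (amount,),
            )
        return True
    except sqlite3.IntegrityError:
        return False  # ID already recorded: this is a redelivery, skip it


db = make_store()
# At-least-once delivery: "m2" and "m1" arrive a second time.
deliveries = [("m1", 10), ("m2", 5), ("m2", 5), ("m3", 1), ("m1", 10)]
applied = [handle(db, msg_id, amount) for msg_id, amount in deliveries]
total = db.execute("SELECT value FROM totals WHERE name = 'sum'").fetchone()[0]
print(applied, total)
```

This only works when the effect lives in the same store as the deduplication record. Effects that leave that store (sending an email, writing to another topic) are not covered.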

Ben Stopford

Where can I find these info? Can you add the documentation link as well.? – Balasubramanian Naagarajan May 01 '20 at 15:14

Kafka's Streams API to process data in Kafka exactly-once: https://kafka.apache.org/documentation/streams/ – miguno Jul 03 '20 at 12:48

Flink to process data in Kafka exactly-once: https://ci.apache.org/projects/flink/flink-docs-stable/dev/connectors/kafka.html – miguno Jul 03 '20 at 12:49

score -1 · Answer 3 · answered May 07 '20 at 15:25

You can implement all of the message delivery semantics you listed including at-least once, at-most once, and effectively-once.

For at-most once, you would use an exclusive subscription type to ensure that only one consumer gets the message, and have your consumer acknowledge all messages it receives regardless of whether an exception occurs or not.

For effectively-once, you would use an exclusive subscription type to ensure that only one consumer gets the message, and only send an acknowledgment if you are able to successfully process the message (i.e. no exceptions, etc.). Otherwise, you would negatively ack the message to have it redelivered.

All other combinations of behavior would fall under the at-least once delivery guarantee.

https://pulsar.apache.org/docs/en/2.5.1/concepts-messaging/#consumers

David Kjerrumgaard

What you describe as "effectively-once" seems to be "at-least-once": if the message was processed successfully and the client fails before it can send the ack (or the client sends the ack but it is lost), the message would be delivered to the client again after recovery, from my understanding. – Matthias J. Sax Jul 07 '20 at 18:45
You are correct that the message would be re-delivered, which can be easily handled by the use of producer idempotency as described in this blog (https://www.splunk.com/en_us/blog/it/effectively-once-semantics-in-apache-pulsar.html) – David Kjerrumgaard Jul 08 '20 at 20:46
How can _producer_ idempotency avoid reading a message multiple times? If I have a simple consumer application, there is no producer. -- And if the message is re-delivered, it's at-least-once. – Matthias J. Sax Jul 08 '20 at 20:51
An application that needs to process messages only once using the consumer will have to rely on some external system to perform the final deduplication. With the Pulsar Reader API, the application can store the message ID associated with the last successfully processed message in an external system and read that value to know where to resume processing from. By producer idempotency, I mean that the system as a whole needs to be able to identify and discard messages that have already been published and prevent them from being retransmitted (after the recovery). – David Kjerrumgaard Jul 10 '20 at 00:39
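The resume-from-stored-position idea in this comment can be sketched as follows. This is an in-memory model, not the Pulsar Reader API: the log is a plain list, message IDs are list indexes (real Pulsar message IDs are not integers), and `run_reader` and `store` are hypothetical names. It assumes the external system can write the effect and the last processed ID in one atomic step; without that, a crash between the two writes brings back duplicates or loss.

```python
def run_reader(log, store, crash_after=None):
    """Read `log` starting after store['last_id'], applying each message once."""
    start = store["last_id"] + 1
    for msg_id in range(start, len(log)):
        # Effect and position are written together. In a real system this
        # has to be a single atomic write to the external store.
        store.update(total=store["total"] + log[msg_id], last_id=msg_id)
        if crash_after == msg_id:
            raise RuntimeError("reader crashed")


log = [10, 5, 1, 4]
store = {"last_id": -1, "total": 0}  # stands in for the external system

try:
    run_reader(log, store, crash_after=1)  # crashes after handling message 1
except RuntimeError:
    pass

run_reader(log, store)  # restart: resumes at message 2, nothing is re-applied
print(store)
```

Because the position travels with the result, the restart neither skips nor repeats a message. This is the same external-store deduplication described above, keyed on position instead of on a set of seen IDs.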
