
I'm using Pulsar for communication between services, and I'm experiencing flakiness in a fairly simple test of producers and consumers.

In a JUnit 4 test, I spin up (my own wrappers around) a ZooKeeper server, a BookKeeper bookie, and a PulsarService; the configuration should be fairly standard.

The test can be summarized in the following steps:

  1. build a producer;
  2. build a consumer (say, a reader of a Pulsar topic);
  3. check the message backlog (using precise backlog);
    • this is done by getting the current subscription via PulsarAdmin#topics#getStats#subscriptions
    • I expect it to be 0, as nothing has been sent on the topic yet, but sometimes it is 1; this seems to be a separate problem, though...
  4. build a new producer and synchronously send a message onto the topic;
  5. build a new consumer and read the messages on the topic;
    • I expect a backlog of one message, and I actually read one
  6. build a new producer and synchronously send four messages;
  7. fetch the messages again, using the message ID read at step 5 as the start message ID;
    • I expect a backlog of four messages here, and most of the time this value is correct, but running the test about ten times I consistently get a 2 or a 5
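
The steps above can be sketched roughly as follows. This is a minimal sketch, not the actual test code: the broker URLs and topic name are assumptions, and it uses the plain Pulsar client/admin builders rather than the wrappers mentioned in the question.

```java
import org.apache.pulsar.client.admin.PulsarAdmin;
import org.apache.pulsar.client.api.*;

public class BacklogFlakinessSketch {
    public static void main(String[] args) throws Exception {
        String topic = "persistent://public/default/flaky-test"; // hypothetical name
        try (PulsarClient client = PulsarClient.builder()
                 .serviceUrl("pulsar://localhost:6650").build();
             PulsarAdmin admin = PulsarAdmin.builder()
                 .serviceHttpUrl("http://localhost:8080").build()) {

            // Steps 1-3: build a producer and a reader, then check the precise backlog.
            Producer<byte[]> p1 = client.newProducer().topic(topic).create();
            Reader<byte[]> r1 = client.newReader().topic(topic)
                .startMessageId(MessageId.earliest).create();
            long backlog = admin.topics()
                .getStats(topic, true /* precise backlog */)
                .getSubscriptions().values().iterator().next().getMsgBacklog();
            // expected to be 0 here, since nothing has been sent yet

            // Step 4: synchronously send one message.
            p1.send("m1".getBytes());

            // Step 5: read the message back and keep its ID.
            Message<byte[]> read = r1.readNext();

            // Step 6: synchronously send four more messages.
            for (int i = 2; i <= 5; i++) {
                p1.send(("m" + i).getBytes());
            }

            // Step 7: read again, starting from the message ID of step 5;
            // the backlog reported at this point is the flaky value (expected 4).
            Reader<byte[]> r2 = client.newReader().topic(topic)
                .startMessageId(read.getMessageId()).create();
        }
    }
}
```

Running this requires a live broker (e.g. the embedded PulsarService from the question), so it is not runnable standalone.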

I tried debugging the test, but I cannot figure out where those values come from; did I misunderstand something?

NiccoMlt

1 Answer


Things you can try if not already done:

  • use precise backlog when fetching the topic stats, so the broker counts the exact number of unconsumed messages instead of returning an estimate;
  • check whether producer-side batching is involved: with batching enabled, the backlog may count batch entries rather than individual messages.
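
For instance, both checks can be sketched as below. This is a sketch assuming the admin/producer builders of recent 2.x Java clients; the class and method names are illustrative.

```java
import org.apache.pulsar.client.admin.PulsarAdmin;
import org.apache.pulsar.client.api.Producer;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.common.policies.data.TopicStats;

class BacklogChecks {
    static long preciseBacklog(PulsarAdmin admin, String topic) throws Exception {
        // Request the precise backlog: the broker counts the exact number of
        // unconsumed messages instead of returning the cached estimate.
        TopicStats stats = admin.topics().getStats(topic, true);
        return stats.getSubscriptions().values().stream()
                .mapToLong(s -> s.getMsgBacklog()).sum();
    }

    static Producer<byte[]> unbatchedProducer(PulsarClient client, String topic)
            throws Exception {
        // With batching disabled, each send() produces a single entry, so
        // backlog figures count messages rather than batch entries.
        return client.newProducer().topic(topic).enableBatching(false).create();
    }
}
```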

Christophe Bornet
  • Yes, I am already using precise backlog (I'll update the question to add this; it is in fact an important detail, thanks). As for batching, I'm not sure and will certainly check, but I'm doubtful: if the backlog returned the number of batches, it should _always_ return 2, not sometimes 2, sometimes 5, and most of the time 4. Do you agree? – NiccoMlt Dec 02 '21 at 08:01
  • Hi, I also tried disabling batching on the producer side, and now the flaky test is not flaky anymore: in fact, it consistently fails with a backlog of 5 instead of 4. – NiccoMlt Dec 22 '21 at 08:48
  • I think we managed to find the problem in that test: because batching was enabled, and our abstraction layer on Pulsar always uses asynchronous sends to publish to the broker, the messages may be batched differently across runs. The values we observed can be explained as follows: 2 — all the messages were batched except the first one, which had a consume operation in between (so 1, [2-5]); 4 — only the last two messages (the closest in time) were batched (so 1, 2, 3, [4-5]); 5 — none of the messages were batched. – NiccoMlt Dec 24 '21 at 09:44
  • Also, because of how our abstraction layer works (it doesn't send acknowledgements, but stores the last message ID to read from) and because of the way the backlog message count is resolved, it makes sense to get a value of 5 when disabling batching: the admin instance is not given a message to start reading from, so the backlog counts all the messages produced in the test. Does that make sense to you, @ChristopheBornet? – NiccoMlt Dec 24 '21 at 09:50
  • It's a bit hard to say without seeing the code, but I'd say it makes sense. – Christophe Bornet Jan 12 '22 at 12:41
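
The batching effect described in the comments can be illustrated with a plain simulation (no Pulsar involved): if a backlog counts entries rather than messages, each batch contributes one unit. The batch layouts below are hypothetical groupings of the five messages from the test.

```java
import java.util.List;

public class BatchBacklogSimulation {
    // A backlog that counts entries counts one unit per batch, not per message.
    static int backlogEntries(List<List<Integer>> batches) {
        return batches.size();
    }

    public static void main(String[] args) {
        // Five messages sent asynchronously; different runs group them differently.
        // Run A: the first message alone (a read happened in between), rest in one batch.
        List<List<Integer>> runA = List.of(List.of(1), List.of(2, 3, 4, 5));
        // Run B: only the last two messages (the closest in time) were batched.
        List<List<Integer>> runB =
            List.of(List.of(1), List.of(2), List.of(3), List.of(4, 5));
        // Run C: no batching at all.
        List<List<Integer>> runC =
            List.of(List.of(1), List.of(2), List.of(3), List.of(4), List.of(5));

        System.out.println(backlogEntries(runA)); // 2
        System.out.println(backlogEntries(runB)); // 4
        System.out.println(backlogEntries(runC)); // 5
    }
}
```

This matches the 2 / 4 / 5 values observed in the test once batching is taken into account.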