
I'm using Pulsar for communication between services, and I'm experiencing flakiness in a fairly simple test of producers and consumers.

In a JUnit 4 test, I spin up (my own wrappers around) a ZooKeeper server, a BookKeeper bookie, and a PulsarService; the configuration should be fairly standard.

The test can be summarized in the following steps:

  1. build a producer;
  2. build a consumer (say, a reader of a Pulsar topic);
  3. check the message backlog (using precise backlog);
    • this is done by getting the current subscription via PulsarAdmin#topics#getStats#subscriptions
    • I expect it to be 0, as nothing has been sent on the topic yet, but sometimes it is 1; this seems to be a separate problem, though...
  4. build a new producer and synchronously send a message onto the topic;
  5. build a new consumer and read the messages on the topic;
    • I expect a backlog of one message, and I actually read one
  6. build a new producer and synchronously send four messages;
  7. fetch the messages again, using the message ID read at step 5 as the start message ID;
    • I expect a backlog of four messages here, and most of the time this value is correct, but running the test about ten times I consistently get a 2 or a 5
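
The steps above can be sketched roughly as follows. This is a minimal sketch, not the actual test code: the broker URLs and topic name are assumptions, and it uses the plain Pulsar client/admin builders rather than the wrappers mentioned in the question.

```java
import org.apache.pulsar.client.admin.PulsarAdmin;
import org.apache.pulsar.client.api.*;

public class BacklogFlakinessSketch {
    public static void main(String[] args) throws Exception {
        String topic = "persistent://public/default/flaky-test"; // hypothetical name
        try (PulsarClient client = PulsarClient.builder()
                 .serviceUrl("pulsar://localhost:6650").build();
             PulsarAdmin admin = PulsarAdmin.builder()
                 .serviceHttpUrl("http://localhost:8080").build()) {

            // Steps 1-3: build a producer and a reader, then check the precise backlog.
            Producer<byte[]> p1 = client.newProducer().topic(topic).create();
            Reader<byte[]> r1 = client.newReader().topic(topic)
                .startMessageId(MessageId.earliest).create();
            long backlog = admin.topics()
                .getStats(topic, true /* precise backlog */)
                .getSubscriptions().values().iterator().next().getMsgBacklog();
            // expected to be 0 here, since nothing has been sent yet

            // Step 4: synchronously send one message.
            p1.send("m1".getBytes());

            // Step 5: read the message back and keep its ID.
            Message<byte[]> read = r1.readNext();

            // Step 6: synchronously send four more messages.
            for (int i = 2; i <= 5; i++) {
                p1.send(("m" + i).getBytes());
            }

            // Step 7: read again, starting from the message ID of step 5;
            // the backlog reported at this point is the flaky value (expected 4).
            Reader<byte[]> r2 = client.newReader().topic(topic)
                .startMessageId(read.getMessageId()).create();
        }
    }
}
```

Running this requires a live broker (e.g. the embedded PulsarService from the question), so it is not runnable standalone.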

I tried debugging the test, but I cannot figure out where those values come from; did I misunderstand something?

NiccoMlt

1 Answer


Things you can try if not already done:

  • use precise backlog when fetching the topic stats, so the broker counts the exact number of unconsumed messages instead of returning an estimate;
  • check whether producer-side batching is involved: with batching enabled, the backlog may count batch entries rather than individual messages.
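
For instance, both checks can be sketched as below. This is a sketch assuming the admin/producer builders of recent 2.x Java clients; the class and method names are illustrative.

```java
import org.apache.pulsar.client.admin.PulsarAdmin;
import org.apache.pulsar.client.api.Producer;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.common.policies.data.TopicStats;

class BacklogChecks {
    static long preciseBacklog(PulsarAdmin admin, String topic) throws Exception {
        // Request the precise backlog: the broker counts the exact number of
        // unconsumed messages instead of returning the cached estimate.
        TopicStats stats = admin.topics().getStats(topic, true);
        return stats.getSubscriptions().values().stream()
                .mapToLong(s -> s.getMsgBacklog()).sum();
    }

    static Producer<byte[]> unbatchedProducer(PulsarClient client, String topic)
            throws Exception {
        // With batching disabled, each send() produces a single entry, so
        // backlog figures count messages rather than batch entries.
        return client.newProducer().topic(topic).enableBatching(false).create();
    }
}
```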

Christophe Bornet
  • Yes, I am already using precise backlog (I'll update the question to add this; it is in fact an important detail, thanks). As for batching, I'm not sure and will certainly check, but I'm doubtful: if the backlog returned the number of batches, it should _always_ return 2, not sometimes 2, sometimes 5, and most of the time 4. Do you agree? – NiccoMlt Dec 02 '21 at 08:01
  • Hi, I also tried disabling batching on the producer side, and now the flaky test is not flaky anymore: in fact, it consistently fails with a backlog of 5 instead of 4. – NiccoMlt Dec 22 '21 at 08:48
  • I think we managed to find the problem in that test: because batching was enabled, and our abstraction layer on Pulsar always uses asynchronous sends to publish to the broker, the messages may be batched differently across runs. The values we observed can be explained as follows: 2 — all the messages were batched except the first one, which had a consume operation in between (so 1, [2-5]); 4 — only the last two messages (the closest in time) were batched (so 1, 2, 3, [4-5]); 5 — none of the messages were batched. – NiccoMlt Dec 24 '21 at 09:44
  • Also, because of how our abstraction layer works (it doesn't send acknowledgements, but stores the last message ID to read from) and because of the way the backlog message count is resolved, it makes sense to get a value of 5 when disabling batching: the admin instance is not given a message to start reading from, so the backlog counts all the messages produced in the test. Does that make sense to you, @ChristopheBornet? – NiccoMlt Dec 24 '21 at 09:50
  • It's a bit hard to say without seeing the code, but I'd say it makes sense. – Christophe Bornet Jan 12 '22 at 12:41
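
The batching effect described in the comments can be illustrated with a plain simulation (no Pulsar involved): if a backlog counts entries rather than messages, each batch contributes one unit. The batch layouts below are hypothetical groupings of the five messages from the test.

```java
import java.util.List;

public class BatchBacklogSimulation {
    // A backlog that counts entries counts one unit per batch, not per message.
    static int backlogEntries(List<List<Integer>> batches) {
        return batches.size();
    }

    public static void main(String[] args) {
        // Five messages sent asynchronously; different runs group them differently.
        // Run A: the first message alone (a read happened in between), rest in one batch.
        List<List<Integer>> runA = List.of(List.of(1), List.of(2, 3, 4, 5));
        // Run B: only the last two messages (the closest in time) were batched.
        List<List<Integer>> runB =
            List.of(List.of(1), List.of(2), List.of(3), List.of(4, 5));
        // Run C: no batching at all.
        List<List<Integer>> runC =
            List.of(List.of(1), List.of(2), List.of(3), List.of(4), List.of(5));

        System.out.println(backlogEntries(runA)); // 2
        System.out.println(backlogEntries(runB)); // 4
        System.out.println(backlogEntries(runC)); // 5
    }
}
```

This matches the 2 / 4 / 5 values observed in the test once batching is taken into account.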