
The ZeroMQ documentation mentions zmq_poll as a method for multiplexing multiple sockets on a single thread. Is there any benefit to polling in a thread that simply consumes data from one socket? Or should I just use zmq_recv?

For example:

/*                                                POLLING A SINGLE SOCKET */
while (true) {
   zmq::poll(&items[0], 1, -1);
   if (items[0].revents & ZMQ_POLLIN) {
      int size = zmq_recv(receiver, msg, 255, 0);
      if (size != -1) {
         // do something with msg
      }
   }
}

vs.

/*                                               NO POLLING AND BLOCKING RECV */
while (true) {
    int size = zmq_recv(receiver, msg, 255, 0);
    if (size != -1) {
        // do something with msg
    }
}

Is there ever a situation to prefer the version with polling, or should I only use it for multiplexing? Does polling result in more efficient CPU usage? Does the answer depend on the rate of messages being received?

*** Editing this post to include a toy example ***

The reason for asking this question is that I have observed that I can achieve a much higher throughput on my subscriber if I do not poll (more than an order of magnitude).

#include <thread>
#include <zmq.hpp>
#include <iostream>
#include <unistd.h>
#include <chrono>
#include <cstring>   // for memcpy

using msg_t = char[88];
using timepoint_t = std::chrono::steady_clock::time_point;
using milliseconds = std::chrono::milliseconds;
using microseconds = std::chrono::microseconds;

/* Log stats about how many packets were sent/received */
class SocketStats {
   public:
      SocketStats(const std::string& name) : m_timePrev(now()), m_socketName(name) {}
      void update() {
         m_numPackets++;
         timepoint_t timeNow = now();
         if (duration(timeNow, m_timePrev) > m_logIntervalMs) {
            uint64_t packetsPerSec = m_numPackets - m_numPacketsPrev;
            std::cout << m_socketName << " : " << "processed " << (packetsPerSec) << " packets" << std::endl;
            m_numPacketsPrev = m_numPackets;
            m_timePrev = timeNow;
         }
      }
   private:
      timepoint_t now() { return std::chrono::steady_clock::now(); }
      static milliseconds duration(timepoint_t timeNow, timepoint_t timePrev) { 
         return std::chrono::duration_cast<milliseconds>(timeNow - timePrev);
      }
      timepoint_t m_timePrev;
      uint64_t m_numPackets = 0;
      uint64_t m_numPacketsPrev = 0;
      milliseconds m_logIntervalMs = milliseconds{1000};
      const std::string m_socketName;
};

/* non-polling subscriber uses blocking receive and no poll */
void startNonPollingSubscriber(){
   SocketStats subStats("NonPollingSubscriber");
   zmq::context_t ctx(1);
   zmq::socket_t sub(ctx, ZMQ_SUB);
   sub.connect("tcp://127.0.0.1:5602");
   sub.setsockopt(ZMQ_SUBSCRIBE, "", 0);

   while (true) {
      zmq::message_t msg;
      bool success = sub.recv(&msg);
      if (success) { subStats.update(); }
   }
}

/* polling subscriber receives messages when available */
void startPollingSubscriber(){
   SocketStats subStats("PollingSubscriber");
   zmq::context_t ctx(1);
   zmq::socket_t sub(ctx, ZMQ_SUB);
   sub.connect("tcp://127.0.0.1:5602");
   sub.setsockopt(ZMQ_SUBSCRIBE, "", 0);

   zmq::pollitem_t items [] = {{static_cast<void*>(sub), 0, ZMQ_POLLIN, 0 }};

   while (true) {
      zmq::message_t msg;
      int rc = zmq::poll (&items[0], 1, -1);
      if (rc < 1) { continue; }
      if (items[0].revents & ZMQ_POLLIN) {
         bool success = sub.recv(&msg, ZMQ_DONTWAIT);
         if (success) { subStats.update(); }
      }
   }
}

void startFastPublisher() {
   SocketStats pubStats("FastPublisher");
   zmq::context_t ctx(1);
   zmq::socket_t pub(ctx, ZMQ_PUB);
   pub.bind("tcp://127.0.0.1:5602");

   while (true) {
      msg_t mymessage = {};                          // zero-initialise the dummy payload
      zmq::message_t msg(sizeof(msg_t));
      memcpy(msg.data(), &mymessage, sizeof(msg_t));
      bool success = pub.send(msg, ZMQ_DONTWAIT);    // non-blocking send of the 88-byte message
      if (success) { pubStats.update(); }
   }
}

int main() {
    std::thread t_sub1(startPollingSubscriber);
    sleep(1); 
    std::thread t_sub2(startNonPollingSubscriber);
    sleep(1);
    std::thread t_pub(startFastPublisher); 
    while(true) {
       sleep(10);
    }
}
– sadrig
2 Answers


Q : "Is there any benefit to polling in a thread that simply consumes data from one socket?"

Oh sure there is.

As a principal promoter of non-blocking designs, I always advocate designing zero-waiting .poll()-s before deciding on .recv()-calls.
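
A minimal sketch of that zero-waiting pattern, assuming the same classic cppzmq API as in the question's toy example and an already-connected SUB socket named sub:

/* SKETCH ONLY: zero-timeout poll, so the thread never blocks and stays free
   for other work whenever nothing is queued */
zmq::pollitem_t items[] = { { static_cast<void*>(sub), 0, ZMQ_POLLIN, 0 } };

while (true) {
   zmq::poll(&items[0], 1, 0);               // timeout 0 -> returns immediately
   if (items[0].revents & ZMQ_POLLIN) {
      zmq::message_t msg;
      if (sub.recv(&msg, ZMQ_DONTWAIT)) {
         // handle msg
      }
   } else {
      // nothing pending: run other duties, check shutdown flags, re-prioritise, ...
   }
}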

Q : "Does polling result in more efficient CPU usage?"

A harder one, yet I love it:

This question is decidable in two distinct manners:

a) read the source-code of both the .poll()-method and the .recv()-method, as ported onto your target platform, and guesstimate the costs of calling each vs. the other

b) test both use-cases inside your run-time ecosystem and have the hard facts micro-benchmarked in vivo.

Either way, you see the difference.

What you cannot see ATM are the add-on costs and other impacts that appear once you try (or are forced) to extend the use-case so as to accommodate other properties, not included in either the former or the latter.

Here, my principal preference to use .poll() before deciding further enables priority-based re-ordering of the actual .recv()-calls and other, higher-level decisions that neither the source-code nor the test could ever settle.
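
For illustration, a sketch of that idea with two SUB sockets (the names subHi and subLo are invented here, they are not from the post): a single poll() wakes the loop, and the caller imposes its own priority by always serving the high-priority socket before even looking at the low-priority one.

/* SKETCH ONLY: one poll() over two sockets, serviced in caller-defined priority order */
zmq::pollitem_t items[] = {
   { static_cast<void*>(subHi), 0, ZMQ_POLLIN, 0 },
   { static_cast<void*>(subLo), 0, ZMQ_POLLIN, 0 }
};

while (true) {
   zmq::poll(&items[0], 2, -1);
   zmq::message_t msg;
   if (items[0].revents & ZMQ_POLLIN) {           // the urgent feed is always served first
      if (subHi.recv(&msg, ZMQ_DONTWAIT)) { /* handle urgent msg */ }
   } else if (items[1].revents & ZMQ_POLLIN) {    // the background feed only when the urgent one is idle
      if (subLo.recv(&msg, ZMQ_DONTWAIT)) { /* handle background msg */ }
   }
}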

Do not hesitate to test first, and if the tests seem inconclusive (on your scale of { low | ultra-low }-latency sensitivity), dig deeper into the source-code to see why.

– user3666197
    The reason I asked this question is because I had originally opted for the polling approach, and I have recently observed in my use case that my subscriber is able to achieve a much higher throughput if I instead do NOT poll and just use a blocking receive. I am able to reproduce this with a simple toy example: https://gist.github.com/drigie/f0735e9e70028adabaabad54f9104954 In this case, the non-polling subscriber is able to process about 30-35X more messages per second. I have edited my original post to include the toy example – sadrig Oct 10 '20 at 19:26
  • The toy code posted above yields a throughput of about 30K messages/sec for the polling subscriber and >1,000,000 messages/sec for the non-polling subscriber on a 2017 MacBook Pro. When no messages are being published, even the non-polling subscriber seems to result in negligible CPU utilization – sadrig Oct 10 '20 at 19:34
  • No surprise, a 1st headbang appears the very first moment a blocking-.recv() locks you & your processing turns into an unmanageably halted state (one that no one seriously designing systems ever wants to find oneself in). The same applies to use-cases where more things have to happen within some soft-constrained flow of time (a soft-RTOS orchestration), where a secure, guaranteed non-blocking / self-resilient modus operandi is always way more important than the number of messages passed under some optimistic, sunny & blue-sky conditions :o) but gets inadvertently self-deadlocked in real-world usage – user3666197 Oct 10 '20 at 20:10
  • If hunting performance, you may increase the number of I/O-threads in each Context()-instance (a sketch follows after these comments), so that the processing goes full steam, concurrently, instead of mutually competing on a single I/O-thread. More I/O-threads may help in tests, yet robustness, not losing control and dead-lock avoidance are way more important in real-world systems. Anyway, enjoy the Zen-of-Zero & stay tuned ( if testing just theoretical benchmarking horizons, you might like Martin SUSTRIK's younger product, the nanomsg ... ) – user3666197 Oct 10 '20 at 20:14
  • For my use case, the overhead of polling mentioned above does seem to be *by far* the bottleneck and prevents achieving the desired message rates. Increasing the number of I/O threads does not help either. Without polling, I can easily achieve the needed throughput, but it sounds like the consensus is that this is not a viable design? I tested the CPU utilization in both cases, and the polling and non-polling subscribers have virtually the same load regardless of message rate – sadrig Oct 10 '20 at 21:29
  • Given the message-data is already moved and stored inside the SUB-side Context()-instance's buffer, the only "work" here is to pass a pointer ( + do some counter increments over circular-buffered memory-maps ). No magic there. Problems start when going over a wire and losing messages, or being subject to a LoS and other adverse conditions. Memory-mapped messaging could even be stripped down to become stack-less on localhost, going pure-pointer-wise over a pure memory-mapping when using the inproc:// transport-class. Not performance, but robustness / security-certification needs decide – user3666197 Oct 10 '20 at 22:14
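
Regarding the I/O-thread suggestion in the comments above: the thread count is the first argument of the zmq::context_t constructor, so the change is a one-liner. A minimal sketch (the value 4 is purely illustrative):

zmq::context_t ctx(4);                        // 4 I/O threads instead of the 1 used in the toy example
zmq::socket_t sub(ctx, ZMQ_SUB);
sub.connect("tcp://127.0.0.1:5602");
sub.setsockopt(ZMQ_SUBSCRIBE, "", 0);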

Typically you will get the best performance by draining the socket before doing another poll. In other words, once poll returns, you read until read returns "no more data", at which point you call poll again.

For an example, see https://github.com/nyfix/OZ/blob/4627b0364be80de4451bf1a80a26c00d0ba9310f/src/transport.c#L524

There are trade-offs with this approach (mentioned in the code comments), but if you're only reading from a single socket you can ignore these.
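
A minimal sketch of this drain-after-poll pattern, written in the question's cppzmq style rather than the C of the linked transport.c, and assuming a connected socket named sub:

zmq::pollitem_t items[] = { { static_cast<void*>(sub), 0, ZMQ_POLLIN, 0 } };

while (true) {
   zmq::poll(&items[0], 1, -1);              // block until at least one message arrives
   if (items[0].revents & ZMQ_POLLIN) {
      while (true) {                         // drain everything that is already queued
         zmq::message_t msg;
         if (!sub.recv(&msg, ZMQ_DONTWAIT)) {
            break;                           // queue empty -> go back to poll()
         }
         // handle msg
      }
   }
}

This way the cost of a poll() call is paid once per batch of queued messages rather than once per message.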

– WallStProg