
I've got an app sending messages on an epgm PUB socket to one or more epgm SUB sockets. Things mostly work, but if a subscribing application is left up long enough, it will generally end up missing one or more messages. (My messages carry sequence numbers, so I can tell when any are missing or out of order.) Based on my reading of the ZMQ docs, I would have thought that the "reliable multicast" nature of epgm would prevent this: once a SUB socket receives one message, it's guaranteed to keep receiving them until shutdown or until major network trouble (i.e., the connection is saturated).

Anyway, that's the context, but the question is simply the title: What reliability guarantees (if any) does ZMQ make for PUB/SUB over epgm?
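For context, here is a minimal sketch (my reconstruction, not the asker's actual code) of the setup described: a PUB that stamps each message with a sequence number and a SUB that detects gaps. epgm:// endpoints need a libzmq built against OpenPGM, so this sketch runs over inproc:// purely so it is self-contained; only the endpoint string would differ (the asker's form is `epgm://[IP of local interface];[multicast group IP]:[multicast port]`).

```python
# Sketch of a sequence-numbered PUB/SUB pair; inproc:// stands in for epgm://.
import time
import zmq

ctx = zmq.Context.instance()

pub = ctx.socket(zmq.PUB)
pub.bind("inproc://seqdemo")

sub = ctx.socket(zmq.SUB)
sub.connect("inproc://seqdemo")
sub.setsockopt(zmq.SUBSCRIBE, b"")   # subscribe to all messages
sub.setsockopt(zmq.RCVTIMEO, 2000)   # don't block forever if messages drop

time.sleep(0.2)  # give the subscription time to reach the publisher

for seq in range(5):
    # 8-byte big-endian sequence number, then the payload
    pub.send(seq.to_bytes(8, "big") + b"payload")

gaps = []
expected = 0
for _ in range(5):
    try:
        msg = sub.recv()
    except zmq.Again:
        break
    seq = int.from_bytes(msg[:8], "big")
    if seq != expected:
        gaps.append((expected, seq))
    expected = seq + 1

pub.close()
sub.close()
```

Over inproc no messages are lost, so `gaps` stays empty; over epgm this is exactly the kind of check that reveals the drops described above.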

scott
  • Are you setting a high watermark in the publisher? – Steve-o Apr 04 '13 at 20:16
  • Just using the default. Message rate is not high: two messages per second, each 10 B–160 KB in one frame (average message size is 80 KB). Figured 1000 was more than enough. – scott Apr 04 '13 at 21:44
  • Have you verified the back channel is operating correctly: are the sender and receiver using the correct network interfaces? You can follow the PGM protocol in WireShark for example. – Steve-o Apr 04 '13 at 22:32
  • I haven't verified the back channel, I don't really know how to do that. But I'll see if I can figure it out in morning. I am definitely setting the network interface to use on both client and server using the format epgm://[IP of local interface];[multicast group IP]:[multicast port], if that's related to what you're talking about. – scott Apr 04 '13 at 23:19

2 Answers


The PGM implementation within ZeroMQ uses an in-memory window for recovery, so recovery is only short-lived. If recovery fails because the window is exhausted (for example, the publisher sends faster than a recovery can complete), the underlying PGM socket resets and continues at best effort.

This means that at high data rates, or under significant packet loss, the transport will be constantly resetting and you will drop messages that cannot be recovered: hence reliable delivery is not guaranteed.

The PGM configuration is targeted at real-time broadcast, such that slow receivers cannot stall the sender. The protocol itself supports both paradigms, but the fully reliable, receiver-paced mode has not been implemented due to lack of demand.
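A hedged sketch of the knobs involved here: ZMQ_RATE caps the PGM send rate (in kbit/s) and ZMQ_RECOVERY_IVL sets how long sent data is kept in the in-memory window for NAK-based recovery (milliseconds in modern libzmq; it was seconds in the 2.x era of this question). Both must be set before bind/connect. The epgm bind is commented out because it requires an OpenPGM-enabled libzmq build, and the interface/group values are placeholders.

```python
import zmq

ctx = zmq.Context.instance()
pub = ctx.socket(zmq.PUB)

pub.setsockopt(zmq.RATE, 10000)          # 10 Mbit/s rate limit
pub.setsockopt(zmq.RECOVERY_IVL, 10000)  # keep ~10 s of sent data recoverable

# Rough window size: 10000 kbit/s / 8 bits-per-byte * 10 s = 12.5 MB buffered.
# Loss that outlives this window forces a PGM reset, and those messages are gone.

rate = pub.getsockopt(zmq.RATE)
ivl = pub.getsockopt(zmq.RECOVERY_IVL)

# pub.bind("epgm://eth0;239.192.1.1:5555")  # placeholder; needs OpenPGM build
pub.close()
```

This is the trade-off discussed in the comments below: a larger recovery window costs memory but lets the transport ride out longer loss episodes before it resets.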

Steve-o
  • What are the practical limitations on the recovery window? I set ZMQ_RECOVERY_IVL to 10 seconds, which at my data rate is a reasonable amount of memory (<5 MB), and I've never observed more than 2–3 seconds' worth of data being skipped at once. I would have thought I'd get reliability under those circumstances. – scott Apr 05 '13 at 00:13
  • Something looks wrong, try enabling [OpenPGM logging](https://code.google.com/p/openpgm/wiki/OpenPgm5CReferenceErrorHandling) and see if anything interesting appears. – Steve-o Apr 05 '13 at 00:46
  • OpenPGM logging provided a lot of info that made absolutely no sense to me. But I bumped my ZMQ_RECOVERY_IVL up to 100 seconds and my ZMQ_SNDBUF and ZMQ_RCVBUF to 10 MB, and that caused a dramatic improvement in reliability. – scott Apr 05 '13 at 20:10

ZeroMQ makes exactly one guarantee: messages are delivered complete; you will never receive a partial message. It makes no guarantee of reliable delivery. You should check the documentation on high-water-mark (HWM) behavior, which is the most common cause of dropped messages, as illustrated by the "Suicidal Snail" pattern in the ZeroMQ Guide.

minrk
  • Isn't EPGM "reliable multicast"? Why bother having that protocol available through ZMQ if it doesn't actually get you reliability? Why not just support multicast through raw UDP sockets? – scott Apr 04 '13 at 19:43
  • @scott PGM also implements ordered delivery: switched networks can reorder packets, and PGM reassembles the original sequence. There are also important semantics of multicast path definition that require supporting architecture. – Steve-o Apr 04 '13 at 20:28