
I have a REP socket that's connected to many REQ sockets, each running on a separate Google Compute Engine instance. I'm trying to accomplish the synchronization detailed in the ZMQ Guide's syncpub/syncsub example, and my code looks pretty similar to that example:

import zmq

context = zmq.Context()
sync_reply = context.socket(zmq.REP)
sync_reply.bind('tcp://*:5555')

# start a bunch of other sockets ...

# Wait until every instance has checked in before proceeding.
ready = 0
while ready < len(all_instances):
    sync_reply.recv()       # empty request from an instance
    sync_reply.send(b'')    # empty reply releases that instance
    ready += 1

And each instance is running the following code:

import zmq

context = zmq.Context()
sync_request = context.socket(zmq.REQ)
sync_request.connect('tcp://IP_ADDRESS:5555')

# Check in with the server and block until released.
sync_request.send(b'')
sync_request.recv()

# start other sockets and do other work ...

This system works fine up until a certain number of instances (around 140). Any more, though, and the REP socket will not receive all of the requests. It also seems like the requests it drops are from different instances each time, which leads me to believe that all the requests are indeed being sent, but the socket is just not receiving any more than (about) 140 of them.

I've tried setting the high water mark for the sockets, spacing out the requests over the span of a few seconds, and switching to ROUTER/DEALER sockets, all with no improvement. The part that confuses me the most is that the syncsub/syncpub example code (mentioned above) works fine for me with up to 200 Google Compute Engine instances, which is as many as I can start. I'm not sure what about my code specifically is causing this problem; any help or tips would be appreciated.
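For reference, a minimal sketch of how I set the high-water marks (the option names are standard pyzmq constants; the limit of 10000 is just an illustrative value):

```python
import zmq

context = zmq.Context()
sync_reply = context.socket(zmq.REP)

# Queue limits must be set before bind/connect to take effect.
sync_reply.setsockopt(zmq.RCVHWM, 10000)  # max queued incoming messages
sync_reply.setsockopt(zmq.SNDHWM, 10000)  # max queued outgoing messages

sync_reply.bind('tcp://*:5555')
```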

Marvin
  • what happens if you comment out your "start a bunch of other sockets" code? Does your capacity go up, or stay the same? – Jason Aug 05 '14 at 13:36
  • @Jason I may have fudged the truth a little, in that "start a bunch of other sockets" also includes a bit of work outside of starting sockets, and the while loop is actually inside a function that gets called later. Commenting out the other sockets (I have 6, for doing other work) does seem to increase the capacity, though, up to 200 instances which is my personal quota for GCE. Around ~180 instances, however, it starts to become a little unreliable again, i.e. it may take more than one try to successfully receive all requests. Any idea why all of this would be the case? Thanks for your help! – Marvin Aug 05 '14 at 20:21
  • ZMQ is designed more for high message volume than high connection volume... it's not that it won't stand a high connection volume, but rather that that's not where the work has gone. See some context [here](http://stackoverflow.com/questions/23625210/load-testing-zeromq-zmq-stream-for-finding-the-maximum-simultaneous-users-it-c) [here](http://hintjens.com/blog:42) and [here](http://comments.gmane.org/gmane.network.zeromq.devel/19524). I haven't used GCE, what sort of memory limits are you working with? – Jason Aug 05 '14 at 20:50
  • @Jason I'm not sure what you mean by memory limits, but for Google Compute Engine I have a hard limit of 200 instances and 1600 CPU's (i.e., I'm not allowed any more than that at any one time). The instances I've been starting up are the smallest ones they offer (specs are 1 vCPU shared physical core and 0.6 GB RAM) since there's no sense in wasting money, but I don't think that's the issue since the same problem appears with larger instances as well. Is scaling down the number of sockets I use the only feasible solution in this case? – Marvin Aug 06 '14 at 00:58
  • I don't really know what your options are, I haven't actually tried what you're doing, I've just gained some information from other people trying to use ZMQ in this way. When you scale up the number of connections, then you'll run into issues that need managing. I'm betting the 0.6GB RAM is going to limit the number of connections you can run, though I don't have a sense where that limit should reasonably be. What sort of message volume are you running on the other sockets that aren't part of this particular problem? What's your average message size on those sockets? – Jason Aug 06 '14 at 14:47
  • @Jason I realized two of the other sockets were unnecessary, so apart from this socket I have four others. Basically, I have two different types of jobs, and they both involve sending function calls to the instances and receiving results, hence the four sockets. So the message size is relatively small, just one (or a couple) serialized function call (or calls). I suppose the message volume is pretty substantial, though, certainly comparable to the volume on the socket I'm having issues with. – Marvin Aug 06 '14 at 21:00
  • If "substantial" means many thousands of messages per second (or more!) per socket, then I could guess that the overhead might stress 600MB RAM. Again, I don't really know what's reasonable to expect here, my own use case is relatively low-volume. Since your issue comes with scaling, it's reasonable to think that system resources are the culprit, and your RAM appears to be the most limiting factor. If you know you can get it to work by scaling sockets down, that's an option, otherwise I would try larger instances with more RAM and/or more cores and see if that gets you anywhere. – Jason Aug 07 '14 at 13:43
  • @Jason Thanks for all your help! I've been able to make some progress by scaling down the number of sockets I use, and I'll see what larger instances can offer. – Marvin Aug 08 '14 at 00:36
  • Wish I could have offered a concrete solution, but glad you're further down the trail. – Jason Aug 08 '14 at 13:36

1 Answer


Answering my own question - it seems like it was an issue with the large number of sockets I was using, and also possibly the memory limitations of the GCE instances used. See comment thread above for more details.
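As a sanity check that the barrier logic itself is sound at small scale, here's a self-contained sketch of the same REQ/REP synchronization run locally, with a handful of threads standing in for GCE instances (the worker function, instance count, and loopback endpoint are all illustrative):

```python
import threading
import zmq

NUM_INSTANCES = 5  # stand-in for the GCE instances
context = zmq.Context()  # contexts are thread-safe; individual sockets are not

def instance_worker(endpoint):
    # Each "instance" sends an empty request and blocks until released.
    sock = context.socket(zmq.REQ)
    sock.connect(endpoint)
    sock.send(b'')
    sock.recv()
    sock.close()

sync_reply = context.socket(zmq.REP)
port = sync_reply.bind_to_random_port('tcp://127.0.0.1')
endpoint = 'tcp://127.0.0.1:%d' % port

threads = [threading.Thread(target=instance_worker, args=(endpoint,))
           for _ in range(NUM_INSTANCES)]
for t in threads:
    t.start()

# Barrier: answer one empty request per instance.
ready = 0
while ready < NUM_INSTANCES:
    sync_reply.recv()
    sync_reply.send(b'')
    ready += 1

for t in threads:
    t.join()
sync_reply.close()
context.term()
print('released', ready, 'instances')
```

At five connections this works every time; per the comment thread, the trouble only appears when the connection count, socket count, and per-instance RAM all scale up together.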

Marvin