
I recently started learning 0MQ. Earlier today, I ran into a blog post, Python Multiprocessing with ZeroMQ. It discusses the ventilator pattern that I had read about in the 0MQ Guide, so I decided to give it a try.

Instead of having the workers just calculate products of numbers as the original code does, I decided to have the ventilator send large arrays to the workers via 0MQ messages. The following is the code I have been using for my "experiments".

As noted in a comment below, any time I attempt to increase the variable string_length to a number larger than 3MB, the code hangs.

Typical symptom: let's say we set string_length to 4MB (i.e. 4194304); then the result manager may get the result from one worker, and then the code just pauses. htop shows the two cores not doing much, and the EtherApe network traffic monitor shows no traffic on the lo interface either.
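For scale, note that each send_pyobj call pickles the entire payloads tuple, so every worker receives one message of roughly number_of_elements * string_length bytes. A quick back-of-envelope sketch in plain Python (names mirror the script's globals; the stand-in payload is deliberately tiny):

```python
import pickle
from os import urandom

number_of_elements = 128
string_length = 1024 * 1024 * 3  # 3MB per element, as in the script

# Lower bound on one message: the raw random bytes alone
raw_bytes = number_of_elements * string_length
print(raw_bytes // (1024 * 1024))  # 384 (MB), before pickle overhead

# With a tiny stand-in payload, confirm that pickling adds only
# modest overhead on top of the raw data
sample = [urandom(1024) for _ in range(4)]
pickled = pickle.dumps({'num': (sample, [])})
assert len(pickled) >= 4 * 1024
```

So each of the np workers gets its own ~384MB message, and all of them have to clear the ventilator's send queue.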

So far, after hours of looking around, I have not been able to figure out what's causing this, and would appreciate a hint or two as to why, and any resolution of this issue. Thanks!

I am running Ubuntu 11.04 64-bit on a Dell notebook with an Intel Core Duo CPU, 8GB RAM, an 80GB Intel X25-M G2 SSD, Python 2.7.1+, libzmq1 2.1.10-1chl1~natty1, and python-pyzmq 2.1.10-1chl1~natty1.

import time
import zmq
from multiprocessing import Process, cpu_count

np = cpu_count() 
pool_size = np
number_of_elements = 128
# Odd: why does the code hang once string_length is bumped to 3MB or above?
string_length = 1024 * 1024 * 3

def create_inputs(nelem, slen, pb=True):
    '''
    Generates an array that contains nelem fix-sized (of slen bytes)
    random strings and an accompanying array of hexdigests of the 
    former's elements.  Both are returned in a tuple.

    :type nelem: int
    :param nelem: The desired number of elements in the to be generated
                  array.
    :type slen: int
    :param slen: The desired number of bytes of each array element.
    :type pb: bool
    :param pb: If True, displays a text progress bar during input array
               generation.
    '''
    from os import urandom
    import sys
    import hashlib

    if pb:
        if nelem <= 64:
            toolbar_width = nelem
            chunk_size = 1
        else:
            toolbar_width = 64
            chunk_size = nelem // toolbar_width
        description = '%d random strings of %d bytes. ' % (nelem, slen) 
        s = ''.join(('Generating an array of ', description, '...\n'))
        sys.stdout.write(s)
        # create an ASCII progress bar
        sys.stdout.write("[%s]" % (" " * toolbar_width))
        sys.stdout.flush()
        sys.stdout.write("\b" * (toolbar_width+1)) 
    array   = list()
    hash4a  = list()
    try:
        for i in range(nelem):
            e = urandom(int(slen))
            array.append(e)
            h = hashlib.md5()
            h.update(e)
            he = h.hexdigest()
            hash4a.append(he)
            if pb and (i + 1) % chunk_size == 0:
                sys.stdout.write("-")
                sys.stdout.flush()
        if pb:
            sys.stdout.write("\n")
    except MemoryError:
        print('Memory Error: discarding existing arrays')
        array  = list()
        hash4a = list()
    finally:
        return array, hash4a

# The "ventilator" function generates an array of nelem fix-sized (of slen
# bytes long) random strings, and sends the array down a zeromq "PUSH"
# connection to be processed by listening workers, in a round robin load
# balanced fashion.

def ventilator():
    # Initialize a zeromq context
    context = zmq.Context()

    # Set up a channel to send work
    ventilator_send = context.socket(zmq.PUSH)
    ventilator_send.bind("tcp://127.0.0.1:5557")

    # Give everything a second to spin up and connect
    time.sleep(1)

    # Create the input array
    nelem = number_of_elements
    slen = string_length
    payloads = create_inputs(nelem, slen)

    # Send an array to each worker
    for num in range(np):
        work_message = { 'num' : payloads }
        ventilator_send.send_pyobj(work_message)

    time.sleep(1)

# The "worker" functions listen on a zeromq PULL connection for "work"
# (array to be processed) from the ventilator, get the length of the array
# and send the results down another zeromq PUSH connection to the results
# manager.

def worker(wrk_num):
    # Initialize a zeromq context
    context = zmq.Context()

    # Set up a channel to receive work from the ventilator
    work_receiver = context.socket(zmq.PULL)
    work_receiver.connect("tcp://127.0.0.1:5557")

    # Set up a channel to send result of work to the results reporter
    results_sender = context.socket(zmq.PUSH)
    results_sender.connect("tcp://127.0.0.1:5558")

    # Set up a channel to receive control messages over
    control_receiver = context.socket(zmq.SUB)
    control_receiver.connect("tcp://127.0.0.1:5559")
    control_receiver.setsockopt(zmq.SUBSCRIBE, "")

    # Set up a poller to multiplex the work receiver and control receiver channels
    poller = zmq.Poller()
    poller.register(work_receiver, zmq.POLLIN)
    poller.register(control_receiver, zmq.POLLIN)

    # Loop and accept messages from both channels, acting accordingly
    while True:
        socks = dict(poller.poll())

        # If the message came from work_receiver channel, get the length
        # of the array and send the answer to the results reporter
        if socks.get(work_receiver) == zmq.POLLIN:
            #work_message = work_receiver.recv_json()
            work_message = work_receiver.recv_pyobj()
            length = len(work_message['num'][0])
            answer_message = { 'worker' : wrk_num, 'result' : length }
            results_sender.send_json(answer_message)

        # If the message came over the control channel, shut down the worker.
        if socks.get(control_receiver) == zmq.POLLIN:
            control_message = control_receiver.recv()
            if control_message == "FINISHED":
                print("Worker %i received FINISHED, quitting!" % wrk_num)
                break

# The "results_manager" function receives each result from multiple workers,
# and prints those results.  When all results have been received, it signals
# the worker processes to shut down.

def result_manager():
    # Initialize a zeromq context
    context = zmq.Context()

    # Set up a channel to receive results
    results_receiver = context.socket(zmq.PULL)
    results_receiver.bind("tcp://127.0.0.1:5558")

    # Set up a channel to send control commands
    control_sender = context.socket(zmq.PUB)
    control_sender.bind("tcp://127.0.0.1:5559")

    for task_nbr in range(np):
        result_message = results_receiver.recv_json()
        print "Worker %i answered: %i" % (result_message['worker'], result_message['result'])

    # Signal to all workers that we are finished
    control_sender.send("FINISHED")
    time.sleep(5)

if __name__ == "__main__":

    # Create a pool of workers to distribute work to
    for wrk_num in range(pool_size):
        Process(target=worker, args=(wrk_num,)).start()

    # Fire up our result manager...
    result_manager = Process(target=result_manager, args=())
    result_manager.start()

    # Start the ventilator!
    ventilator = Process(target=ventilator, args=())
    ventilator.start()
user183394
  • I did more experiments: I lowered number_of_elements to 64 and increased string_length to 6MB. The code still ran fine. Above that, the same symptom appeared. This led me to believe that there might be an overall message size limit somewhere in the pyzmq binding. The 0MQ C API has this [link](http://api.zeromq.org/2-1:zmq-msg-init-size) zmq_msg_init_size(3) function, which I can't find in pyzmq's documentation. Could this be the cause? – user183394 Jan 18 '12 at 06:18
  • Can you get a traceback where it is hanging? It might give you a hint. – Aaron Watters Jan 18 '12 at 14:27
  • I tried your code on my mac laptop with string_length = 1024 * 1024 * 4 and it worked fine, so I'm guessing it must have something to do with some kind of memory contention. – Aaron Watters Jan 18 '12 at 14:43
  • ...and ran it again, and it froze up... looking at "top" the free memory was bouncing around near 0 so it looks like 0mq is not optimized to handle messages of this size. – Aaron Watters Jan 18 '12 at 15:00
  • @Aaron Watters. I have come to a similar conclusion. But before I point my finger at 0MQ itself, I will find some time to redo the above in C++. As I noticed from a quick browse of the source, even though pyzmq uses zmq_msg_init_size(), it doesn't expose it. I wonder whether the outcome would be different with that function. – user183394 Jan 18 '12 at 17:15

1 Answer


The problem is that your ventilator (PUSH) socket is closing before it's done sending. You have a sleep of 1s at the end of the ventilator function, which is not enough time to send 384MB messages. That's why you see the threshold you do; if the sleep were shorter, the threshold would be lower.

That said, LINGER is supposed to prevent this sort of thing, so I would bring this up with zeromq: PUSH does not appear to respect LINGER.
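For reference, the linger behavior in question is a per-socket option. A minimal pyzmq sketch (assuming pyzmq is installed; no traffic is actually sent) showing how it can be read and set explicitly:

```python
import zmq

context = zmq.Context()
sender = context.socket(zmq.PUSH)

# -1 (the libzmq default) means close() should block until all
# queued messages have been handed off to the transport;
# 0 means discard anything still pending.
sender.setsockopt(zmq.LINGER, -1)
print(sender.getsockopt(zmq.LINGER))  # -1

sender.close()
context.term()
```

In principle, with the default linger, the ventilator exiting right after send_pyobj should still flush its queue; the FINISHED-signal fix works around that not happening here.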

A fix for your particular example (without adding an indeterminately long sleep) would be to use the same FINISHED signal to terminate your ventilator as your workers. That way, you guarantee that your ventilator survives as long as it needs to.

Revised ventilator:

def ventilator():
    # Initialize a zeromq context
    context = zmq.Context()

    # Set up a channel to send work
    ventilator_send = context.socket(zmq.PUSH)
    ventilator_send.bind("tcp://127.0.0.1:5557")

    # Set up a channel to receive control messages
    control_receiver = context.socket(zmq.SUB)
    control_receiver.connect("tcp://127.0.0.1:5559")
    control_receiver.setsockopt(zmq.SUBSCRIBE, "")

    # Give everything a second to spin up and connect
    time.sleep(1)

    # Create the input array
    nelem = number_of_elements
    slen = string_length
    payloads = create_inputs(nelem, slen)

    # Send an array to each worker
    for num in range(np):
        work_message = { 'num' : payloads }
        ventilator_send.send_pyobj(work_message)

    # Poll for FINISH message, so we don't shutdown too early
    poller = zmq.Poller()
    poller.register(control_receiver, zmq.POLLIN)

    while True:
        socks = dict(poller.poll())

        if socks.get(control_receiver) == zmq.POLLIN:
            control_message = control_receiver.recv()
            if control_message == "FINISHED":
                print("Ventilator received FINISHED, quitting!")
                break
            # else: unhandled message
minrk
  • minrk, many thanks for the insightful answer. Very helpful! I didn't suspect the ZMQ_LINGER value as set by zmq_setsockopt(3), since, as you said, the default value is -1 (infinite). Great catch! I will definitely raise the issue with the pyzmq folks first and mention it on the zeromq mailing list as well. I tested your fix all the way up to string_length set to 1024 * 1024 * 10, maxed out my notebook's physical RAM, and still got the anticipated result. Thanks again! – user183394 Jan 18 '12 at 22:18
  • Maybe not worth it to bring it up with the 'pyzmq folks', since that's basically me right now. I've pinged libzmq about it, and written a simpler test case in C: https://gist.github.com/1643223 – minrk Jan 19 '12 at 22:16