14

I am using Python's multiprocessing module for scientific parallel processing. My code uses several worker processes which do the heavy lifting and a writer process which persists the results to disk. The data to be written is sent from the worker processes to the writer process via a Queue. The data itself is rather simple and consists solely of a tuple holding a filename and a list with two floats. After several hours of processing, the writer process often gets stuck. More precisely, the following block of code

while True:
    try:
        item = queue.get(timeout=60)
        break
    except Exception as error:
        logging.info("Writer: Timeout occurred {}".format(str(error)))

will never exit the loop and I get continuous 'Timeout' messages.
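
For context, a minimal sketch of the setup described above (the worker count, item counts and function names are hypothetical, not taken from my actual code):

import logging
from multiprocessing import Process, Queue

def worker(queue):
    # heavy lifting; each result is (filename, [float, float])
    for i in range(10):
        queue.put(("file_{}.dat".format(i), [1.0 * i, 2.0 * i]))

def writer(queue, n_items):
    # persists results to disk as they arrive
    for _ in range(n_items):
        item = queue.get(timeout=60)
        logging.info("Writer: got {}".format(item))

if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    queue = Queue(maxsize=48)  # bounded queue, as in my setup
    workers = [Process(target=worker, args=(queue,)) for _ in range(4)]
    w = Process(target=writer, args=(queue, 40))  # 4 workers x 10 items
    for p in workers + [w]:
        p.start()
    for p in workers + [w]:
        p.join()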

I also implemented a logging process which outputs, among other things, the status of the queue, and even though I get the timeout error message above, a call to qsize() constantly returns a full queue (size=48 in my case).

I have thoroughly checked the documentation on the queue object and can find no explanation for why get() times out while the queue is full at the same time.

Any ideas?

Edit:

I modified the code to make sure I catch an empty queue exception:

from Queue import Empty

while True:
    try:
        item = queue.get(timeout=60)
        break
    except Empty as error:
        logging.info("Writer: Timeout occurred {}".format(str(error)))
AndreJohannes
  • I have exactly the same problem. However, I sidestepped it by retrying .get() after the timeout if the queue is still full, and that usually works, but it is not really a solution. Did you manage to find one? – Jaka Jun 14 '18 at 07:46
  • Just to check: I understand that you are using a `multiprocessing.Queue` and not a `Queue` from any other module. Right? – azelcer Jan 21 '22 at 03:39
  • 1. Why are you certain that there is a message on the queue? 2. Why are you doing a `get` with timeout instead of just a blocking `get` with no timeout? This writer process could be a daemon process that will end when the main process ends so there is no problem with it blocking on a `get` -- it would not prevent program termination. Or you can send a special *sentinel* message such as `None` signifying that there are no more messages coming and it should return if you do not want to use a daemon process. – Booboo Jan 21 '22 at 13:16
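
A minimal sketch of the sentinel approach suggested in the last comment (function and variable names are hypothetical):

from multiprocessing import Process, Queue

SENTINEL = None  # special message meaning "no more data is coming"

def writer(queue):
    while True:
        item = queue.get()  # plain blocking get, no timeout needed
        if item is SENTINEL:
            break
        # ... persist item to disk here ...

if __name__ == "__main__":
    queue = Queue()
    w = Process(target=writer, args=(queue,))
    w.start()
    for i in range(3):
        queue.put(("file_{}.dat".format(i), [0.1, 0.2]))
    queue.put(SENTINEL)  # tell the writer it can stop
    w.join()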

3 Answers

3

In multiprocessing, a queue is used as a synchronized message queue, and this also seems to be the case in your problem. However, that requires more than just a call to the get() method. After every task is processed you need to call task_done() so that the element gets removed from the queue.

From the documentation:

Queue.task_done()

Indicate that a formerly enqueued task is complete. Used by queue consumer threads. For each get() used to fetch a task, a subsequent call to task_done() tells the queue that the processing on the task is complete.

If a join() is currently blocking, it will resume when all items have been processed (meaning that a task_done() call was received for every item that had been put() into the queue).

In the documentation you will also find a code example of proper threading queue usage.

In the case of your code it would look like this:

while True:
    try:
        item = queue.get(timeout=60)
        if item is None:
            break
        # call working function here
        queue.task_done()
    except Exception as error:
        logging.info("Writer: Timeout occurred {}".format(str(error)))
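
As the comments below point out, task_done() is only available on multiprocessing.JoinableQueue, not on multiprocessing.Queue. A self-contained sketch of that pattern (names hypothetical, not from the question's code):

from multiprocessing import JoinableQueue, Process

def writer(queue):
    while True:
        item = queue.get()
        if item is None:
            queue.task_done()  # account for the sentinel as well
            break
        # ... persist item to disk here ...
        queue.task_done()  # mark this item as fully processed

if __name__ == "__main__":
    queue = JoinableQueue()
    w = Process(target=writer, args=(queue,))
    w.start()
    for i in range(3):
        queue.put(("file_{}.dat".format(i), [0.1, 0.2]))
    queue.put(None)  # sentinel
    queue.join()     # returns once every item received a task_done()
    w.join()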
Tomasz Plaskota
  • Thank you for your response Tomasz, I doubt however that task_done() is the solution to the problem. According to the documentation, task_done() is only implemented for a joinable queue (multiprocessing.JoinableQueue), which I don't use. Also, almost all examples I find on the internet do not use the task_done method. – AndreJohannes May 22 '17 at 07:06
  • The `Queue.get()` method removes the returned element from the queue. You can see the difference at [python-queue-get-task-done-question](http://stackoverflow.com/questions/1593299/python-queue-get-task-done-question). This might be the correct answer, but the explanation in the top paragraph is misleading. – Pavan May 22 '17 at 07:09
  • Ok, but I use multiprocessing.Queue and not multiprocessing.JoinableQueue, and only the latter offers the task_done() method. I also think the documentation explains clearly that the task_done() method is used to let the queue know when a task is finished so that a join() can unblock, a feature I don't use. – AndreJohannes May 22 '17 at 07:19
  • Yeah, I'll admit that I usually use this in a threading context and not a multiprocessing one. I've been looking in the cpython source to find the exact differences between queue and multiprocessing.queue that could be causing the issue you are having. Also, with a normal `Queue.get()` the item is not removed until it is cleared. `qsize()` returns only an approximate size, not counting elements that are being processed, so in theory it could report an empty queue while still blocking due to the underlying buffer being full. – Tomasz Plaskota May 22 '17 at 07:25
  • As for behavior that would make `qsize()` return the max size while `get()` calls still block, that has me baffled given what I know of queues. By the documentation, when `get()` fails on timeout it should raise a `Queue.Empty` exception (the error is defined in the standard queue module, not multiprocessing). Is that the case for you, or is it a different error? – Tomasz Plaskota May 22 '17 at 07:30
  • Thank you Tomasz, I think that's closer to the solution. I should maybe have a look at the underlying cpython source too. I know the data being sent via a queue has to be picklable, and I assume the processing involved with pickling accounts for the fact that qsize() only returns approximate numbers. What strikes me as odd is that everything seems to work fine for several hours and then suddenly qsize() keeps returning 48 (the max size of my queue) while get() doesn't return any data but raises timeout errors. I would understand if that happened once in a while and resolved itself after some time. – AndreJohannes May 22 '17 at 07:37
  • Tomasz, I just wrote in a comment to noxdafox's response that I will rerun my code now, making sure I get the exact error message. It will take a few hours though... – AndreJohannes May 22 '17 at 09:00
3

Switching to a manager-based queue should help solve this issue.

from multiprocessing import Manager

manager = Manager()
queue   = manager.Queue()

For more details you can check the multiprocessing documentation here: https://docs.python.org/2/library/multiprocessing.html
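
A minimal sketch of how the manager queue fits the worker/writer setup from the question (function names are hypothetical):

from multiprocessing import Manager, Process

def worker(queue):
    # same (filename, [float, float]) payload as in the question
    queue.put(("result.dat", [1.0, 2.0]))

def writer(queue, n_items):
    for _ in range(n_items):
        print(queue.get(timeout=60))  # served by the manager process

if __name__ == "__main__":
    manager = Manager()
    queue = manager.Queue()  # proxy to a queue living in the manager process
    workers = [Process(target=worker, args=(queue,)) for _ in range(4)]
    w = Process(target=writer, args=(queue, len(workers)))
    for p in workers + [w]:
        p.start()
    for p in workers + [w]:
        p.join()

A manager queue lives in a separate server process and is accessed through proxies, which sidesteps the feeder thread and pipe buffering that multiprocessing.Queue uses internally, at the cost of some extra IPC overhead.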

m3th0d
-1

You are catching too generic an Exception and assuming that it is a timeout error.

Try modifying the logic as follows:

from Queue import Empty  # Python 2; on Python 3 use: from queue import Empty

while True:
    try:
        item = queue.get(timeout=60)
        break
    except Empty as error:
        logging.info("Writer: Timeout occurred {}".format(str(error)))
        print(queue.qsize())

and see if the logline is still printed.

noxdafox
  • Ok, I admit I can't be sure it's a timeout error, and I will rerun it now catching the Queue.Empty exception. However, if I look into my log file, I get the 'Writer: Timeout occurred..' message every 60 seconds, which complies with the specified timeout period. – AndreJohannes May 22 '17 at 08:59
  • noxdafox, I modified the code accordingly. I definitely get an empty queue error thrown. It occurs every 60 seconds while the logger process indicates a full queue. – AndreJohannes May 24 '17 at 06:42
  • I modified the code a bit; try adding that line after the exception and verify there actually are elements in the Queue. One more thing: how big are those elements in the Queue? Kb? Mb? Gb? – noxdafox May 24 '17 at 08:21
  • The size of the elements in the queue is quite small, actually. They have the following structure: (filename, [float, float]), where the filename typically is no longer than 20 characters. I am processing around 170000 items in total and the queue usually gets stuck after 50000 to 80000 elements have been sent to the output queue. Ok, I will add the line, though my logger process constantly monitors the queue size and claims it remains full. – AndreJohannes May 24 '17 at 12:58
  • Just got the logs from the latest run. I get a 'Writer: Timeout occurred' and a 'Writer: queue size 48' at the same time. Not sure what to make of it, but I think it is a problem inside the Queue implementation, though I find it a bit weird that no one has encountered this issue before. – AndreJohannes May 27 '17 at 10:08
  • This is curious indeed. It seems it always gets stuck when 48 elements are in the queue. Assuming your input is the same (same elements and order), it might be that the 48th element is the problematic one. I'd suggest logging on the other side which elements you are putting in the `Queue` so you can nail down the problematic one. Another approach you might want to try is to replace the `Queue` with a `Pipe` and see whether you get the same result. – noxdafox May 27 '17 at 17:16
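
A minimal sketch of the Pipe-based variant suggested in the last comment (names hypothetical; note that, unlike a Queue, a single Pipe end should not be written to by several processes at once without extra locking):

from multiprocessing import Pipe, Process

def writer(conn, n_items):
    for _ in range(n_items):
        item = conn.recv()  # blocks until a message arrives
        # ... persist item to disk here ...
    conn.close()

if __name__ == "__main__":
    recv_end, send_end = Pipe(duplex=False)  # one-way pipe
    w = Process(target=writer, args=(recv_end, 3))
    w.start()
    for i in range(3):
        send_end.send(("file_{}.dat".format(i), [0.1, 0.2]))
    send_end.close()
    w.join()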