The problem:
When sending 1000 tasks to apply_async, they run in parallel on all 48 CPUs, but then sometimes fewer and fewer CPUs run, until only one CPU is left running, and only when that last one finishes its task do all the CPUs start running again, each with a new task. It shouldn't need to wait for any "task batch" like this.

My (simplified) code:

from multiprocessing import Pool
pool = Pool(47)
tasks = [pool.apply_async(json2features, (j,)) for j in jsons]
feats = [t.get() for t in tasks]

jsons = [...] is a list of about 1000 JSONs already loaded into memory and parsed to objects.
json2features(json) does some CPU-heavy work on a json, and returns an array of numbers.
This function may take between 1 second and 15 minutes to run, and because of this I sort the jsons using a heuristic, so that the longest tasks are hopefully first in the list and thus start first.
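For what it's worth, the sorting step is just something along these lines; estimated_cost is a made-up heuristic for illustration (here the in-memory size of the parsed object as a rough proxy for runtime):

import sys

def estimated_cost(j):
    # Rough, illustrative heuristic: use the size of the parsed object
    # as a proxy for how long json2features will take on it.
    return sys.getsizeof(j)

# Submit the presumably longest tasks first so they don't end up as
# stragglers at the tail of the run.
jsons.sort(key=estimated_cost, reverse=True)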

The json2features function also prints when a task is finished and how long it took. It all runs on an Ubuntu server with 48 cores, and as I said above, it starts out great, using all 47 cores. Then, as tasks get completed, fewer and fewer cores run, which would at first sound perfectly fine, were it not for the fact that after the last busy core finishes (when I see its print to stdout), all CPUs start running again on new tasks, meaning it wasn't really the end of the list. It may do the same thing again, and then again, before the actual end of the list.

Sometimes it can be using just one core for 5 minutes, and when the task is finally done, it starts using all cores again, on new tasks. (So it's not stuck on some IPC overhead)

There are no repeated jsons, nor any dependencies between them (it's all static, fresh-from-disk data, no references etc.), nor any dependency between json2features calls (no global state or anything), except for them using the same terminal for their prints.

I suspected that the problem was that a worker doesn't get released until get is called on its result, so I tried the following code:

from multiprocessing import Pool
pool = Pool(47)
tasks = [pool.apply_async(print, (i,)) for i in range(1000)]
# feats = [t.get() for t in tasks]

And it does print all 1000 numbers, even though get isn't called.

I have run out of ideas as to what the problem might be.
Is this really the normal behavior of Pool?

Thanks a lot!

Rabak
  • Is it possible the [de]serialisation of the result objects in the parent process is taking a long time and blocks the other children fetching a new job to work on (from the pool/parent process)? – Tom Dalton Oct 30 '17 at 10:52
  • "json2features(json) does some CPU-heavy work on a json, and returns an array of numbers." How big is that result object? – Tom Dalton Oct 30 '17 at 10:53
  • The deserialisation is finished by the time this code runs (it's waited upon to finish). The size of each array is about 0.3 MB, so all 1000 make 300 MB (I later append them all and store with pickle, and the files are this size). The server has 94 GB RAM. Thank you for your reply, I didn't see it before – Rabak Oct 30 '17 at 12:28
  • Also, it's Python 3.5, 64-bit – Rabak Oct 30 '17 at 12:29
  • What if you try `map_async` instead? – Yaroslav Surzhikov Oct 30 '17 at 16:16
  • Tried map_async, saw that it even gets 'stuck' at 0 CPUs 'working' for a long time, and doesn't immediately start after the last one. I think this is proof that the IPC overhead is the problem. Apparently, `Pool` has a high "cost per byte" when it comes to synchronizing all the result data between all the processes. My next course of action would be to have them write with gzip to a file instead of returning the array, then read those files afterwards; I think this will be faster. I think it's not cost-effective in this project though... if I do get the chance to try, I'll post the results – Rabak Oct 31 '17 at 08:34
  • To clarify, it had 0 CPUs working for a couple of minutes, then continued with MORE jsons. In other words, it wasn't the end of the list. – Rabak Oct 31 '17 at 08:36

1 Answer


The multiprocessing.Pool relies on a single os.pipe to deliver the tasks to the workers.

On Unix, the default pipe size usually ranges from 4 to 64 KB. If the JSONs you are delivering are large, the pipe may get clogged at any given point in time.

This means that, while one of the workers is busy reading the large JSON from the pipe, all the other workers will starve.

It is generally bad practice to share large amounts of data via IPC, as it leads to poor performance. This is underlined in the multiprocessing programming guidelines as well.

Avoid shared state

As far as possible one should try to avoid shifting large amounts of data between processes.

Instead of reading the JSON files in the main process, just send the workers the file names and let them open and read the content. You will surely notice an improvement in performance, because you are moving the JSON loading phase into the concurrent domain as well.
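A minimal sketch of that approach, assuming each JSON lives in its own file; load_and_extract is a hypothetical wrapper around your existing json2features:

from multiprocessing import Pool
import json

def load_and_extract(path):
    # Each worker opens and parses its own JSON from disk, so only a
    # short file path travels through the task pipe.
    with open(path) as f:
        obj = json.load(f)
    return json2features(obj)  # your CPU-heavy function

if __name__ == '__main__':
    json_paths = [...]  # file paths instead of parsed objects
    with Pool(47) as pool:
        tasks = [pool.apply_async(load_and_extract, (p,)) for p in json_paths]
        feats = [t.get() for t in tasks]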

Note that the same is true also for the results. A single os.pipe is used to return the results to the main process as well. If one or more workers clog the results pipe then you will get all the processes waiting for the main one to drain it. Large results should be written to files as well. You can then leverage multithreading on the main process to quickly read back the results from the files.
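A rough sketch of that result-side pattern, assuming gzip-pickled files are acceptable; json2features_to_file and read_result are made-up names, and load_and_extract is the wrapper from the previous sketch:

import gzip
import pickle
from concurrent.futures import ThreadPoolExecutor

def json2features_to_file(path, out_path):
    # The worker writes its (possibly large) feature array to disk and
    # returns only the small out_path string through the result pipe.
    feats = load_and_extract(path)
    with gzip.open(out_path, 'wb') as f:
        pickle.dump(feats, f)
    return out_path

def read_result(out_path):
    with gzip.open(out_path, 'rb') as f:
        return pickle.load(f)

# In the main process, after submitting json2features_to_file jobs:
# out_paths = [t.get() for t in tasks]           # cheap: just strings
# with ThreadPoolExecutor() as ex:
#     feats = list(ex.map(read_result, out_paths))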

noxdafox
  • aha, great link I should have read those guidelines! it makes complete sense. Thank you! – Rabak Oct 31 '17 at 09:29
  • @Rabak in addition I propose to read [this](https://github.com/JohnStarich/python-pool-performance) – Yaroslav Surzhikov Nov 01 '17 at 01:02
  • thanks, I also tried what you said now, writing to files instead of using the built-in pipe, and it not only solved the 'idle problem' but dramatically increased performance! – Rabak Nov 01 '17 at 09:12
  • Told you so "You will surely notice an improvement in performance..." ;) – noxdafox Nov 01 '17 at 09:22