
I'm working on a project where I distribute compute tasks to multiple Python processes, each associated with its own CUDA device.

When spawning the subprocesses, I use the following code:

import pycuda.driver as cuda

class ComputeServer(object):
    def _init_workers(self):
        self.workers = []
        cuda.init()
        for device_id in range(cuda.Device.count()):
            print "initializing device {}".format(device_id)
            worker = CudaWorker(device_id)
            worker.start()
            self.workers.append(worker)

The CudaWorker is defined in another file as follows:

from multiprocessing import Process
import pycuda.driver as cuda

class CudaWorker(Process):
    def __init__(self, device_id):
        Process.__init__(self)
        self.device_id = device_id

    def run(self):
        self._init_cuda_context()
        while True:
            pass  # process requests here

    def _init_cuda_context(self):
        # the following line fails
        cuda.init()
        device = cuda.Device(self.device_id)
        self.cuda_context = device.make_context()

When I run this code on Windows 7 or Linux, I have no issues. When running the code on my MacBook Pro with OS X 10.8.2, CUDA 5.0, and PyCuda 2012.1, I get the following error:

Process CudaWorker-1:
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/Users/tombnorwood/pymodules/computeserver/worker.py", line 32, in run
    self._init_cuda_context()
  File "/Users/tombnorwood/pymodules/computeserver/worker.py", line 38, in _init_cuda_context
    cuda.init()
RuntimeError: cuInit failed: no device

I have no issues running PyCuda scripts without forking new processes on my Mac. I only get this issue when spawning a new Process.

Has anyone run into this issue before?

tnorwood
  • I suspect that this is related to fact that OS X has a whole lot of core frameworks that can't be used after `fork`, and either PyCuda or CUDA itself relies on one of them… – abarnert Feb 06 '13 at 00:01
  • I actually figured that was the case as well. Is there any way around this? It's really quite annoying. – tnorwood Feb 06 '13 at 00:03
  • If that's the case, the simplest workaround is to exec a new Python interpreter, instead of continuing to use the forked one. There's a patched version of `multiprocessing` around somewhere that does this. (It may one day be added to trunk as an option, but it'll never be the default, because that would make OS X `multiprocessing` work more like Windows than like POSIX.) If you want, and you can't find it or figure out how to do it yourself (it's pretty simple, actually), I can dig around for it. – abarnert Feb 06 '13 at 00:07
  • Never mind, found it. The Mac-specific issue was filed as a bug, which was closed as a dup of [#8713](http://bugs.python.org/issue8713) (which says "Linux" but is really about all POSIX platforms). If you copy `multiprocessing.py` out of the source, apply the patch, rename it to something else, then just call `mymultiprocessing.forking_disable()` before anything else, it should work. (The patch may need a bit of massaging for 2.7, but it shouldn't be too hard.) – abarnert Feb 06 '13 at 00:11
  • Actually, the _simplest_ workaround is probably just using threads instead of processes. Unless CUDA isn't thread-safe, or your code is CPU-bound (instead of just GPU-bound), that should work fine, right? – abarnert Feb 06 '13 at 00:19
  • The CudaWorker processes are definitely more GPU-bound than CPU-bound, so the threading solution will likely work. If I remember correctly, ever since PyCuda 0.9 CUDA kernel calls release the GIL, so using threading shouldn't cause any issues there either. – tnorwood Feb 06 '13 at 00:32
  • Substituting the Thread approach for the Multiprocessing approach did the trick. Thanks for all the help. – tnorwood Feb 06 '13 at 00:35

1 Answer


This is really just an educated guess based on my experience, but I suspect that the OS X implementation of CUDA (or possibly PyCuda) relies on some APIs that can't be used safely after fork, while the Linux implementation does not.* Since the POSIX implementation of `multiprocessing` uses fork without exec to create child processes, this would explain why it fails on OS X but not Linux. (And on Windows, there is no fork, just a spawn equivalent, so this isn't an issue.)

The simplest solution would be to drop `multiprocessing`. If CUDA and PyCUDA are thread-safe (I don't know if they are), and your code is not CPU-bound (just GPU-bound), you might be able to just drop in `threading.Thread` in place of `multiprocessing.Process` and be done with it. Or you could consider one of the other parallel-processing libraries that provide APIs similar to `multiprocessing`. (There are a few people who use `pp` only because it always execs…)
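A minimal sketch of that drop-in swap, assuming the thread-per-device design from the question. The CUDA calls are replaced by a placeholder computation so the shape of the code is visible without a GPU; the task list and result list here are illustrative, not from the original project:

```python
import threading

class CudaWorker(threading.Thread):
    def __init__(self, device_id, tasks, results):
        threading.Thread.__init__(self)
        self.device_id = device_id
        self.tasks = tasks
        self.results = results

    def run(self):
        # With threads there is no fork, so cuda.init() and
        # device.make_context() would go right here, exactly as in the
        # Process version. For this sketch the "kernel launch" is a
        # plain Python computation.
        for task in self.tasks:
            self.results.append((self.device_id, task * task))

results = []
workers = [CudaWorker(device_id, [1, 2], results) for device_id in range(2)]
for w in workers:
    w.start()
for w in workers:
    w.join()
print(sorted(results))  # [(0, 1), (0, 4), (1, 1), (1, 4)]
```

Since kernel launches release the GIL in recent PyCuda versions (as noted in the comments), GPU-bound threads like these can genuinely run in parallel.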

However, it's pretty easy to hack up `multiprocessing` to exec/spawn a new Python interpreter and then do everything Windows-style instead of POSIX-style. (Getting every case right is difficult, but getting one specific use case right is easy.)

Or, if you look at bug [#8713](http://bugs.python.org/issue8713), there's some work being done on making this work right in general. And there are working patches. Those patches are for 3.3, not 2.7, so you'd probably need a bit of massaging, but it shouldn't be very much. So, just `cp $MY_PYTHON_LIB/multiprocessing.py $MY_PROJECT_DIR/mymultiprocessing.py`, patch it, use `mymultiprocessing` in place of `multiprocessing`, and add the appropriate call to pick spawn/fork+exec/whatever the mode is called in the latest patch before you do anything else.
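If patching `multiprocessing` is too much trouble, the same exec-a-fresh-interpreter idea can be sketched directly with `subprocess`: each child is a brand-new interpreter, so nothing is inherited across a fork. In the real project the child would run the worker script (which calls `cuda.init()` and `make_context()` itself); the `-c` stub below stands in for that script:

```python
import subprocess
import sys

def launch_worker(device_id):
    # Each worker is a freshly exec'ed interpreter, so no fork-unsafe
    # framework state is carried over from the parent. A real version
    # would run the worker module and pass the device id, e.g.:
    #   [sys.executable, "-m", "computeserver.worker", str(device_id)]
    code = "import sys; print('worker on device ' + sys.argv[1])"
    return subprocess.Popen([sys.executable, "-c", code, str(device_id)],
                            stdout=subprocess.PIPE)

procs = [launch_worker(device_id) for device_id in range(2)]
outputs = [p.communicate()[0].decode().strip() for p in procs]
print(outputs)  # ['worker on device 0', 'worker on device 1']
```

The trade-off is that you lose `multiprocessing`'s queues and pipes and have to arrange your own IPC (stdin/stdout, sockets, etc.).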


* The OP says he suspected the same thing, so I probably don't need to explain this to him, but for future readers: It's not about a difference between Darwin and other Unixes, but about the fact that Apple ships a lot of non-Unix-y mid-level libraries like CoreFoundation.framework, Accelerate.framework, etc. that use unsafe-after-fork functionality (or just assert that they're not being used after a fork, because Apple doesn't want to put in the rigorous testing that would be warranted before they could say "as of 10.X, Foo.framework is safe after fork"). Also, if you compare the way OS X and Linux deal with graphics and other hardware, there's a lot more mid-level, in-each-process userspace work going on in OS X.

abarnert