CUDA runtime GPU initialization with Theano

Question

I am trying to parallelize my neural network across two GPUs, following https://github.com/uoguelph-mlrg/theano_multi_gpu. I have all the dependencies installed, but CUDA runtime initialization fails with the following message:

ERROR (theano.sandbox.cuda): ERROR: Not using GPU. Initialisation of device 0 failed:
cublasCreate() returned this error 'the CUDA Runtime initialization failed'
Error when trying to find the memory information on the GPU: invalid device ordinal
Error allocating 24 bytes of device memory (invalid device ordinal). Driver report 0 bytes free and 0 bytes total
ERROR (theano.sandbox.cuda): ERROR: Not using GPU. Initialisation of device gpu failed:
CudaNdarray_ZEROS: allocation failed.
Process Process-1:
Traceback (most recent call last):
  File "/opt/share/Python-2.7.9/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/opt/share/Python-2.7.9/lib/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "/u/bsankara/nt/Git-nt/nt/train_attention.py", line 171, in launch_train
    clip_c=1.)
  File "/u/bsankara/nt/Git-nt/nt/nt.py", line 1616, in train
    import theano.sandbox.cuda
  File "/opt/share/Python-2.7.9/lib/python2.7/site-packages/theano/__init__.py", line 98, in <module>
    theano.sandbox.cuda.tests.test_driver.test_nvidia_driver1()
  File "/opt/share/Python-2.7.9/lib/python2.7/site-packages/theano/sandbox/cuda/tests/test_driver.py", line 30, in test_nvidia_driver1
    A = cuda.shared_constructor(a)
  File "/opt/share/Python-2.7.9/lib/python2.7/site-packages/theano/sandbox/cuda/var.py", line 181, in float32_shared_constructor
    enable_cuda=False)
  File "/opt/share/Python-2.7.9/lib/python2.7/site-packages/theano/sandbox/cuda/__init__.py", line 389, in use
    cuda_ndarray.cuda_ndarray.CudaNdarray.zeros((2, 3))
RuntimeError: ('CudaNdarray_ZEROS: allocation failed.', 'You asked to force this device and it failed. No fallback to the cpu or other gpu device.')

The relevant part of the code snippet is here:

import os
from multiprocessing import Process, Queue
import zmq
import pycuda.driver as drv
import pycuda.gpuarray as gpuarray

def train(private_args, process_env, <some other args>):
    if process_env is not None:
       os.environ = process_env

    ####
    # pycuda and zmq environment

    drv.init()
    dev = drv.Device(private_args['ind_gpu'])
    ctx = dev.make_context()
    sock = zmq.Context().socket(zmq.PAIR)

    if private_args['flag_client']:
        sock.connect('tcp://localhost:5000')
    else:
        sock.bind('tcp://*:5000')

    ####
    # import theano stuffs
    import theano.sandbox.cuda
    theano.sandbox.cuda.use(private_args['gpu'])

    import theano
    import theano.tensor as tensor
    from theano.sandbox.rng_mrg import MRG_RandomStreams as RandomStreams
    import theano.misc.pycuda_init
    import theano.misc.pycuda_utils
...

The error is triggered when train() imports theano.sandbox.cuda. This is how I launch the training function as two processes:

def launch_train(curr_args, process_env, curr_queue, oth_queue):
    trainerr, validerr, testerr = train(private_args=curr_args,
                                        process_env=process_env,
                                        ...)

process1_env = os.environ.copy()
process1_env['THEANO_FLAGS'] = "cuda.root=/opt/share/cuda-7.0,device=gpu0,floatX=float32,on_unused_input=ignore,optimizer=fast_run,exception_verbosity=high,compiledir=/u/bsankara/.theano/NT_multi_GPU1"
process2_env = os.environ.copy()
process2_env['THEANO_FLAGS'] = "cuda.root=/opt/share/cuda-7.0,device=gpu1,floatX=float32,on_unused_input=ignore,optimizer=fast_run,exception_verbosity=high,compiledir=/u/bsankara/.theano/NT_multi_GPU2"

p = Process(target=launch_train,
                args=(p_args, process1_env, queue_p, queue_q))
q = Process(target=launch_train,
                args=(q_args, process2_env, queue_q, queue_p))

p.start()
q.start()
p.join()
q.join()
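
The two THEANO_FLAGS strings above differ only in the device and compiledir values. As a minimal sketch, they could be built by one helper so the shared flags are written once. The names `make_theano_env` and `COMMON_FLAGS` are hypothetical and introduced here only for illustration; the flag values are copied from the snippet above.

```python
import os

# Flags shared by both workers, copied from the question's THEANO_FLAGS strings.
COMMON_FLAGS = [
    "floatX=float32",
    "on_unused_input=ignore",
    "optimizer=fast_run",
    "exception_verbosity=high",
]


def make_theano_env(device, compiledir, base_env=None):
    """Return a copy of the environment with THEANO_FLAGS set for one worker.

    Hypothetical helper: `device` is e.g. "gpu0" and `compiledir` is that
    worker's private Theano compile directory.
    """
    env = dict(os.environ if base_env is None else base_env)
    flags = ["cuda.root=/opt/share/cuda-7.0", "device=%s" % device]
    flags += COMMON_FLAGS
    flags.append("compiledir=%s" % compiledir)
    env["THEANO_FLAGS"] = ",".join(flags)
    return env


process1_env = make_theano_env("gpu0", "/u/bsankara/.theano/NT_multi_GPU1")
process2_env = make_theano_env("gpu1", "/u/bsankara/.theano/NT_multi_GPU2")
print(process1_env["THEANO_FLAGS"])
print(process2_env["THEANO_FLAGS"])
```

This does not change what each process receives; it only makes it harder for the two flag strings to drift apart.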

However, the import works if I initialize the GPU interactively in Python. I executed the first 20 lines of train() and it worked fine, correctly assigning me gpu0 as requested.

I did some debugging with pdb, and it seems to fail in the /opt/share/Python-2.7.9/lib/python2.7/site-packages/theano/sandbox/cuda/__init__.py file, in the function `def use(device, force=False, default_to_move_computation_to_gpu=True, move_shared_float32_to_gpu=True, enable_cuda=True, test_driver=True):`. Specifically, it crashes at the call `gpu_init(device)`. `device` has the value `0` (from `gpu0`), and the call fails with the message: RuntimeError: "cublasCreate() returned this error 'the CUDA Runtime initialization failed'" — baskaran, Sep 25 '15 at 05:42
Does the `dual_mlp.py` code (in the GitHub repository you linked to) run without modification? Have you tried falling back to the original/official documentation on this topic (https://github.com/Theano/Theano/wiki/Using-Multiple-GPUs)? — Daniel Renshaw, Sep 25 '15 at 06:56
@Daniel, the official documentation and dual_mlp.py use the same approach: both launch sub-processes and then import `theano.sandbox.cuda` to bind to a GPU. The only difference, AFAIK, is that dual_mlp.py uses PyCUDA functions for inter-process communication, doing GPU-to-GPU transfers to avoid the latency of tunneling through host memory, whereas the official doc proposes using a multiprocessing Queue. I haven't run dual_mlp.py myself, but I was in personal communication with one of the authors, who indicated that it worked for them. I will cross-check that. — baskaran, Sep 25 '15 at 12:49

score 0 · Accepted Answer · answered Jun 23 '16 at 09:52

After digging around and running pdb, the original poster found the issue.

Theano and PyCUDA were both trying to initialize the GPU, and the two initializations conflicted. The solution is to import theano first, which acquires the GPU, and then attach to that existing context from PyCUDA instead of creating a new one with make_context(). The import section of the train function then looks like this:

def train(private_args, process_env, <some other args>):
    if process_env is not None:
       os.environ = process_env

    ####
    # import theano related
    # We need global imports and so we make them as such
    theano = __import__('theano')
    _t_tensor = __import__('theano', globals(), locals(), ['tensor'], -1)
    tensor = _t_tensor.tensor

    import theano.sandbox.cuda
    import theano.misc.pycuda_utils

    ####
    # pycuda and zmq environment
    import zmq
    import pycuda.driver as drv
    import pycuda.gpuarray as gpuarray

    drv.init()
    # Attach the existing context (already initialized by theano import statement)
    ctx = drv.Context.attach()
    sock = zmq.Context().socket(zmq.PAIR)

    if private_args['flag_client']:
        sock.connect('tcp://localhost:5000')
    else:
        sock.bind('tcp://*:5000')
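
Since the fix depends on import order, one option is to fail fast when PyCUDA's driver module is already loaded before Theano is imported. The sketch below is a heuristic only, and both function names are hypothetical. Importing `pycuda.driver` is not itself the failure (creating a context before Theano is), so treat the check as a conservative proxy rather than a diagnosis.

```python
import sys


def pycuda_loaded_early(modules=None):
    """Return True if pycuda.driver is already imported.

    Hypothetical check, meant to be called at the top of train() before
    importing theano. It detects only the import, not whether drv.init()
    or make_context() has actually been called.
    """
    modules = sys.modules if modules is None else modules
    return "pycuda.driver" in modules


def check_import_order(modules=None):
    """Raise early instead of failing later inside Theano's GPU setup."""
    if pycuda_loaded_early(modules):
        raise RuntimeError(
            "pycuda.driver was imported before theano; "
            "import theano first, then attach from PyCUDA."
        )
```

Calling `check_import_order()` as the first statement of train() would flag the ordering used in the question. Note that it would also fire in a forked child if the parent module imported `pycuda.driver` at the top level.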

[This answer was added as a community wiki entry from an edit made by the OP, in an attempt to get this question off the unanswered list.]
