I am trying to use the GPU with Theano. I've read this tutorial.
However, I can't get Theano to use the GPU and I don't know how to continue.
Testing machine
$ cat /etc/issue
Welcome to openSUSE 12.1 "Asparagus" - Kernel \r (\l).
$ nvidia-smi -L
GPU 0: Tesla C2075 (S/N: 0324111084577)
$ echo $LD_LIBRARY_PATH
/usr/local/cuda-5.0/lib64:[other]:/usr/local/lib:/usr/lib:/usr/local/X11/lib:[other]
$ find /usr/local/ -name cuda_runtime.h
/usr/local/cuda-5.0/include/cuda_runtime.h
$ echo $C_INCLUDE_PATH
/usr/local/cuda-5.0/include/
$ echo $CXX_INCLUDE_PATH
/usr/local/cuda-5.0/include/
$ nvidia-smi -a
NVIDIA: could not open the device file /dev/nvidiactl (Permission denied).
Failed to initialize NVML: Insufficient Permissions
$ echo $PATH
/usr/lib64/mpi/gcc/openmpi/bin:/home/mthoma/bin:/usr/local/bin:/usr/bin:/bin:/usr/bin/X11:/usr/X11R6/bin:/usr/games:/usr/lib/mit/bin:.:/home/mthoma/bin
$ ls -l /dev/nv*
crw-rw---- 1 root video 195, 0 1. Jul 09:47 /dev/nvidia0
crw-rw---- 1 root video 195, 255 1. Jul 09:47 /dev/nvidiactl
crw-r----- 1 root kmem 10, 144 1. Jul 09:46 /dev/nvram
# nvidia-smi -a
==============NVSMI LOG==============

Timestamp : Wed Jul 30 05:13:52 2014
Driver Version : 304.33

Attached GPUs : 1

GPU 0000:04:00.0
    Product Name : Tesla C2075
    Display Mode : Enabled
    Persistence Mode : Disabled
    Driver Model
        Current : N/A
        Pending : N/A
    Serial Number : 0324111084577
    GPU UUID : GPU-7ea505ef-ad46-bb24-c440-69da9b300040
    VBIOS Version : 70.10.46.00.05
    Inforom Version
        Image Version : N/A
        OEM Object : 1.1
        ECC Object : 2.0
        Power Management Object : 4.0
    PCI
        Bus : 0x04
        Device : 0x00
        Domain : 0x0000
        Device Id : 0x109610DE
        Bus Id : 0000:04:00.0
        Sub System Id : 0x091010DE
        GPU Link Info
            PCIe Generation
                Max : 2
                Current : 1
            Link Width
                Max : 16x
                Current : 16x
    Fan Speed : 30 %
    Performance State : P12
    Clocks Throttle Reasons : N/A
    Memory Usage
        Total : 5375 MB
        Used : 39 MB
        Free : 5336 MB
    Compute Mode : Default
    Utilization
        Gpu : 0 %
        Memory : 5 %
    Ecc Mode
        Current : Enabled
        Pending : Enabled
    ECC Errors
        Volatile
            Single Bit
                Device Memory : 0
                Register File : 0
                L1 Cache : 0
                L2 Cache : 0
                Texture Memory : N/A
                Total : 0
            Double Bit
                Device Memory : 0
                Register File : 0
                L1 Cache : 0
                L2 Cache : 0
                Texture Memory : N/A
                Total : 0
        Aggregate
            Single Bit
                Device Memory : 133276
                Register File : 0
                L1 Cache : 0
                L2 Cache : 0
                Texture Memory : N/A
                Total : 133276
            Double Bit
                Device Memory : 203730
                Register File : 0
                L1 Cache : 0
                L2 Cache : 0
                Texture Memory : N/A
                Total : 203730
    Temperature
        Gpu : 58 C
    Power Readings
        Power Management : Supported
        Power Draw : 33.83 W
        Power Limit : 225.00 W
        Default Power Limit : N/A
        Min Power Limit : N/A
        Max Power Limit : N/A
    Clocks
        Graphics : 50 MHz
        SM : 101 MHz
        Memory : 135 MHz
    Applications Clocks
        Graphics : N/A
        Memory : N/A
    Max Clocks
        Graphics : 573 MHz
        SM : 1147 MHz
        Memory : 1566 MHz
    Compute Processes : None
CUDA sample
Compiling and executing worked as the super user (tested with cuda/C/0_Simple/simpleMultiGPU):
# ldconfig /usr/local/cuda-5.0/lib64/
# ./simpleMultiGPU
[simpleMultiGPU] starting...
CUDA-capable device count: 1
Generating input data...
Computing with 1 GPUs...
GPU Processing time: 27.814000 (ms)
Computing with Host CPU...
Comparing GPU and Host CPU results...
GPU sum: 16777296.000000
CPU sum: 16777294.395033
Relative difference: 9.566307E-08
[simpleMultiGPU] test results...
PASSED
> exiting in 3 seconds: 3...2...1...done!
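(Side note: I currently have to run the ldconfig step as root before the sample works. I assume I could instead register the CUDA library directory permanently; a sketch of what I have in mind, where the file name cuda.conf is just something I picked:)
# echo "/usr/local/cuda-5.0/lib64" > /etc/ld.so.conf.d/cuda.conf
# ldconfig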
When I try this as a normal user, I get:
$ ./simpleMultiGPU
[simpleMultiGPU] starting...
CUDA error at simpleMultiGPU.cu:87 code=38(cudaErrorNoDevice) "cudaGetDeviceCount(&GPU_N)"
CUDA-capable device count: 0
Generating input data...
Floating point exception
How can I get CUDA to work for non-superusers?
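Since /dev/nvidia0 and /dev/nvidiactl above belong to root:video, I suspect that adding my user to the video group (and logging in again) might already be enough. Untested sketch, with the group name taken from the ls -l output above:
# usermod -a -G video mthoma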
Testing code
The following code is from "Testing Theano with GPU":
#!/usr/bin/env python
from theano import function, config, shared, sandbox
import theano.tensor as T
import numpy
import time
vlen = 10 * 30 * 768 # 10 x #cores x # threads per core
iters = 1000
rng = numpy.random.RandomState(22)
x = shared(numpy.asarray(rng.rand(vlen), config.floatX))
f = function([], T.exp(x))
print f.maker.fgraph.toposort()
t0 = time.time()
for i in xrange(iters):
    r = f()
t1 = time.time()
print 'Looping %d times took' % iters, t1 - t0, 'seconds'
print 'Result is', r
if numpy.any([isinstance(x.op, T.Elemwise) for x in f.maker.fgraph.toposort()]):
    print 'Used the cpu'
else:
    print 'Used the gpu'
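For reference, the tutorial suggests running the script with the Theano flags on the command line; this should be redundant given my .theanorc below, but this is the invocation (gpu_check.py is just a placeholder name for the script above):
$ THEANO_FLAGS=mode=FAST_RUN,device=gpu,floatX=float32 python gpu_check.py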
The error message
The complete error message is much too long to post here. A longer version is at http://pastebin.com/eT9vbk7M, but I think the relevant part is:
cc1plus: fatal error: cuda_runtime.h: No such file or directory
compilation terminated.
ERROR (theano.sandbox.cuda): Failed to compile cuda_ndarray.cu: ('nvcc return status', 1, 'for cmd', 'nvcc -shared -g -O3 -m64 -Xcompiler -DCUDA_NDARRAY_CUH=bcb411d72e41f81f3deabfc6926d9728,-D NPY_ARRAY_ENSURECOPY=NPY_ENSURECOPY,-D NPY_ARRAY_ALIGNED=NPY_ALIGNED,-D NPY_ARRAY_WRITEABLE=NPY_WRITEABLE,-D NPY_ARRAY_UPDATE_ALL=NPY_UPDATE_ALL,-D NPY_ARRAY_C_CONTIGUOUS=NPY_C_CONTIGUOUS,-D NPY_ARRAY_F_CONTIGUOUS=NPY_F_CONTIGUOUS,-fPIC -Xlinker -rpath,/home/mthoma/.theano/compiledir_Linux-3.1.10-1.16-desktop-x86_64-with-SuSE-12.1-x86_64-x86_64-2.7.2/cuda_ndarray -Xlinker -rpath,/usr/local/cuda-5.0/lib -Xlinker -rpath,/usr/local/cuda-5.0/lib64 -I/usr/local/lib/python2.7/site-packages/Theano-0.6.0rc1-py2.7.egg/theano/sandbox/cuda -I/usr/local/lib/python2.7/site-packages/numpy-1.6.2-py2.7-linux-x86_64.egg/numpy/core/include -I/usr/include/python2.7 -o /home/mthoma/.theano/compiledir_Linux-3.1.10-1.16-desktop-x86_64-with-SuSE-12.1-x86_64-x86_64-2.7.2/cuda_ndarray/cuda_ndarray.so mod.cu -L/usr/local/cuda-5.0/lib -L/usr/local/cuda-5.0/lib64 -L/usr/lib64 -lpython2.7 -lcublas -lcudart')
WARNING (theano.sandbox.cuda): CUDA is installed, but device gpu is not available
The standard output stream gives:
['nvcc', '-shared', '-g', '-O3', '-m64', '-Xcompiler', '-DCUDA_NDARRAY_CUH=bcb411d72e41f81f3deabfc6926d9728,-D NPY_ARRAY_ENSURECOPY=NPY_ENSURECOPY,-D NPY_ARRAY_ALIGNED=NPY_ALIGNED,-D NPY_ARRAY_WRITEABLE=NPY_WRITEABLE,-D NPY_ARRAY_UPDATE_ALL=NPY_UPDATE_ALL,-D NPY_ARRAY_C_CONTIGUOUS=NPY_C_CONTIGUOUS,-D NPY_ARRAY_F_CONTIGUOUS=NPY_F_CONTIGUOUS,-fPIC', '-Xlinker', '-rpath,/home/mthoma/.theano/compiledir_Linux-3.1.10-1.16-desktop-x86_64-with-SuSE-12.1-x86_64-x86_64-2.7.2/cuda_ndarray', '-Xlinker', '-rpath,/usr/local/cuda-5.0/lib', '-Xlinker', '-rpath,/usr/local/cuda-5.0/lib64', '-I/usr/local/lib/python2.7/site-packages/Theano-0.6.0rc1-py2.7.egg/theano/sandbox/cuda', '-I/usr/local/lib/python2.7/site-packages/numpy-1.6.2-py2.7-linux-x86_64.egg/numpy/core/include', '-I/usr/include/python2.7', '-o', '/home/mthoma/.theano/compiledir_Linux-3.1.10-1.16-desktop-x86_64-with-SuSE-12.1-x86_64-x86_64-2.7.2/cuda_ndarray/cuda_ndarray.so', 'mod.cu', '-L/usr/local/cuda-5.0/lib', '-L/usr/local/cuda-5.0/lib64', '-L/usr/lib64', '-lpython2.7', '-lcublas', '-lcudart']
[Elemwise{exp,no_inplace}(<TensorType(float32, vector)>)]
Looping 1000 times took 3.25972604752 seconds
Result is [ 1.23178029 1.61879337 1.52278066 ..., 2.20771813 2.29967761
1.62323284]
Used the cpu
.theanorc
$ cat .theanorc
[global]
device = gpu
floatX = float32
[cuda]
root = /usr/local/cuda-5.0
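The failing nvcc call above contains no -I/usr/local/cuda-5.0/include, although that is where find located cuda_runtime.h. If Theano's [nvcc] section accepts extra compiler flags (I have not verified this), an adjusted .theanorc might look like this untested sketch:
[global]
device = gpu
floatX = float32

[cuda]
root = /usr/local/cuda-5.0

[nvcc]
flags = -I/usr/local/cuda-5.0/include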