I wrote a simple Python script (using Theano) that performs linear regression and should run on the GPU. When the script starts it prints "Using gpu device", but according to the profiler all operations are CPU-specific (Elemwise instead of GpuElemwise, no GpuFromHost, etc.).
I checked the variables and THEANO_FLAGS, and everything looks right; I cannot see the catch, especially since the Theano tutorials run correctly on the GPU with the same settings :).
Here is the code:
# linear regression
import numpy
import theano
import theano.tensor as T
input_data = numpy.matrix([[28, 1], [35, 2], [18, 1], [56, 2], [80, 3]])
output_data = numpy.matrix([1600, 2100, 1400, 2500, 3200])
TS = theano.shared(input_data, "training-set")
E = theano.shared(output_data, "expected")
W1 = theano.shared(numpy.zeros((1, 2)))
O = T.dot(TS, W1.T)
cost = T.mean(T.sqr(E - O.T))
gradient = T.grad(cost=cost, wrt=W1)
update = [[W1, W1 - gradient * 0.0001]]
train = theano.function([], cost, updates=update, allow_input_downcast=True)
for i in range(1000):
    train()
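One thing worth double-checking (a diagnostic sketch, not a definitive diagnosis): the literals above are Python ints and numpy.zeros defaults to float64, so none of the shared variables actually hold float32 data. floatX=float32 does not retroactively cast values passed to theano.shared, and the old GPU backend only places float32 shared variables on the GPU; the profile below showing <TensorType(float64, matrix)> nodes is consistent with that. This snippet only inspects the numpy dtypes involved:

```python
import numpy

# The matrices in the question are built from Python ints, so numpy
# assigns an integer dtype; theano.shared preserves that dtype and
# the resulting graph stays on the CPU in float64.
input_data = numpy.matrix([[28, 1], [35, 2], [18, 1], [56, 2], [80, 3]])
print(input_data.dtype)  # an integer dtype (e.g. int64), not float32

# Casting explicitly to float32 (matching floatX) is what the old
# GPU backend needs before theano.shared will move data to the GPU:
input_data32 = numpy.asarray(input_data, dtype=numpy.float32)
output_data32 = numpy.asarray([1600, 2100, 1400, 2500, 3200], dtype=numpy.float32)
W1_init = numpy.zeros((1, 2), dtype=numpy.float32)  # zeros() alone would give float64
print(input_data32.dtype, output_data32.dtype, W1_init.dtype)  # all float32
```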
Environment settings:
- THEANO_FLAGS=cuda.root=/usr/local/cuda,device=gpu,floatX=float32,lib.cnmem=.5,profile=True
- CUDA_LAUNCH_BLOCKING=1
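For reference, this is how I pass those settings on the command line (assuming the script is named test2.py, as in the profile's Message line):

```shell
# THEANO_FLAGS is a single comma-separated variable;
# CUDA_LAUNCH_BLOCKING is a separate environment variable.
THEANO_FLAGS='cuda.root=/usr/local/cuda,device=gpu,floatX=float32,lib.cnmem=.5,profile=True' \
CUDA_LAUNCH_BLOCKING=1 python test2.py
```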
Output:
Using gpu device 0: GeForce GT 650M (CNMeM is enabled)
Function profiling
==================
Message: /home/mw/Documents/LiClipse Workspace/theano1/test2.py:18
Time in 1000 calls to Function.__call__: 3.348637e-02s
Time in Function.fn.__call__: 2.419019e-02s (72.239%)
Time in thunks: 1.839781e-02s (54.941%)
Total compile time: 1.350801e-01s
Number of Apply nodes: 18
Theano Optimizer time: 1.101730e-01s
Theano validate time: 2.029657e-03s
Theano Linker time (includes C, CUDA code generation/compiling): 1.491690e-02s
Import time 2.320528e-03s
Time in all call to theano.grad() 8.740902e-03s
Time since theano import 0.881s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
71.7% 71.7% 0.013s 6.59e-06s Py 2000 2 theano.tensor.basic.Dot
12.3% 83.9% 0.002s 3.22e-07s C 7000 7 theano.tensor.elemwise.Elemwise
5.7% 89.6% 0.001s 3.50e-07s C 3000 3 theano.tensor.elemwise.DimShuffle
4.0% 93.6% 0.001s 3.65e-07s C 2000 2 theano.tensor.subtensor.Subtensor
3.6% 97.2% 0.001s 3.31e-07s C 2000 2 theano.compile.ops.Shape_i
1.7% 98.9% 0.000s 3.06e-07s C 1000 1 theano.tensor.opt.MakeVector
1.1% 100.0% 0.000s 2.10e-07s C 1000 1 theano.tensor.elemwise.Sum
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
71.7% 71.7% 0.013s 6.59e-06s Py 2000 2 dot
4.0% 75.6% 0.001s 3.65e-07s C 2000 2 Subtensor{int64}
3.5% 79.1% 0.001s 6.35e-07s C 1000 1 InplaceDimShuffle{1,0}
3.3% 82.4% 0.001s 6.06e-07s C 1000 1 Elemwise{mul,no_inplace}
2.4% 84.8% 0.000s 4.38e-07s C 1000 1 Shape_i{0}
2.3% 87.1% 0.000s 4.29e-07s C 1000 1 Elemwise{Composite{((i0 * i1) / i2)}}
2.3% 89.3% 0.000s 2.08e-07s C 2000 2 InplaceDimShuffle{x,x}
1.8% 91.1% 0.000s 3.25e-07s C 1000 1 Elemwise{Cast{float64}}
1.7% 92.8% 0.000s 3.06e-07s C 1000 1 MakeVector{dtype='int64'}
1.5% 94.3% 0.000s 2.78e-07s C 1000 1 Elemwise{Composite{(i0 - (i1 * i2))}}[(0, 0)]
1.4% 95.7% 0.000s 2.53e-07s C 1000 1 Elemwise{Sub}[(0, 1)]
1.2% 96.9% 0.000s 2.24e-07s C 1000 1 Shape_i{1}
1.1% 98.0% 0.000s 2.10e-07s C 1000 1 Sum{acc_dtype=float64}
1.1% 99.1% 0.000s 1.98e-07s C 1000 1 Elemwise{Sqr}[(0, 0)]
0.9% 100.0% 0.000s 1.66e-07s C 1000 1 Elemwise{Composite{((i0 / i1) / i2)}}[(0, 0)]
... (remaining 0 Ops account for 0.00%(0.00s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
37.8% 37.8% 0.007s 6.95e-06s 1000 3 dot(<TensorType(float64, matrix)>, training-set.T)
33.9% 71.7% 0.006s 6.24e-06s 1000 14 dot(Elemwise{Composite{((i0 * i1) / i2)}}.0, training-set)
3.5% 75.1% 0.001s 6.35e-07s 1000 0 InplaceDimShuffle{1,0}(training-set)
3.3% 78.4% 0.001s 6.06e-07s 1000 11 Elemwise{mul,no_inplace}(InplaceDimShuffle{x,x}.0, InplaceDimShuffle{x,x}.0)
3.0% 81.4% 0.001s 5.58e-07s 1000 8 Subtensor{int64}(Elemwise{Cast{float64}}.0, Constant{1})
2.4% 83.8% 0.000s 4.38e-07s 1000 2 Shape_i{0}(expected)
2.3% 86.2% 0.000s 4.29e-07s 1000 12 Elemwise{Composite{((i0 * i1) / i2)}}(TensorConstant{(1, 1) of -2.0}, Elemwise{Sub}[(0, 1)].0, Elemwise{mul,no_inplace}.0)
1.8% 87.9% 0.000s 3.25e-07s 1000 6 Elemwise{Cast{float64}}(MakeVector{dtype='int64'}.0)
1.7% 89.6% 0.000s 3.06e-07s 1000 4 MakeVector{dtype='int64'}(Shape_i{0}.0, Shape_i{1}.0)
1.6% 91.2% 0.000s 3.03e-07s 1000 10 InplaceDimShuffle{x,x}(Subtensor{int64}.0)
1.5% 92.7% 0.000s 2.78e-07s 1000 16 Elemwise{Composite{(i0 - (i1 * i2))}}[(0, 0)](<TensorType(float64, matrix)>, TensorConstant{(1, 1) of ..974738e-05}, dot.0)
1.4% 94.1% 0.000s 2.53e-07s 1000 5 Elemwise{Sub}[(0, 1)](expected, dot.0)
1.2% 95.3% 0.000s 2.24e-07s 1000 1 Shape_i{1}(expected)
1.1% 96.5% 0.000s 2.10e-07s 1000 15 Sum{acc_dtype=float64}(Elemwise{Sqr}[(0, 0)].0)
1.1% 97.6% 0.000s 1.98e-07s 1000 13 Elemwise{Sqr}[(0, 0)](Elemwise{Sub}[(0, 1)].0)
0.9% 98.5% 0.000s 1.72e-07s 1000 7 Subtensor{int64}(Elemwise{Cast{float64}}.0, Constant{0})
0.9% 99.4% 0.000s 1.66e-07s 1000 17 Elemwise{Composite{((i0 / i1) / i2)}}[(0, 0)](Sum{acc_dtype=float64}.0, Subtensor{int64}.0, Subtensor{int64}.0)
0.6% 100.0% 0.000s 1.13e-07s 1000 9 InplaceDimShuffle{x,x}(Subtensor{int64}.0)
... (remaining 0 Apply instances account for 0.00%(0.00s) of the runtime)