I think you cannot speed this operation up on the GPU any further unless you can somehow manually tell Theano to do it in a parallelized manner, which does not seem to be possible. On the GPU, computations that cannot be parallelized run at the same speed as on the CPU, or slower.
Quote from Daniel Renshaw:
To an extent, Theano expects you to focus more on what you want computed rather than on how you want it computed. The idea is that the Theano optimizing compiler will automatically parallelize as much as possible (either on GPU or on CPU using OpenMP).
And another quote:
You need to be able to specify your computation in terms of Theano operations. If those operations can be parallelized on the GPU, they should be parallelized automatically.
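For illustration, here is a minimal sketch of what "specifying your computation in terms of Theano operations" looks like (the expression itself is my own example, not from the quote); elementwise expressions like this are exactly the kind Theano can parallelize automatically:

import numpy as np
import theano
import theano.tensor as T

# A purely elementwise expression built from Theano operations;
# the optimizing compiler is free to run it in parallel on GPU or CPU.
x = T.matrix('x')
y = T.tanh(x) * 2 + 1
f = theano.function([x], y)
result = f(np.random.rand(1000, 1000).astype(theano.config.floatX))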
Quote from Theano's webpage:
- Indexing, dimension-shuffling and constant-time reshaping will be equally fast on GPU as on CPU.
- Summation over rows/columns of tensors can be a little slower on the GPU than on the CPU.
I think the only thing you can do is set the openmp flag to True in your .theanorc file.
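For reference, .theanorc is an ini-style file, and the openmp flag lives in its [global] section; a minimal version would look like this:

[global]
openmp = True

The number of OpenMP threads can then be controlled with the OMP_NUM_THREADS environment variable (see source 3 below).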
Anyway, I tried an idea. It does not work for now, but hopefully someone can help us make it work; if it did, you might be able to parallelize the operation on the GPU. The code below tries to do EVERYTHING on the GPU with the CUDA API. However, there are two bottlenecks preventing the operation from taking place: 1) currently (as of Jan. 4th, 2016) Theano and CUDA do not support operations on any data type other than float32, and 2) T.extra_ops.bincount() only works with int data types. That mismatch may be what keeps Theano from fully parallelizing the operation.
import theano.tensor as T
from theano import shared, Out, function
import numpy as np
import theano.sandbox.cuda.basic_ops as sbasic

# Keep the data on the GPU as a float32 shared variable, since the
# CUDA backend supports no other dtype.
shared_var = shared(np.random.randint(0, 1000, 1000000).astype(T.config.floatX), borrow=True)
x = T.vector('x')

# Cast to int for bincount and try to keep everything on the GPU.
# This is where it breaks: bincount needs ints, the CUDA backend needs float32.
computeFunc = T.extra_ops.bincount(sbasic.as_cuda_ndarray_variable(T.cast(x, 'int16')))
func = function([], Out(sbasic.gpu_from_host(computeFunc), borrow=True), givens={x: shared_var})
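For comparison, here is a sketch of the same bincount kept entirely on the CPU (my own example, reusing the 0-1000 data range from above); this one does work, since bincount accepts integer vectors there:

import numpy as np
import theano
import theano.tensor as T

# bincount works fine on the CPU with integer inputs.
data = np.random.randint(0, 1000, 1000000).astype('int32')
x = T.ivector('x')
f = theano.function([x], T.extra_ops.bincount(x))
counts = f(data)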
Sources
1- How do I set many elements in parallel in theano
2- http://deeplearning.net/software/theano/tutorial/using_gpu.html#what-can-be-accelerated-on-the-gpu
3- http://deeplearning.net/software/theano/tutorial/multi_cores.html