
I am looking for a way to speed up the computation of bincount using the GPU.

Reference code in numpy:

import numpy

x_new = numpy.random.randint(0, 1000, 1000000)
%timeit numpy.bincount(x_new)
100 loops, best of 3: 2.33 ms per loop

I want to measure only the speed of the operation, not the time spent transferring the array, so I create a shared variable:

import theano
import theano.tensor as T

x = theano.shared(numpy.random.randint(0, 1000, 1000000))
theano_bincount = theano.function([], T.extra_ops.bincount(x))

This operation is of course highly parallelizable, but in practice this code runs several times slower on the GPU than the CPU version:

%timeit theano_bincount()
10 loops, best of 3: 25.7 ms per loop
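For context, the operation I want is equivalent to summing a one-hot encoding of the input, which is a highly parallel form. A small NumPy sketch of the equivalence (array size reduced for illustration):

```python
import numpy as np

x = np.random.randint(0, 1000, 100000)

# bincount(x) equals the column sums of the one-hot encoding of x
onehot = np.zeros((x.size, 1000))
onehot[np.arange(x.size), x] = 1
counts = onehot.sum(axis=0).astype(np.int64)

assert np.array_equal(counts, np.bincount(x, minlength=1000))
```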

So my questions are:

  1. What could be the reason for such low performance?
  2. Can I write a parallel version of bincount using Theano?
Alleo
  • Timing this in isolation isn't going to be particularly interesting because your GPU timings will be dominated by the cost of copying the input data from main memory to GPU memory and the outputs back again. And in fact it may be made worse by the fact that Theano GPU operations currently only support the `float32` data type. – Daniel Renshaw Dec 30 '15 at 10:54
  • @DanielRenshaw of course you're right; bincount needs to be evaluated inside a loop many, many times, on data that will already be on the GPU, with the weights recomputed on the GPU each time. Bincount is the bottleneck, so I am asking only about optimizing it. – Alleo Dec 30 '15 at 14:17
  • @DanielRenshaw and from what you say, I conclude we cannot store any data type other than float32 in the GPU's memory. Am I right? – Amir Dec 30 '15 at 17:02
  • Correct (for now). The in-development back end will support additional data types eventually. – Daniel Renshaw Dec 30 '15 at 17:03

2 Answers


I think you cannot speed up this operation on the GPU any further unless you can somehow manually tell Theano to do it in a parallelized manner, which does not seem to be possible. On the GPU, computations that are not parallelized run at the same speed as, or slower than, on the CPU.

Quote from Daniel Renshaw:

To an extent, Theano expects you to focus more on what you want computed rather than on how you want it computed. The idea is that the Theano optimizing compiler will automatically parallelize as much as possible (either on GPU or on CPU using OpenMP).

And another quote:

You need to be able to specify your computation in terms of Theano operations. If those operations can be parallelized on the GPU, they should be parallelized automatically.

Quote from Theano's webpage:

  • Indexing, dimension-shuffling and constant-time reshaping will be equally fast on GPU as on CPU.
  • Summation over rows/columns of tensors can be a little slower on the GPU than on the CPU.

I think the only thing you can do is to set the openmp flag to True in your .theanorc file.
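A minimal `.theanorc` for this (the `openmp` flag lives in the `[global]` section, per Theano's multi-core documentation):

```ini
[global]
openmp = True
```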

Anyway, I tried an idea. It does not work for now, but hopefully someone can help us make it work. If it worked, you might be able to parallelize the operation on the GPU. The code below tries to do EVERYTHING on the GPU with the CUDA API. However, two bottlenecks prevent the operation from taking place: 1) currently (as of Jan. 4th, 2016) Theano and CUDA do not support operations on any data type other than float32, and 2) T.extra_ops.bincount() only works with int data types. So that may be why Theano cannot fully parallelize the operation.

import numpy as np
import theano.tensor as T
from theano import shared, Out, function
import theano.sandbox.cuda.basic_ops as sbasic

# Store the data as float32 so it can live in GPU memory
shared_var = shared(np.random.randint(0, 1000, 1000000).astype(T.config.floatX), borrow=True)
x = T.vector('x')
# bincount requires an integer input, so cast back to int on the GPU
computeFunc = T.extra_ops.bincount(sbasic.as_cuda_ndarray_variable(T.cast(x, 'int16')))
func = function([], Out(sbasic.gpu_from_host(computeFunc), borrow=True), givens={x: shared_var})
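As a side note on what a parallel kernel would have to implement: a (weighted) bincount is an unordered scatter-add, which a CUDA kernel would express with atomic adds. A NumPy sketch of that formulation, using `np.add.at` (NumPy's unbuffered scatter-add):

```python
import numpy as np

x = np.random.randint(0, 1000, 100000)
w = np.random.rand(x.size)

# Unbuffered scatter-add: each w[i] is accumulated into bin x[i],
# even when the same bin index appears many times in x.
acc = np.zeros(1000)
np.add.at(acc, x, w)

assert np.allclose(acc, np.bincount(x, weights=w, minlength=1000))
```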

Sources

1- How do I set many elements in parallel in theano

2- http://deeplearning.net/software/theano/tutorial/using_gpu.html#what-can-be-accelerated-on-the-gpu

3- http://deeplearning.net/software/theano/tutorial/multi_cores.html

Amir
  • Hi, Amir. I tested openmp - bincount does not get any speed-up with threads (while other operations do). I wonder if bincount is just a wrapper around numpy's bincount operation... – Alleo Dec 30 '15 at 14:23
  • @Alleo So I think the answer to your question is that there is no way to parallelize such functions until either those bottlenecks are worked around or the function can be parallelized on its own. – Amir Dec 30 '15 at 15:11

Try the CuPy library and use cupy.bincount() instead. That may give you a better speed-up.