The `max` and `exp` operations are fundamentally different; `exp` (and other operations like addition, `sin`, etc.) is an elementwise operation that is embarrassingly parallelizable, while `max` is a reduction, which requires a parallel algorithm that basically builds up a tree of pairwise comparisons over the array. It's not impossible to speed up `max`, but it's not as easy as `exp`.
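To make the tree-of-pairwise-comparisons idea concrete, here is a minimal numpy sketch (my own illustration, not theano's actual implementation) that reduces an array by repeated pairwise maximums; each pass halves the array, which is the structure a parallel reduction exploits:

```python
import numpy as np

def tree_max(a):
    # Reduce by repeated pairwise maximums: each pass halves
    # the array, mimicking the comparison tree of a parallel
    # reduction (log2(n) passes instead of n-1 serial steps).
    a = np.asarray(a, dtype=float)
    while a.size > 1:
        if a.size % 2:            # pad odd-length arrays with -inf
            a = np.append(a, -np.inf)
        a = np.maximum(a[0::2], a[1::2])  # one level of the tree
    return a[0]

print(tree_max([3, 1, 4, 1, 5, 9, 2, 6]))  # 9.0
```

On a GPU each `np.maximum` pass would run as one parallel step, which is why the reduction is harder to accelerate than a purely elementwise op like `exp`.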
Anyway, the `theano` implementation of `max` basically consists of the following lines (in `theano/tensor/basic.py`):
try:
    out = max_and_argmax(x, axis)[0]
except Exception:
    out = CAReduce(scal.maximum, axis)(x)
where `max_and_argmax` is a bunch of custom code that, to my eye, implements a combined max+argmax operation using `numpy`, and `CAReduce` is a generic GPU-accelerated reduction used as a fallback (which, according to the comments, doesn't support `grad`, etc.). You could try using the fallback directly and see whether it is faster, maybe something like this:
from theano.tensor.elemwise import CAReduce
from theano.scalar import maximum

def mymax(X, axis=None):
    # the original snippet was missing the return
    return CAReduce(maximum, axis)(X)
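If you do try the fallback, it's worth checking that it produces the same values as the regular path; in plain numpy terms (an illustration for checking results, not theano code), the fallback should agree with an ordinary max reduction along the same axis:

```python
import numpy as np

# A maximum-reduction along axis 1 of a small test matrix;
# evaluating mymax(X, axis=1) on the theano side should
# produce the same values.
X = np.array([[3.0, 1.0, 4.0],
              [1.0, 5.0, 9.0]])
print(np.max(X, axis=1))  # [4. 9.]
```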