
I'm looking into speeding up my python code, which is all matrix math, using some form of CUDA. Currently my code is using Python and Numpy, so it seems like it shouldn't be too difficult to rewrite it using something like either PyCUDA or CudaMat.

However, on my first attempt using CudaMat, I realized I had to rearrange a lot of the equations in order to keep the operations all on the GPU. This included the creation of many temporary variables so I could store the results of the operations.

I understand why this is necessary, but it turns what were once easy-to-read equations into somewhat of a mess that is difficult to inspect for correctness. Additionally, I would like to be able to easily modify the equations later on, which isn't easy in their converted form.

The package Theano manages to do this by first creating a symbolic representation of the operations, then compiling them to CUDA. However, after trying Theano out for a bit, I was frustrated by how opaque everything was. For example, just getting the actual value of myvar.shape[0] is difficult since the tree doesn't get evaluated until much later. I would also much prefer less of a framework in which my code must conform to a library that acts invisibly in the place of Numpy.

Thus, what I would really like is something much simpler. I don't want automatic differentiation (there are other packages like OpenOpt that can do that if I require it), or optimization of the tree, but just a conversion from standard Numpy notation to CudaMat/PyCUDA/somethingCUDA. In fact, I want to be able to have it evaluate to just Numpy without any CUDA code for testing.

I'm currently considering writing this myself, but before even considering such a venture, I wanted to see if anyone else knows of similar projects or a good starting place. The only other project I know of that might be close to this is SymPy, but I don't know how easy it would be to adapt to this purpose.

My current idea would be to create an array class that looks like a numpy.array class. Its only function would be to build a tree. At any time, that symbolic array class could be converted to a Numpy array class and be evaluated (there would also be a one-to-one parity). Alternatively, the array class could be traversed and have CudaMat commands generated from it. If optimizations are required, they can be done at that stage (e.g. re-ordering of operations, creation of temporary variables, etc.) without getting in the way of inspecting what's going on.
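To make the idea concrete, here is a minimal sketch of such a tree-building array class. Everything here (the class name, the op encoding, which operators are overloaded) is hypothetical; it only illustrates "record the operation now, evaluate with Numpy later":

```python
import numpy as np

class SymArray:
    """Hypothetical symbolic array: records operations instead of executing them."""
    def __init__(self, op, args):
        self.op = op      # operation name, or 'const' for a leaf node
        self.args = args  # child SymArrays, or the raw ndarray for a leaf

    @classmethod
    def array(cls, data):
        # Wrap a concrete Numpy value as a leaf of the tree
        return cls('const', np.asarray(data))

    # Overloaded operators just grow the tree rather than compute anything
    def __add__(self, other):
        return SymArray('add', [self, _lift(other)])

    def __sub__(self, other):
        return SymArray('sub', [self, _lift(other)])

    def __mul__(self, other):
        return SymArray('mul', [self, _lift(other)])

    def __neg__(self):
        return SymArray('neg', [self])

    def asNumpy(self):
        """Walk the tree and evaluate it with plain Numpy."""
        if self.op == 'const':
            return self.args
        vals = [a.asNumpy() for a in self.args]
        if self.op == 'add': return vals[0] + vals[1]
        if self.op == 'sub': return vals[0] - vals[1]
        if self.op == 'mul': return vals[0] * vals[1]
        if self.op == 'neg': return -vals[0]
        if self.op == 'dot': return np.dot(vals[0], vals[1])
        raise ValueError(self.op)

def _lift(x):
    # Promote plain values to leaves so mixed expressions work
    return x if isinstance(x, SymArray) else SymArray.array(x)

def dot(a, b):
    return SymArray('dot', [_lift(a), _lift(b)])
```

A CUDA back end would be a second traversal alongside asNumpy(), emitting CudaMat/PyCUDA calls (and allocating the temporaries) instead of calling Numpy directly.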

Any thoughts/comments/etc. on this would be greatly appreciated!

Update

A usage case (where sym is the theoretical module) might look something like the following, in which we calculate a gradient:

W = sym.array(np.random.rand(numVisible, numHidden))
delta_o = -(x - z)
delta_h = sym.dot(delta_o, W)*h*(1.0-h)
grad_W = sym.dot(X.T, delta_h)

In this case, grad_W would actually just be a tree containing the operations that needed to be done. If you wanted to evaluate the expression normally (i.e. via Numpy) you could do:

npGrad_W = grad_W.asNumpy()

which would just execute the Numpy commands that the tree represents. If on the other hand, you wanted to use CUDA, you would do:

cudaGrad_W = grad_W.asCUDA()

which would convert the tree into expressions that can be executed via CUDA (this could happen in a couple of different ways).

That way it should be trivial to: (1) test grad_W.asNumpy() == grad_W.asCUDA(), and (2) convert your pre-existing code to use CUDA.
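For reference, this is the eager Numpy evaluation that grad_W.asNumpy() would have to reproduce. The shapes, and the meaning of x, z, h, and X (input, reconstruction, hidden activations, input batch), are my assumptions; note also that the comparison in (1) would need np.allclose rather than ==, since GPU floating-point results rarely match CPU results bit-for-bit:

```python
import numpy as np

# Assumed sizes and variables, standing in for the usage case above
rng = np.random.default_rng(0)
numVisible, numHidden, numSamples = 4, 3, 5

W = rng.random((numVisible, numHidden))
X = rng.random((numSamples, numVisible))  # batch of inputs
x = X                                     # visible units
z = rng.random((numSamples, numVisible))  # reconstruction
h = rng.random((numSamples, numHidden))   # hidden activations

# The same equations as the sym version, evaluated eagerly with Numpy
delta_o = -(x - z)
delta_h = np.dot(delta_o, W) * h * (1.0 - h)
grad_W = np.dot(X.T, delta_h)             # shape (numVisible, numHidden)
```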

Abe Schneider
  • It's hard to comment without seeing example code. I'm not sure how this is different than Theano with features taken out (but I haven't used Theano myself). – Nathan Whitehead Jul 18 '11 at 19:59
  • The main differences between this proposal and Theano would be: (1) you wouldn't have to write things like f = theano.function([myvars], output=foo, update=[x:somevar]) to have your equation evaluated -- you would just write standard Numpy equations that could then be converted to CUDA, (2) in Theano you can't simply inspect a variable -- it's stored in a tree that gets optimized before you can even look at it; additionally, nothing in the tree has an actual value until it's evaluated, so it can be difficult to debug, and (3) it would use PyCUDA or CudaMat instead of compiling code separately. – Abe Schneider Jul 18 '11 at 20:39
  • @Nathan: I'm adding a usage case above. – Abe Schneider Jul 18 '11 at 20:39
  • I think writing an abstraction layer for the operations you want to perform on those arrays will help a lot. Then the implementation details would be abstracted away: using Numpy, PyCUDA, or your library is just an implementation detail. – fabrizioM Jul 21 '11 at 21:45
  • @fabrizioM: The thing is that Numpy is the abstraction level I want. I am playing around with the equations, so I can't really abstract any higher. I think the correct approach might be to simply copy the Numpy class structure but have the result either call CuBLAS or generate kernels for PyCUDA. That way I can just change 'np.array(...)' to 'other.array(...)', but otherwise have everything work the same. I've seen a few attempts at doing this, but nothing full-fledged yet. There's also numexpr, which looks interesting. – Abe Schneider Jul 22 '11 at 02:36

1 Answer


Have you looked at the GPUArray portion of PyCUDA?

http://documen.tician.de/pycuda/array.html

While I haven't used it myself, it seems like it would be what you're looking for. In particular, check out the "Single-pass Custom Expression Evaluation" section near the bottom of that page.

Eli Stevens
  • I've come across it before, though at the time it didn't strike me immediately as what I wanted. However, I'll take a second look at it. Thanks! – Abe Schneider Aug 04 '11 at 19:00