
I am just getting started with Theano and Deep Learning. I was experimenting with an example from the Theano tutorial (http://deeplearning.net/software/theano/tutorial/using_gpu.html#returning-a-handle-to-device-allocated-data). The example code is shown here:

from theano import function, config, shared, sandbox
import theano.tensor as T
import numpy
import time

vlen = 10 * 30 * 768  # 10 x #cores x # threads per core
iters = 1000

rng = numpy.random.RandomState(22)
x = shared(numpy.asarray(rng.rand(vlen), config.floatX))
f = function([], T.exp(x))
print(f.maker.fgraph.toposort())
t0 = time.time()
for i in xrange(iters):
    r = f()
t1 = time.time()
print("Looping %d times took %f seconds" % (iters, t1 - t0))
print("Result is %s" % (r,))
if numpy.any([isinstance(x.op, T.Elemwise) for x in f.maker.fgraph.toposort()]):
    print('Used the cpu')
else:
    print('Used the gpu')

I am trying to understand the expression defining 'vlen':

vlen = 10 * 30 * 768  # 10 x #cores x # threads per core

I can't find anything in the text that explains the number of GPU cores assumed in this example or why 30 was selected. Nor can I find why the value of 768 threads was used. My GPU (a GeForce 840M) has 384 cores. Can I assume that if I substitute 384 in place of 30, I will be using all 384 cores? And should the value of 768 threads remain fixed?
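
In other words, would changing that line to something like the following (using my GPU's core count) actually make use of all 384 cores?

vlen = 10 * 384 * 768  # 10 x my 384 cores x 768 threads per core (?)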

  • I'm pretty sure the intent of the comment is to suggest that the problem size to be created (`vlen`) is/should be large enough to be "interesting" on the GPU. CUDA codes, including the underpinnings of theano that use CUDA, do not normally specify the number of cores or the number of threads per core (I can only assume here that "core" = "SM", which is not the usual definition, but the only one that makes any sense). After all, `vlen` here ultimately is just a number, the length of an array. If you run the code as-is, it will use all your GPU cores. There's no magic in any of (10,30,768). – Robert Crovella Jan 04 '16 at 20:56
  • That's why I was having difficulty with the definition of 'vlen'. There doesn't seem to be any reason to express it that way. It actually seems misleading. – crayguy Jan 05 '16 at 05:55
  • Yes, and it is quite frustrating since it is part of a tutorial. – crayguy Jan 05 '16 at 06:06

1 Answer


I believe the logic is as follows. Looking at the referenced page, we see that there is mention of a GTX 275 GPU. So the GPU being used for that tutorial may have been a very old CUDA GPU from the cc1.x generation (no longer supported by CUDA 7.0 and 7.5). In the comment, the developer seems to be using the word "core" to refer to a GPU SM (multiprocessor).

There were a number of GPUs in that family that had 30 SMs (a cc1.x SM was a very different animal than a cc 2.x+ SM), including the GTX 275 (240 CUDA cores = 30 SMs * 8 cores/SM in the cc1.x generation). So the 30 number is derived from the number of SMs in the GPU being used at the time.

Furthermore, if you review old documentation for CUDA versions that supported such GPUs, you will find that cc1.0 and cc1.1 GPUs supported a max of 768 threads per multiprocessor (SM). So I believe this is where the 768 number comes from.

Finally, a good CUDA code will oversubscribe the GPU (total number of threads is more than what the GPU can instantaneously handle). So I believe the factor of 10 is just to ensure "oversubscription".
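
Putting the three factors together, the arithmetic is simply this (a sketch of the reasoning only; the variable names are mine, not anything from the tutorial):

oversubscription = 10   # roughly 10x more work than the GPU can run at once
num_sms          = 30   # multiprocessors ("cores" in the tutorial comment) on a GTX 275
threads_per_sm   = 768  # max resident threads per SM on cc1.0/1.1 hardware

vlen = oversubscription * num_sms * threads_per_sm
print(vlen)  # 230400, i.e. 10 * 30 * 768 as in the tutorial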

There is no magic to any particular number -- it is just the length of an array (vlen). The length of this array, after it flows through the Theano framework, will ultimately determine the number of threads in the CUDA kernel launch. This code isn't really a benchmark or other performance indicator. Its stated purpose is just to demonstrate that the GPU is being used.
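
If you want an analogously "interesting" problem size for your own GPU, one option (just a sketch, and it assumes PyCUDA is available -- the Theano example itself does nothing like this) is to query the device instead of hard-coding anything:

import pycuda.driver as drv

drv.init()
dev = drv.Device(0)

# number of multiprocessors (SMs) and max resident threads per SM
sm_count    = dev.get_attribute(drv.device_attribute.MULTIPROCESSOR_COUNT)
max_threads = dev.get_attribute(drv.device_attribute.MAX_THREADS_PER_MULTIPROCESSOR)

# same spirit as the tutorial: oversubscribe by roughly 10x
vlen = 10 * sm_count * max_threads
print("Using vlen = %d" % vlen)

But any sufficiently large vlen will keep every SM busy; the exact value mainly affects how long each call to f() takes.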

So I wouldn't read too much into that number. It was a casual choice by the developer that followed a certain amount of logic pertaining to the GPU at hand.

Robert Crovella