
So I'm writing a neural network library using Aparapi (which generates OpenCL from Java code). Anyway, there are many situations where I need to do complex index arithmetic to find the source/destination node for a given weight during forward passes and backpropagation.

In many cases this is a very simple 1D-to-2D formula, but in some cases, such as for convolutional nets, I need a somewhat more complex operation to find the index (often something like 3D to 1D to 3D).

I have been sticking with algorithms to compute these indices. The alternative would be to simply store the source and destination indices for each weight in a constant int array. I have avoided this because it would almost double the memory usage.

I was wondering what the speed differences would be for computing indices vs reading them from a constant array? Am I losing speed in exchange for memory? Is the difference significant?
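To make the "3D to 1D to 3D" case concrete, here is a minimal plain-Java sketch of the kind of index arithmetic in question. The dimension names and sizes (`W`, `H`, `D`) are hypothetical, not from the actual library:

```java
// Sketch: typical flatten/unflatten index math for a 3D volume.
// W/H/D are hypothetical dimensions chosen for illustration.
public class IndexMath {
    static final int W = 8, H = 8, D = 4;

    // 3D -> 1D: row-major flatten of (x, y, z) into a single index.
    static int flatten(int x, int y, int z) {
        return (z * H + y) * W + x;
    }

    // 1D -> 3D: recover (x, y, z) from the flat index.
    static int[] unflatten(int i) {
        int x = i % W;
        int y = (i / W) % H;
        int z = i / (W * H);
        return new int[] { x, y, z };
    }

    public static void main(String[] args) {
        int i = flatten(3, 5, 2); // (2*8+5)*8+3 = 171
        int[] c = unflatten(i);
        System.out.println(i + " -> " + c[0] + "," + c[1] + "," + c[2]);
    }
}
```

On a GPU this is a handful of integer multiplies, adds, and modulos per work-item, which is the cost being weighed against one extra global memory load per weight.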

  • Have you tried to benchmark it? Memory access speeds and latencies vary a lot from device to device, especially between GPU's and CPU's, so it really depends on your hardware, your specific algorithm, and so on. – Thomas Aug 18 '13 at 05:47
  • If indices are like a[gid*3+x*15], striding can stop using some memory banks; computing indices wouldn't be a problem compared to that. – huseyin tugrul buyukisik Aug 18 '13 at 05:51
  • @Thomas: I haven't tested it with a constant index buffer, but I am trying to design the software to be rather portable, so I'm more interested in general performance than performance on my hardware. – technotheist Aug 18 '13 at 06:04
  • Although, that being said, I am designing it for use on GPU/APU – technotheist Aug 18 '13 at 06:09
  • If I was to use an index buffer, it would have to look something like this: `float[] nodes; float[] weights; int inputsPerNode; int[] weightSrc; fwd() { float sum = 0; int i0 = gid() * inputsPerNode; for(int i = 0; i < inputsPerNode; i++) { sum += weights[i0 + i] * nodes[weightSrc[i0 + i]]; } nodes[gid()] = f(sum); } – technotheist Aug 18 '13 at 06:15
  • Aparapi has some support for local memory using (@Local int[] foo) or naming the buffer int[] foo_local. Also there is experimental support for Constant memory, but your GPU will dictate how much of each you have to play with. Observe the limits, they are dramatically enforced ;) The NBody sample code has an example using local memory. The Mandel example uses constant for the palette (I think). – gfrost Aug 19 '13 at 20:00
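The forward pass sketched in the comments can be fleshed out as a plain-Java sequential version (hypothetical sigmoid activation; in Aparapi the loop body would live inside `Kernel.run()` with `gid` supplied by `getGlobalId()`):

```java
// Plain-Java version of the index-buffer forward pass from the comments.
// weightSrc holds the precomputed source-node index for each weight.
public class ForwardPass {
    // Hypothetical activation function; the original just calls f(sum).
    static float f(float x) { return 1.0f / (1.0f + (float) Math.exp(-x)); }

    static void fwd(float[] in, float[] out, float[] weights,
                    int[] weightSrc, int inputsPerNode) {
        for (int gid = 0; gid < out.length; gid++) { // one "work-item" per node
            float sum = 0f;
            int i0 = gid * inputsPerNode;
            for (int i = 0; i < inputsPerNode; i++)
                sum += weights[i0 + i] * in[weightSrc[i0 + i]];
            out[gid] = f(sum);
        }
    }
}
```

Note that each iteration here does two global loads (`weights` and `in`) plus a third for `weightSrc`; the computed-index variant would replace that third load with integer arithmetic.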

1 Answer


Computation is almost always faster on the GPU than a global memory access that accomplishes the same thing (like a look-up table). In particular, because the GPU keeps so many work-items "in flight", the arithmetic for one batch happens while others are waiting on their memory I/O, so it is effectively hidden. If your index math is not too complex, prefer to compute it rather than burn a global memory access.
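The trade-off can be made concrete with a host-side sketch (hypothetical names) of the two options: computing each source index on the fly versus precomputing the whole `weightSrc` table, which costs `nodes * inputsPerNode` extra ints of storage and one extra global load per weight on the device:

```java
// Sketch of the two indexing strategies for a fully connected layer.
public class IndexLookup {
    // Option 1: compute the source index arithmetically per weight.
    static int computedSrc(int gid, int k, int inputsPerNode) {
        return gid * inputsPerNode + k;
    }

    // Option 2: precompute a table once on the host; the kernel then
    // reads weightSrc[i] instead of doing the arithmetic.
    static int[] buildTable(int nodes, int inputsPerNode) {
        int[] weightSrc = new int[nodes * inputsPerNode];
        for (int gid = 0; gid < nodes; gid++)
            for (int k = 0; k < inputsPerNode; k++)
                weightSrc[gid * inputsPerNode + k] =
                        computedSrc(gid, k, inputsPerNode);
        return weightSrc;
    }
}
```

For an index formula this cheap, the arithmetic version wins on most GPUs; the table only starts to pay off if the mapping is expensive enough (or irregular enough) that it cannot be expressed as a short formula at all.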

Dithermaster