
So I'm writing a neural network library using Aparapi (which generates OpenCL from Java code). Anyway, there are many situations where I need to do complex index arithmetic to find the source/destination node for a given weight during forward passes and backpropagation.

In many cases this is a very simple 1D-to-2D formula, but in some cases, such as for convolutional nets, I need a somewhat more complex operation to find the index (often something like 3D to 1D to 3D).

I have been sticking with algorithms to compute these indices. The alternative would be to simply store the source and destination indices for each weight in a constant int array. I have avoided this because it would almost double the memory usage.

I was wondering what the speed differences would be for computing indices vs reading them from a constant array? Am I losing speed in exchange for memory? Is the difference significant?
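To make the "3D to 1D to 3D" case concrete, here is a minimal plain-Java sketch of the kind of index arithmetic in question. The dimension names and sizes (`W`, `H`, `D`) are hypothetical, not from the actual library:

```java
// Sketch: typical flatten/unflatten index math for a 3D volume.
// W/H/D are hypothetical dimensions chosen for illustration.
public class IndexMath {
    static final int W = 8, H = 8, D = 4;

    // 3D -> 1D: row-major flatten of (x, y, z) into a single index.
    static int flatten(int x, int y, int z) {
        return (z * H + y) * W + x;
    }

    // 1D -> 3D: recover (x, y, z) from the flat index.
    static int[] unflatten(int i) {
        int x = i % W;
        int y = (i / W) % H;
        int z = i / (W * H);
        return new int[] { x, y, z };
    }

    public static void main(String[] args) {
        int i = flatten(3, 5, 2); // (2*8+5)*8+3 = 171
        int[] c = unflatten(i);
        System.out.println(i + " -> " + c[0] + "," + c[1] + "," + c[2]);
    }
}
```

On a GPU this is a handful of integer multiplies, adds, and modulos per work-item, which is the cost being weighed against one extra global memory load per weight.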

  • Have you tried to benchmark it? Memory access speeds and latencies vary a lot from device to device, especially between GPU's and CPU's, so it really depends on your hardware, your specific algorithm, and so on. – Thomas Aug 18 '13 at 05:47
  • If indices are like a[gid*3+x*15], striding can stop using some memory banks; computing indices wouldn't be a problem compared to that. – huseyin tugrul buyukisik Aug 18 '13 at 05:51
  • @Thomas: I haven't tested it with a constant index buffer, but I am trying to design the software to be rather portable, so I'm more interested in general performance than performance on my hardware. – technotheist Aug 18 '13 at 06:04
  • Although, that being said, I am designing it for use on GPU/APU – technotheist Aug 18 '13 at 06:09
  • If I was to use an index buffer, it would have to look something like this: `float[] nodes; float[] weights; int inputsPerNode; int[] weightSrc; fwd() { float sum = 0; int i0 = gid() * inputsPerNode; for(int i = 0; i < inputsPerNode; i++) { sum += weights[i0 + i] * nodes[weightSrc[i0 + i]]; } nodes[gid()] = f(sum); } – technotheist Aug 18 '13 at 06:15
  • Aparapi has some support for local memory using (@Local int[] foo) or naming the buffer int[] foo_local. Also there is experimental support for Constant memory, but your GPU will dictate how much of each you have to play with. Observe the limits, they are dramatically enforced ;) The NBody sample code has an example using local memory. The Mandel example uses constant for the palette (I think). – gfrost Aug 19 '13 at 20:00
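The forward pass sketched in the comments can be fleshed out as a plain-Java sequential version (hypothetical sigmoid activation; in Aparapi the loop body would live inside `Kernel.run()` with `gid` supplied by `getGlobalId()`):

```java
// Plain-Java version of the index-buffer forward pass from the comments.
// weightSrc holds the precomputed source-node index for each weight.
public class ForwardPass {
    // Hypothetical activation function; the original just calls f(sum).
    static float f(float x) { return 1.0f / (1.0f + (float) Math.exp(-x)); }

    static void fwd(float[] in, float[] out, float[] weights,
                    int[] weightSrc, int inputsPerNode) {
        for (int gid = 0; gid < out.length; gid++) { // one "work-item" per node
            float sum = 0f;
            int i0 = gid * inputsPerNode;
            for (int i = 0; i < inputsPerNode; i++)
                sum += weights[i0 + i] * in[weightSrc[i0 + i]];
            out[gid] = f(sum);
        }
    }
}
```

Note that each iteration here does two global loads (`weights` and `in`) plus a third for `weightSrc`; the computed-index variant would replace that third load with integer arithmetic.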

1 Answer


Computation is almost always faster on the GPU than a global memory access that accomplishes the same thing (like a look-up table). In particular, because the GPU keeps so many work-items "in flight", the arithmetic for one batch happens while others are waiting on their memory I/O, so it is effectively hidden. If your index math is not too complex, prefer to compute it rather than burn a global memory access.
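The trade-off can be made concrete with a host-side sketch (hypothetical names) of the two options: computing each source index on the fly versus precomputing the whole `weightSrc` table, which costs `nodes * inputsPerNode` extra ints of storage and one extra global load per weight on the device:

```java
// Sketch of the two indexing strategies for a fully connected layer.
public class IndexLookup {
    // Option 1: compute the source index arithmetically per weight.
    static int computedSrc(int gid, int k, int inputsPerNode) {
        return gid * inputsPerNode + k;
    }

    // Option 2: precompute a table once on the host; the kernel then
    // reads weightSrc[i] instead of doing the arithmetic.
    static int[] buildTable(int nodes, int inputsPerNode) {
        int[] weightSrc = new int[nodes * inputsPerNode];
        for (int gid = 0; gid < nodes; gid++)
            for (int k = 0; k < inputsPerNode; k++)
                weightSrc[gid * inputsPerNode + k] =
                        computedSrc(gid, k, inputsPerNode);
        return weightSrc;
    }
}
```

For an index formula this cheap, the arithmetic version wins on most GPUs; the table only starts to pay off if the mapping is expensive enough (or irregular enough) that it cannot be expressed as a short formula at all.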

Dithermaster