So I'm designing a CNN in Java and I'm down to the point where I really want to parallelize the convolution and pooling. This is my approach (rows, columns, inputLayer, convLayer, poolLayer and features have already been initialized in the constructor):
int padding = 3;
int filterSize = 2 * padding + 1;
int[] input = new int[rows * columns];
for (int r = 0; r < rows; r++)
    System.arraycopy(inputLayer[r], 0, input, r * columns, columns);
int[] filters = new int[4 * filterSize * filterSize];
for (int fl = 0; fl < 4; fl++)
    for (int fr = 0; fr < filterSize; fr++)
        System.arraycopy(features[fl][fr], 0, filters, fl * filterSize * filterSize + fr * filterSize, filterSize);
float[] conv = new float[4 * rows * columns];
float[] pool = new float[rows * columns];

Range convRange = Range.create3D(columns, rows, 4, 2, 2, 2);
Kernel convKernel = new Kernel() {
    int h = rows;
    int w = columns;
    int p = padding;
    int fs = filterSize;

    public void run() {
        int val = 0;
        int c = getGlobalId(0);
        int r = getGlobalId(1);
        int l = getGlobalId(2);
        int upper = max(0, p - r);
        int lower = min(fs, h + p - r);
        int left = max(0, p - c);
        int right = min(fs, w + p - c);
        for (int i = upper; i < lower; i++)
            for (int j = left; j < right; j++)
                val += input[(r + i - p) * w + c + j - p] * filters[l * fs * fs + i * fs + j];
        conv[l * h * w + r * w + c] = Math.round(100.00f * val / fs) / 100.00f;
    }
};
convKernel.setExplicit(true);
convKernel.put(input);
convKernel.put(conv);
convKernel.put(filters);
convKernel.execute(convRange);
convKernel.get(conv);
for (int convL = 0; convL < 4; convL++)
    for (int convR = 0; convR < rows; convR++)
        System.arraycopy(conv, convL * rows * columns + convR * columns, convLayer[convL][convR], 0, columns);

Range poolRange = Range.create3D(columns / 2, rows / 2, 4, 2, 2, 2);
Kernel poolKernel = new Kernel() {
    public void run() {
        int wt = columns;
        int ht = rows;
        float val = 0.00f;
        int c = getGlobalId(0);
        int r = getGlobalId(1);
        int l = getGlobalId(2);
        for (int i = 0; i < 2; i++)
            for (int j = 0; j < 2; j++)
                val = max(val, leakyReLU(conv[l * ht * wt + (2 * r + i) * wt + 2 * c + j]));
        pool[(l * ht * wt / 4) + (r * wt / 2) + c] = Math.round(100.00f * val) / 100.00f;
    }
};
poolKernel.setExplicit(true);
poolKernel.put(conv);
poolKernel.put(pool);
poolKernel.execute(poolRange);
poolKernel.get(pool);
for (int poolL = 0; poolL < 4; poolL++)
    for (int poolR = 0; poolR < rows / 2; poolR++)
        System.arraycopy(pool, (poolL * rows * columns / 4) + (poolR * columns / 2), poolLayer[poolL][poolR], 0, columns / 2);
Not the prettiest piece of code, but I haven't used Java in ages, let alone Aparapi.
Initially I passed the original 2D arrays to the kernels directly, but Aparapi reported that it doesn't support them and switched to native mode. Flattening everything to 1D arrays is supposed to work, but now I get this message:
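For what it's worth, the flattening itself is just row-major indexing; here is a standalone toy check (sizes made up, no Aparapi involved) of the arraycopy scheme I use above:

```java
// Standalone check of the row-major flattening used above;
// toy sizes, purely illustrative.
public class FlattenCheck {
    static int[] flatten(int[][] m, int rows, int columns) {
        int[] flat = new int[rows * columns];
        for (int r = 0; r < rows; r++)
            System.arraycopy(m[r], 0, flat, r * columns, columns);
        return flat;
    }

    public static void main(String[] args) {
        int[][] m = {{1, 2}, {3, 4}};
        int[] flat = flatten(m, 2, 2);
        // Element (r, c) ends up at index r * columns + c.
        System.out.println(flat[1 * 2 + 0] == m[1][0]); // prints true
    }
}
```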
VIII 09, 2022 9:03:02 PM com.aparapi.internal.model.MethodModel init
WARNING: Method max(FF)F does not contain a LocalVariableTable entry (source not compiled with -g) codegen will attempt to create a synthetic table based on bytecode. This is experimental!!
VIII 09, 2022 9:03:02 PM com.aparapi.internal.kernel.KernelRunner fallBackToNextDevice
WARNING: Device failed for NeuralNetwork$2, devices={NVIDIA|Intel|Java Alternative Algorithm|Java Thread Pool}: null
So it looks like poolKernel can't resolve the max(float, float) overload, and the whole thing falls back to the CPU.
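In case that max call really is the problem, the reduction is easy to express without any library call at all; this is the same leaky-ReLU + 2x2 max pooling as plain sequential Java (the 0.01f slope is an assumption on my part - my actual leakyReLU isn't shown here):

```java
// Plain sequential version of the pooling step, with max and leakyReLU
// written out inline (no library calls). The 0.01f slope is an assumed
// value; the real leakyReLU may differ.
public class PoolCheck {
    static float leakyReLU(float x) {
        return x > 0 ? x : 0.01f * x;
    }

    // 2x2 max pooling over one layer of a flattened [ht x wt] array.
    static float[] pool2x2(float[] conv, int ht, int wt) {
        float[] out = new float[(ht / 2) * (wt / 2)];
        for (int r = 0; r < ht / 2; r++)
            for (int c = 0; c < wt / 2; c++) {
                float val = 0.00f; // matches the kernel's initial value
                for (int i = 0; i < 2; i++)
                    for (int j = 0; j < 2; j++) {
                        float a = leakyReLU(conv[(2 * r + i) * wt + 2 * c + j]);
                        if (a > val) val = a; // inline max
                    }
                out[r * (wt / 2) + c] = val;
            }
        return out;
    }

    public static void main(String[] args) {
        float[] conv = {1f, 5f, 2f, 3f};
        System.out.println(pool2x2(conv, 2, 2)[0]); // prints 5.0
    }
}
```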
When debugging, I can confirm that only 12 threads are used - the number my Intel Core i7 supports. The GPU is an NVIDIA GeForce GTX 1650 with 896 CUDA cores, so that's where I would expect the work to run instead.
Also, at the end it says:
WARNING: Aparapi is running on an untested OpenCL platform version: OpenCL 3.0 CUDA 11.3.123
WARNING: Aparapi is running on an untested OpenCL platform version: OpenCL 3.0
What am I missing? P.S.: As you would imagine, I'm new to both conv nets and GPGPU. I know there's a library (cuDNN) that provides all the needed CNN functions, but I want to implement them myself to really understand how this works.
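For reference, here is a single-threaded version of the same convolution index math that can be used to sanity-check the kernel output on small inputs - same clamped window and same index arithmetic as convKernel, just without Aparapi and without the rounding:

```java
// Single-threaded reference for the convolution kernel above: same
// clamped window and same index arithmetic, minus the rounding.
public class ConvRef {
    static float[] convolve(int[] input, int[] filters,
                            int rows, int columns, int padding, int numFilters) {
        int fs = 2 * padding + 1;
        float[] conv = new float[numFilters * rows * columns];
        for (int l = 0; l < numFilters; l++)
            for (int r = 0; r < rows; r++)
                for (int c = 0; c < columns; c++) {
                    int val = 0;
                    // Clamp the filter window at the image borders.
                    int upper = Math.max(0, padding - r);
                    int lower = Math.min(fs, rows + padding - r);
                    int left = Math.max(0, padding - c);
                    int right = Math.min(fs, columns + padding - c);
                    for (int i = upper; i < lower; i++)
                        for (int j = left; j < right; j++)
                            val += input[(r + i - padding) * columns + c + j - padding]
                                 * filters[l * fs * fs + i * fs + j];
                    conv[l * rows * columns + r * columns + c] = (float) val / fs;
                }
        return conv;
    }

    public static void main(String[] args) {
        // padding 0 -> 1x1 filter, so this just scales the input by 2.
        int[] input = {1, 2, 3, 4};
        int[] filter = {2};
        System.out.println(java.util.Arrays.toString(
                convolve(input, filter, 2, 2, 0, 1))); // prints [2.0, 4.0, 6.0, 8.0]
    }
}
```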