In my kernel I compare two large int[,] arrays, lemmaA and lemmaB, with each other. They are allocated on the GPU with gpu.Allocate(). My kernel looks like this:
    private static void Kernel(int[,] lemmaA, int[,] lemmaB, int[] result, int L, int x)
    {
        // Grid-stride loop: each thread handles rows start, start + stride, ...
        var start = blockIdx.x * blockDim.x + threadIdx.x;
        var stride = gridDim.x * blockDim.x;
        for (var i = start; i < L; i += stride)
        {
            result[i] = Calculate(lemmaA, lemmaB, x, i);
        }
    }

    public static int Calculate(int[,] lemma1, int[,] lemma2, int x, int i)
    {
        int result = 0;
        // Compare row x of lemma1 with row i of lemma2, element by element.
        for (int z = 0; z < 40; z++)
        {
            int c1 = lemma1[x, z];
            int c2 = lemma2[i, z];
            result += DoSomething(c1, c2);
        }
        return result;
    }
In the Calculate method I only use a single row (an int[]) of each int[,] array, and I am wondering whether I could get faster execution if I copied each row into a local array and did the calculation on the local arrays.
But how can I copy a row (an int[]) out of the int[,] inside the kernel?
    private static void Kernel(int[,] lemmaA, int[,] lemmaB, int[] result, int L, int x)
    {
        var start = blockIdx.x * blockDim.x + threadIdx.x;
        var stride = gridDim.x * blockDim.x;
        for (var i = start; i < L; i += stride)
        {
            int[] lemma1 = __local__.Array<int>(40);
            COPY(lemma1, lemmaA, a,b,c,d); // <- What to do here ??
            // lemma2 would be filled from row i of lemmaB in the same way.
            result[i] = Calculate(lemma1, lemma2);
        }
    }

    public static int Calculate(int[] lemma1, int[] lemma2)
    {}
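To make the COPY placeholder concrete: the simplest candidate is a plain element-by-element loop over the row. The sketch below is ordinary CPU-side C#, not Alea GPU kernel code. The helper name CopyRow and the small test array are hypothetical, and it says nothing about whether such a copy would actually be faster on the GPU.

```csharp
using System;

public static class RowCopySketch
{
    // Hypothetical helper: copies row `row` of `source` into `dest`,
    // one element at a time. Assumes dest.Length <= number of columns.
    public static void CopyRow(int[,] source, int row, int[] dest)
    {
        for (int z = 0; z < dest.Length; z++)
        {
            dest[z] = source[row, z];
        }
    }

    public static void Main()
    {
        // Small stand-in for lemmaA (the real arrays have 40 columns).
        var lemmaA = new int[2, 4] { { 1, 2, 3, 4 }, { 5, 6, 7, 8 } };
        var row = new int[4];

        CopyRow(lemmaA, 1, row);
        Console.WriteLine(string.Join(",", row)); // prints 5,6,7,8
    }
}
```

Inside the kernel the same loop would presumably write into the array returned by __local__.Array<int>(40), but that part is an untested assumption.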