I have been toying with an OpenCL kernel that accesses 7 global memory buffers, does some computation on the values, and stores the result back to an 8th global memory buffer. As I observed, the L1 cache miss ratio (= misses / (misses + hits)) varies a lot as the input size increases, and I can't find the source of this variation. The input size here means the number of global work items (a power of 2 and a multiple of the workgroup size). The workgroup size remains 256.
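For context, each run only changes the global work size; the launch looks roughly like this (my real host code is longer, the names below are just illustrative):

/* Illustrative host-side launch: only the global work size varies between
 * runs; the workgroup (local) size is always 256. */
#include <CL/cl.h>

cl_int launch_histogram(cl_command_queue queue, cl_kernel kernel, size_t num_work_items)
{
    const size_t local_size  = 256;             /* fixed workgroup size        */
    const size_t global_size = num_work_items;  /* power of 2, multiple of 256 */

    return clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                                  &global_size, &local_size,
                                  0, NULL, NULL);
}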
These are the measured L1 cache miss ratios, starting from 4096 work items (16 workgroups) and doubling the work-item count each time:

    Work items   L1 miss ratio
         4096    0.677125
         8192    0.55946875
        16384    0.345994792
        32768    0.054078125
        65536    0.436167969
       131072    0.431871745
       262144    0.938546224
       524288    0.959258789
      1048576    0.952941406
      2097152    0.955016479
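For clarity, each ratio is just misses divided by total accesses; a minimal sketch of that computation (the counter names l1_global_load_miss / l1_global_load_hit are the Fermi profiler ones and only illustrative here):

/* Miss ratio = misses / (misses + hits), e.g. from counters such as
 * l1_global_load_miss and l1_global_load_hit (illustrative names). */
double l1_miss_ratio(unsigned long long misses, unsigned long long hits)
{
    return (double)misses / (double)(misses + hits);
}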
The profiler says it uses 18 registers per thread. Here is the code (the function TTsum() is supposed to do just a bunch of dependent transcendental operations, so I guess it has nothing to do with caches):
float TTsum(float x1, float x2, float x3, float x4, float x5, float x6, float x7)
{
    float temp = 0;
    for (int j = 0; j < 2; j++)
        temp = temp + x1 + (float)x2 + x3 + x4 + x5 + x6 + x7;
    temp = sqrt(temp);
    temp = exp(temp);
    temp = temp / x1;
    temp = temp / (float)x2;
    for (int j = 0; j < 20; j++)  /* chain of dependent sqrts */
        temp = sqrt(temp);
    return temp;
}
__kernel void histogram(__global float* x1,
                        __global int*   x2,
                        __global float* x3,
                        __global float* x4,
                        __global float* x5,
                        __global float* x6,
                        __global float* x7,
                        __global float* y)
{
    int id = get_global_id(0);

    /* Each work item reads one element from each of the 7 input buffers
       and writes one element of y, repeated 1000 times. */
    for (int j = 0; j < 1000; j++)
        y[id] = TTsum(x1[id], x2[id], x3[id], x4[id], x5[id], x6[id], x7[id]);
}
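To put the input sizes in perspective, here is a rough per-launch working-set estimate (my own back-of-the-envelope arithmetic, assuming the work-item counts from the table above): each work item touches 7 float loads plus 1 float store, i.e. 32 bytes.

/* Back-of-the-envelope working set per launch: 8 floats (7 loads + 1 store)
 * per work item, for the work-item counts listed above. */
#include <stdio.h>

int main(void)
{
    const size_t bytes_per_item = 8 * sizeof(float);   /* 32 bytes */
    for (size_t n = 4096; n <= 2097152; n *= 2)
        printf("%8zu work items -> %6zu KiB touched\n",
               n, n * bytes_per_item / 1024);
    return 0;
}

At 4096 work items this comes to 128 KiB in total.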
Can someone explain the cache behavior? The experiments were done on a GTX 580.