
I have been toying with an OpenCL kernel that accesses 7 global memory buffers, does some work on the values, and stores the result back to an 8th global memory buffer. As I observed, the L1 cache miss ratio (= misses / (misses + hits)) varies a lot as the input size increases, and I can't find the source of this variation. The input size here means the number of global work-items (a power of 2, and a multiple of the workgroup size). The workgroup size remains 256.
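
(For context, each run simply enqueues the kernel with a different global work size and the same local size of 256. The snippet below is only an illustrative host-side sketch, not the actual test harness: buffer creation, profiler setup and error handling are omitted, and the helper name run_once is made up.)

#include <CL/cl.h>

/* Hypothetical helper: enqueue one run with a fixed workgroup size of 256
   and a given global work size (a power of 2, multiple of 256).
   Buffer creation and profiling live outside this sketch. */
static cl_int run_once(cl_command_queue queue, cl_kernel kernel,
                       cl_mem bufs[8], size_t global_size)
{
    const size_t local_size = 256;          /* workgroup size stays 256 */
    for (cl_uint i = 0; i < 8; ++i)         /* x1..x7 inputs, y output  */
        clSetKernelArg(kernel, i, sizeof(cl_mem), &bufs[i]);

    cl_int err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                                        &global_size, &local_size,
                                        0, NULL, NULL);
    if (err != CL_SUCCESS)
        return err;
    return clFinish(queue);
}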

These are the results, showing the L1 cache miss ratio, starting from 4096 work-items (16 workgroups):

0.677125
0.55946875
0.345994792
0.054078125
0.436167969
0.431871745
0.938546224
0.959258789
0.952941406
0.955016479

The profiler says the kernel uses 18 registers per thread. Here is the code (the function TTsum() is supposed to do just a bunch of dependent transcendental operations, so I guess it has nothing to do with the caches):

float TTsum(float x1, float x2, float x3, float x4, float x5, float x6, float x7)
{
        float temp = 0;
        for (int j = 0; j < 2; j++)
                temp = temp +  x1 + (float)x2 + x3 + x4 + x5 + x6 + x7;
        temp = sqrt(temp);
        temp = exp(temp);
        temp = temp / x1;
        temp = temp / (float)x2;
        for (int j = 0; j < 20; j++) temp = sqrt(temp);
        return temp;
}

__kernel void histogram(__global float* x1,
                        __global int* x2,
                        __global float* x3,
                        __global float* x4,
                        __global float* x5,
                        __global float* x6,
                        __global float* x7,
                        __global float* y)
{
  int id = get_global_id(0);    
  for (int j = 0; j < 1000; j++)
    y[id] = TTsum(x1[id], x2[id], x3[id], x4[id], x5[id], x6[id], x7[id]);
}

Can someone explain this cache behavior? The experiments were done on a GTX 580.

Zk1001
  • +1 looks like an epic question, nice job! :) – user541686 Jul 19 '11 at 14:42
  • You should show the code of TTsum, because if its variables are not held in registers, it will also be using the L2 cache. – talonmies Jul 19 '11 at 14:52
  • I edited the question. Do you mean that L2 is used for register spilling? I thought it was L1? By the way, all the values here are for the L1 cache only. – Zk1001 Jul 19 '11 at 15:11
  • When variables are spilled from register, they wind up in "thread local" memory which is stored in SDRAM and read through the normal global memory L1/L2 cache structure. So cache misses can be generated by thread local memory access. – talonmies Jul 19 '11 at 18:09
  • I am guessing you are using random input data. I would suggest trying the kernel without the exp and sqrt calls. Exp is a function which may well be using thread local memory and won't have a constant runtime, depending on the input. You might see something rather different if the kernel is restricted to a sequence of constant-time, register-based multiply-add operations (a sketch of such a variant follows these comments). – talonmies Jul 20 '11 at 07:16
  • Ok. I will try MADs only. Are you suspecting that those transcendental functions use global memory? How could that be? – Zk1001 Jul 20 '11 at 07:38
  • Well, I removed all the exp's and sqrt's in the test, keeping only the add's (the first for loop). But the cache behaviour is similar. Not only the cache misses but also the %replays vary a lot across the inputs. – Zk1001 Jul 20 '11 at 07:57
  • Describe in detail how you perform the testing. Do they run as a single batch, or as separate runs? – Petr Abdulin Jul 26 '11 at 02:30
  • Yes, I used a shell script to execute all the runs in one batch. Are there any other details you want to know? – Zk1001 Jul 26 '11 at 04:57
  • Weird! Are the global memory buffers distinct? Does the cache miss ratio correlate with e.g. distance(x1, x2)? – Sebastian Aug 03 '11 at 13:53
  • It's quite hard to calculate histograms in CUDA. I believe the random access on y[] may very well be the cause for the behaviour you observe. Maybe read this if you haven't: http://developer.download.nvidia.com/compute/cuda/1_1/Website/projects/histogram256/doc/histogram.pdf – whoplisp Aug 03 '11 at 21:15
  • Yes, the global memory buffers are distinct. Why would the cache miss ratio correlate with distance(x1, x2)? I don't really think it does. Could you explain a little more? – Zk1001 Aug 05 '11 at 04:33
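
For completeness, a register-only, constant-time variant of TTsum along the lines suggested in the comments could look roughly like this (a sketch only: the name TTsum_mad, the loop count and the coefficients are arbitrary; mad() is the standard OpenCL C built-in):

/* Sketch of a TTsum variant restricted to register-based multiply-adds,
   with no transcendental calls; the arithmetic itself is arbitrary. */
float TTsum_mad(float x1, float x2, float x3, float x4,
                float x5, float x6, float x7)
{
    float temp = 0.0f;
    for (int j = 0; j < 20; j++) {
        temp = mad(x1, x2, temp);
        temp = mad(x3, x4, temp);
        temp = mad(x5, x6, temp);
        temp = mad(x7, 0.5f, temp);
    }
    return temp;
}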

1 Answer


It's quite hard to calculate histograms in CUDA. I believe the random access on y[] may very well be the cause for the behaviour you observe. Maybe read this if you haven't: http://developer.download.nvidia.com/compute/cuda/1_1/Website/projects/histogram256/doc/histogram.pdf
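
For reference, the usual way to tame scattered histogram updates is to privatize the histogram per workgroup in local memory and merge at the end. Below is a minimal OpenCL sketch of that general idea; it is not the question's kernel, the 256-bin layout and names are arbitrary, and the use of local atomics (OpenCL 1.1) is my choice, whereas the linked pre-atomics paper uses per-warp counters with tagging instead.

/* Sketch of a privatized 256-bin histogram: each workgroup accumulates
   into a local-memory sub-histogram, then merges it into the global one. */
__kernel void histogram256(__global const uchar* data,
                           __global uint* hist,
                           uint n)
{
    __local uint local_hist[256];
    uint lid = get_local_id(0);

    /* zero the per-workgroup sub-histogram */
    for (uint i = lid; i < 256; i += get_local_size(0))
        local_hist[i] = 0;
    barrier(CLK_LOCAL_MEM_FENCE);

    /* scattered increments go to local memory, not through L1/L2 */
    for (uint i = get_global_id(0); i < n; i += get_global_size(0))
        atomic_inc(&local_hist[data[i]]);
    barrier(CLK_LOCAL_MEM_FENCE);

    /* merge the sub-histogram into the global result */
    for (uint i = lid; i < 256; i += get_local_size(0))
        atomic_add(&hist[i], local_hist[i]);
}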

whoplisp
  • Thanks for the reference, I will read it. By the way, the access on y[] is quite regular, isn't it? And I only write to y[], I don't read it at all. I'm sorry, I might not have mentioned in the question that all the misses are read misses. – Zk1001 Aug 05 '11 at 04:36