
I have been toying with an OpenCL kernel that accesses 7 global memory buffers, does some work on the values, and stores the result back to an 8th global memory buffer. As I observed, the L1 cache miss ratio (= misses / (misses + hits)) varies a lot as the input size increases, and I can't find the source of this variation. The input size here means the number of global work-items (a power of 2, and a multiple of the workgroup size). The workgroup size remains 256.
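
(For context, each run simply enqueues the kernel with a different global work size and the same local size of 256. The snippet below is only an illustrative host-side sketch, not the actual test harness: buffer creation, profiler setup and error handling are omitted, and the helper name run_once is made up.)

#include <CL/cl.h>

/* Hypothetical helper: enqueue one run with a fixed workgroup size of 256
   and a given global work size (a power of 2, multiple of 256).
   Buffer creation and profiling live outside this sketch. */
static cl_int run_once(cl_command_queue queue, cl_kernel kernel,
                       cl_mem bufs[8], size_t global_size)
{
    const size_t local_size = 256;          /* workgroup size stays 256 */
    for (cl_uint i = 0; i < 8; ++i)         /* x1..x7 inputs, y output  */
        clSetKernelArg(kernel, i, sizeof(cl_mem), &bufs[i]);

    cl_int err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                                        &global_size, &local_size,
                                        0, NULL, NULL);
    if (err != CL_SUCCESS)
        return err;
    return clFinish(queue);
}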

These are the results, showing the L1 cache miss ratio, starting from 4096 work-items (16 workgroups):

0.677125
0.55946875
0.345994792
0.054078125
0.436167969
0.431871745
0.938546224
0.959258789
0.952941406
0.955016479

The profiler says the kernel uses 18 registers per thread. Here is the code (the function TTsum() is supposed to do just a bunch of dependent transcendental operations, so I guess it has nothing to do with the caches):

float TTsum(float x1, float x2, float x3, float x4, float x5, float x6, float x7)
{
        float temp = 0;
        for (int j = 0; j < 2; j++)
                temp = temp +  x1 + (float)x2 + x3 + x4 + x5 + x6 + x7;
        temp = sqrt(temp);
        temp = exp(temp);
        temp = temp / x1;
        temp = temp / (float)x2;
        for (int j = 0; j < 20; j++) temp = sqrt(temp);
        return temp;
}

__kernel void histogram(__global float* x1,
                        __global int* x2,
                        __global float* x3,
                        __global float* x4,
                        __global float* x5,
                        __global float* x6,
                        __global float* x7,
                        __global float* y)
{
  int id = get_global_id(0);    
  for (int j = 0; j < 1000; j++)
    y[id] = TTsum(x1[id], x2[id], x3[id], x4[id], x5[id], x6[id], x7[id]);
}

Can someone explain this cache behavior? The experiments were done on a GTX 580.

Zk1001
  • +1 looks like an epic question, nice job! :) – user541686 Jul 19 '11 at 14:42
  • You should show the code of TTsum, because if its variables are not held in registers, it will also be using the L2 cache. – talonmies Jul 19 '11 at 14:52
  • I edited the question. Do you mean that L2 is used for register spilling? I thought it was L1? By the way, all the values here are for the L1 cache only. – Zk1001 Jul 19 '11 at 15:11
  • When variables are spilled from register, they wind up in "thread local" memory which is stored in SDRAM and read through the normal global memory L1/L2 cache structure. So cache misses can be generated by thread local memory access. – talonmies Jul 19 '11 at 18:09
  • I am guessing you are using random input data. I would suggest trying the kernel without the exp and sqrt calls. Exp is a function which may well be using thread local memory and won't have a constant runtime, depending on the input. You might see something rather different if the kernel is restricted to a sequence of constant-time, register-based multiply-add operations (a sketch of such a variant follows these comments). – talonmies Jul 20 '11 at 07:16
  • Ok. I will try MADs only. Are you suspecting that those transcendental functions use global memory? How could that be? – Zk1001 Jul 20 '11 at 07:38
  • Well, I removed all the exp's and sqrt's in the test, keeping only the add's (the first for loop). But the cache behaviour is similar. Not only the cache misses but also the %replays vary a lot across the inputs. – Zk1001 Jul 20 '11 at 07:57
  • Describe in detail how you perform the testing. Do they run as a single batch, or as separate runs? – Petr Abdulin Jul 26 '11 at 02:30
  • Yes, I used a shell script to execute all the runs in one batch. Are there any other details you want to know? – Zk1001 Jul 26 '11 at 04:57
  • Weird! Are the global memory buffers distinct? Does the cache miss ratio correlate with e.g. distance(x1, x2)? – Sebastian Aug 03 '11 at 13:53
  • It's quite hard to calculate histograms in CUDA. I believe the random access on y[] may very well be the cause for the behaviour you observe. Maybe read this if you haven't: http://developer.download.nvidia.com/compute/cuda/1_1/Website/projects/histogram256/doc/histogram.pdf – whoplisp Aug 03 '11 at 21:15
  • Yes, the global memory buffers are distinct. Why would the cache miss ratio correlate with distance(x1, x2)? I don't really think it does. Could you explain a little more? – Zk1001 Aug 05 '11 at 04:33
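
For completeness, a register-only, constant-time variant of TTsum along the lines suggested in the comments could look roughly like this (a sketch only: the name TTsum_mad, the loop count and the coefficients are arbitrary; mad() is the standard OpenCL C built-in):

/* Sketch of a TTsum variant restricted to register-based multiply-adds,
   with no transcendental calls; the arithmetic itself is arbitrary. */
float TTsum_mad(float x1, float x2, float x3, float x4,
                float x5, float x6, float x7)
{
    float temp = 0.0f;
    for (int j = 0; j < 20; j++) {
        temp = mad(x1, x2, temp);
        temp = mad(x3, x4, temp);
        temp = mad(x5, x6, temp);
        temp = mad(x7, 0.5f, temp);
    }
    return temp;
}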

1 Answer


It's quite hard to calculate histograms in CUDA. I believe the random access on y[] may very well be the cause for the behaviour you observe. Maybe read this if you haven't: http://developer.download.nvidia.com/compute/cuda/1_1/Website/projects/histogram256/doc/histogram.pdf
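
For reference, the usual way to tame scattered histogram updates is to privatize the histogram per workgroup in local memory and merge at the end. Below is a minimal OpenCL sketch of that general idea; it is not the question's kernel, the 256-bin layout and names are arbitrary, and the use of local atomics (OpenCL 1.1) is my choice, whereas the linked pre-atomics paper uses per-warp counters with tagging instead.

/* Sketch of a privatized 256-bin histogram: each workgroup accumulates
   into a local-memory sub-histogram, then merges it into the global one. */
__kernel void histogram256(__global const uchar* data,
                           __global uint* hist,
                           uint n)
{
    __local uint local_hist[256];
    uint lid = get_local_id(0);

    /* zero the per-workgroup sub-histogram */
    for (uint i = lid; i < 256; i += get_local_size(0))
        local_hist[i] = 0;
    barrier(CLK_LOCAL_MEM_FENCE);

    /* scattered increments go to local memory, not through L1/L2 */
    for (uint i = get_global_id(0); i < n; i += get_global_size(0))
        atomic_inc(&local_hist[data[i]]);
    barrier(CLK_LOCAL_MEM_FENCE);

    /* merge the sub-histogram into the global result */
    for (uint i = lid; i < 256; i += get_local_size(0))
        atomic_add(&hist[i], local_hist[i]);
}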

whoplisp
  • Thanks for the reference, I will read it. By the way, the access on y[] is quite regular, isn't it? And I only write to y[], I don't read it at all. I'm sorry, I might not have mentioned in the question that all the misses are read misses. – Zk1001 Aug 05 '11 at 04:36