
Hello, I am trying to run a program that finds the closest pair of points using brute force with cache-blocking (tiling) techniques, like the PDF here: Caching Performance Stanford

My original code is:

float compare_points_BF(int N, point *P){
    int i, j;
    float distance = 0, min_dist = FLT_MAX;
    point *p1, *p2;
    for (i = 0; i < N-1; i++){
        for (j = i+1; j < N; j++){
            // squared distance of the current pair; p1/p2 record the closest pair
            if ((distance = (P[i].x - P[j].x) * (P[i].x - P[j].x) +
                            (P[i].y - P[j].y) * (P[i].y - P[j].y)) < min_dist){
                min_dist = distance;
                p1 = &P[i];
                p2 = &P[j];
            }
        }
    }
    return sqrt(min_dist);
}

This program gives approximately these running times:

       N        8192    16384   32768   65536    131072   262144   524288    1048576
 run 1 (sec)    0.070   0.280   1.130   5.540    18.080   72.838   295.660   1220.576
 run 2 (sec)    0.080   0.330   1.280   5.190    20.290   80.880   326.460   1318.631

The cache version of the above program is:

float compare_points_BF(register int N, register int B, point *P){
    register int i, j, ib, jb, num_blocks = (N + (B-1)) / B;
    register point *p1, *p2;
    register float distance = 0, min_dist = FLT_MAX, regx, regy;

    // break the array into N/B blocks; ib indexes the cached i block,
    // jb the strided j block. Each i block is compared with every j block
    // at or after it.
    for (i = 0; i < num_blocks; i++){
        for (j = i; j < num_blocks; j++){
            // walk the moving-frame j block and compare it with the cached i block
            for (jb = j * B; jb < ( ((j+1)*B) < N ? ((j+1)*B) : N ); jb++){
                // keep the current j point in registers
                regx = P[jb].x;
                regy = P[jb].y;
                // when the i and j blocks coincide, start at jb + 1 to avoid
                // comparing a point with itself or counting a pair twice
                for (i == j ? (ib = jb + 1) : (ib = i * B); ib < ( ((i+1)*B) < N ? ((i+1)*B) : N ); ib++){
                    // squared distance of the current pair
                    if ((distance = (P[ib].x - regx) * (P[ib].x - regx) +
                            (P[ib].y - regy) * (P[ib].y - regy)) < min_dist){
                        min_dist = distance;
                        p1 = &P[ib];
                        p2 = &P[jb];
                    }
                }
            }
        }
    }
    return sqrt(min_dist);
}

and some results:

Block_size = 256        N = 8192        Run time: 0.090 sec
Block_size = 512        N = 8192        Run time: 0.090 sec
Block_size = 1024       N = 8192        Run time: 0.090 sec
Block_size = 2048       N = 8192        Run time: 0.100 sec
Block_size = 4096       N = 8192        Run time: 0.090 sec
Block_size = 8192       N = 8192        Run time: 0.090 sec


Block_size = 256        N = 16384       Run time: 0.357 sec
Block_size = 512        N = 16384       Run time: 0.353 sec
Block_size = 1024       N = 16384       Run time: 0.360 sec
Block_size = 2048       N = 16384       Run time: 0.360 sec
Block_size = 4096       N = 16384       Run time: 0.370 sec
Block_size = 8192       N = 16384       Run time: 0.350 sec
Block_size = 16384      N = 16384       Run time: 0.350 sec

Block_size = 128        N = 32768       Run time: 1.420 sec
Block_size = 256        N = 32768       Run time: 1.420 sec
Block_size = 512        N = 32768       Run time: 1.390 sec
Block_size = 1024       N = 32768       Run time: 1.410 sec
Block_size = 2048       N = 32768       Run time: 1.430 sec
Block_size = 4096       N = 32768       Run time: 1.430 sec
Block_size = 8192       N = 32768       Run time: 1.400 sec
Block_size = 16384      N = 32768       Run time: 1.380 sec

Block_size = 256        N = 65536       Run time: 5.760 sec
Block_size = 512        N = 65536       Run time: 5.790 sec
Block_size = 1024       N = 65536       Run time: 5.720 sec
Block_size = 2048       N = 65536       Run time: 5.720 sec
Block_size = 4096       N = 65536       Run time: 5.720 sec
Block_size = 8192       N = 65536       Run time: 5.530 sec
Block_size = 16384      N = 65536       Run time: 5.550 sec

Block_size = 256        N = 131072      Run time: 22.750 sec
Block_size = 512        N = 131072      Run time: 23.130 sec
Block_size = 1024       N = 131072      Run time: 22.810 sec
Block_size = 2048       N = 131072      Run time: 22.690 sec
Block_size = 4096       N = 131072      Run time: 22.710 sec
Block_size = 8192       N = 131072      Run time: 21.970 sec
Block_size = 16384      N = 131072      Run time: 22.010 sec

Block_size = 256        N = 262144      Run time: 90.220 sec
Block_size = 512        N = 262144      Run time: 92.140 sec
Block_size = 1024       N = 262144      Run time: 91.181 sec
Block_size = 2048       N = 262144      Run time: 90.681 sec
Block_size = 4096       N = 262144      Run time: 90.760 sec
Block_size = 8192       N = 262144      Run time: 87.660 sec
Block_size = 16384      N = 262144      Run time: 87.760 sec

Block_size = 256        N = 524288      Run time: 361.151 sec
Block_size = 512        N = 524288      Run time: 379.521 sec
Block_size = 1024       N = 524288      Run time: 379.801 sec

From what we can see, the running time is slower than that of the non-blocked code. Is this due to compiler optimization? Is the code bad, or is it just that the algorithm does not perform well with tiling? I use VS 2010, compiled as a 32-bit executable. Thanks in advance!

  • How many registers do you think your CPU has? – SJuan76 Oct 10 '13 at 22:34
  • @SJuan76 My CPU is an i7 980X: 32 KB L1 data and 32 KB L1 instruction cache per core, 256 KB L2 per core, and a 12 MB inclusive L3 shared by all cores. In case you are wondering, I did try smaller block sizes and fewer variables with the `register` keyword, but still got no better performance. Could this be due to register spilling? – BugShotGG Oct 10 '13 at 22:40
  • 1
    `register` doesn't do anything on modern compilers. The compiler knows better than you and mostly ignores it. – Art Oct 10 '13 at 22:44
  • @Art So is it possible to get better performance from this code at all, or is compiler optimization beyond reach? – BugShotGG Oct 10 '13 at 22:46
  • 3
    You're using a paper written half-way between the moon landing and today for specific optimizations that were valid back then and try to apply that to modern architectures. I don't say that it all doesn't apply, but I'd be careful. As for specific advice? I have a hard time decoding your code, but the inner loop with all the weird compares and branches doesn't look healthy. I hope you know that your compiler doesn't actually compile the code while you're running, so a one-liner isn't actually faster than properly formatted and split up code. – Art Oct 10 '13 at 23:06
  • @Art If you could provide an answer I would happily accept it. I've added some more comments in the code. – BugShotGG Oct 11 '13 at 00:12
  • Might be better to use a quadtree-based approach. – Pete Oct 11 '13 at 00:18
  • 1
    [ulrich drepper on memory](http://www.akkadia.org/drepper/cpumemory.pdf) is about six years old but is still relevant to modern commodity cpus. – jim mcnamara Oct 11 '13 at 00:26

2 Answers


Tiling may be an old concept, but it's still very relevant today. In your original code, for each i you may be able to reuse most of the P[j] elements while they're still cached, but only if the inner loop's working set is small enough to fit there. The right block size is determined by which cache level you target for tiling: the L1 would give the best performance since it's the fastest, but since it's also the smallest you'd need small blocks and the tiling overhead may be too much; the L2 allows bigger tiles but at slightly reduced performance, and so on.

Note that you don't need 2-D tiling here; this is not matrix multiplication, and you're traversing the same array with both loops. You can simply tile the inner loop, since it's the one overflowing the cache. Once you've done that, the outer loop (i) can run all the way to the end against the current cached block of inner-loop elements. There's actually no point in 2-D tiling, since nothing ever reuses the elements of the outer loop (as opposed to matrix multiplication). A minimal sketch of this is shown below.
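As an illustration only (this is not code from the thread; the function name compare_points_tiled, the const qualifier, and the point layout with float x, y fields are assumptions), here's one way that 1-D tiling could look:

#include <float.h>
#include <math.h>

typedef struct { float x, y; } point;

/* Closest-pair distance, brute force, with only the j loop tiled:
 * each block of B consecutive j points stays hot in cache while
 * every preceding i point is swept against it. */
float compare_points_tiled(int N, int B, const point *P)
{
    int i, j, jb, jend, jstart;
    float dx, dy, d, min_dist = FLT_MAX;

    for (jb = 0; jb < N; jb += B){
        jend = (jb + B < N) ? jb + B : N;   /* block bound, hoisted out of the inner loops */
        for (i = 0; i < jend - 1; i++){
            /* keep j > i so each pair is visited exactly once */
            jstart = (i + 1 > jb) ? i + 1 : jb;
            for (j = jstart; j < jend; j++){
                dx = P[i].x - P[j].x;
                dy = P[i].y - P[j].y;
                d  = dx * dx + dy * dy;
                if (d < min_dist)
                    min_dist = d;
            }
        }
    }
    return sqrtf(min_dist);
}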

So, assuming a point is 64 bits (8 bytes) large, you can safely fit 512 such elements of the array in your 32 KB L1 (that's 4 KB, an eighth of it, leaving plenty of room for everything else), or 4096 elements in your 256 KB L2. You'll have to miss once for P[i] on each block when i is outside the bounds of the current j block, but that's negligible.
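In code form, one plausible way to derive the block size from those numbers (the cache size is the OP's i7 980x L1; the safety factor of 8 mirrors the estimate above and is an assumption, not a rule):

#include <stdio.h>

typedef struct { float x, y; } point;

int main(void)
{
    const int l1_bytes = 32 * 1024;                 /* OP's 32 KB L1 data cache */
    /* use ~1/8 of L1 for the tile, leaving room for other data */
    int B = (l1_bytes / 8) / (int)sizeof(point);    /* 4096 B / 8 B per point = 512 points */
    printf("block size B = %d points\n", B);
    return 0;
}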

By the way, this explanation may already be obsolete, since a sufficiently good compiler might try to do all of this for you. It's quite complicated though, so I'm a bit skeptical that any of the common ones would even try, but it should be easy to prove here that the reordering is safe. One might argue, of course, that a "sufficiently good compiler" is a paradox, but that's off topic...

Leeor

This is an interesting case. The compiler did a poor job of hoisting loop invariants out of the two inner loops. Namely, the two inner for-loops check the following conditions on every iteration:

((j+1)*B) < N ? ((j+1)*B) : N

and

((i+1)*B) < N ? ((i+1)*B) : N

The computation and the branch are both expensive, yet they are loop-invariant for the two inner for-loops. Once I manually hoisted them out of the two inner for-loops, I was able to get the cache-optimized version to perform better than the unoptimized one (by 10% when N = 524288 and by 30% when N = 1048576).

Here is the modified code (a simple change really; look for u1 and u2):

// break the array into N/B blocks; ib indexes the cached i block,
// jb the strided j block. Each i block is compared with every j block
// at or after it.
for (i = 0; i < num_blocks; i++){
    for (j = i; j < num_blocks; j++){
        // hoisted loop-invariant block bounds
        int u1 = (((j+1)*B) < N ? ((j+1)*B) : N);
        int u2 = (((i+1)*B) < N ? ((i+1)*B) : N);
        // walk the moving-frame j block against the cached i block
        for (jb = j * B; jb < u1; jb++){
            // keep the current j point in registers
            regx = P[jb].x;
            regy = P[jb].y;
            // when the i and j blocks coincide, start at jb + 1 to avoid
            // comparing a point with itself or counting a pair twice
            for (i == j ? (ib = jb + 1) : (ib = i * B); ib < u2; ib++){
                // squared distance of the current pair
                if ((distance = (P[ib].x - regx) * (P[ib].x - regx) +
                        (P[ib].y - regy) * (P[ib].y - regy)) < min_dist){
                    min_dist = distance;
                    p1 = &P[ib];
                    p2 = &P[jb];
                }
            }
        }
    }
}
X.J
  • Hmm, you've got a point here. How do you propose getting them out of the loop? Can you post your code here? – BugShotGG Oct 11 '13 at 22:52
  • Just confirmed it as well. We get slightly better performance when the loop-invariant condition is hoisted out of the loop. You could post your code, just to have a better view :) – BugShotGG Oct 11 '13 at 23:05
  • If you can think of any other modifications to get better performance, please share them with us here :) – BugShotGG Oct 11 '13 at 23:09
  • You're not going to get much (if anything) from it. Once it's hoisted out of the two inner loops, what remains is insignificant. The key to optimization is always the inner loops (besides algorithm-level changes, of course). In your case the inner-loop code is quite simple and there is not much to be done there. I'd look into either parallelization (the algorithm should be easily parallelizable) or SSE if you really need to stick to a single thread. – X.J Oct 11 '13 at 23:31
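For reference, a hedged sketch of the parallelization route the last comment suggests, using OpenMP (this is not code from the thread; compare_points_parallel is a made-up name, and the tiling is omitted for clarity). VS 2010 supports OpenMP 2.0 via the /openmp flag, which has no min reduction, so each thread keeps a private minimum that is merged in a critical section:

#include <float.h>
#include <math.h>

typedef struct { float x, y; } point;

/* Brute-force closest pair with the outer loop parallelized. */
float compare_points_parallel(int N, const point *P)
{
    float min_dist = FLT_MAX;
    #pragma omp parallel
    {
        float local_min = FLT_MAX;
        float dx, dy, d;
        int i, j;
        /* dynamic schedule: larger i values do far less work */
        #pragma omp for schedule(dynamic, 64) nowait
        for (i = 0; i < N - 1; i++){
            for (j = i + 1; j < N; j++){
                dx = P[i].x - P[j].x;
                dy = P[i].y - P[j].y;
                d  = dx * dx + dy * dy;
                if (d < local_min)
                    local_min = d;
            }
        }
        /* merge per-thread minima */
        #pragma omp critical
        if (local_min < min_dist)
            min_dist = local_min;
    }
    return sqrtf(min_dist);
}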