6

I have to find the diagonal difference of a matrix represented as a 2D array, and the function prototype is

int diagonal_diff(int x[512][512])

I have to use a 2D array, and the data is 512x512. This is tested on a SPARC machine: my current timing is 6ms but I need to be under 2ms.

Sample data:

[3][4][5][9]
[2][8][9][4]
[6][9][7][3]
[5][8][8][2]

The difference is:

|4-2| + |5-6| + |9-5| + |9-9| + |4-8| + |3-8| = 2 + 1 + 4 + 0 + 4 + 5 = 16

In order to do that, I use the following algorithm:

int i, j, result = 0;
for (i = 0; i < 4; i++)
    for (j = 0; j < 4; j++)
        result += abs(array[i][j] - array[j][i]);

return result;

But this algorithm keeps alternating between row and column accesses, which makes inefficient use of the cache.

Is there a way to improve my function?

Christoper Hans
  • 625
  • 4
  • 19
  • 3
    Did you benchmark or profile this? How big are the real matrices? Any 4 by 4 matrix will fit in the cache and it's irrelevant what order you access the items in. – André Caron Oct 12 '11 at 02:46
  • Even if you do this 50,000,000 times per second, not even a low-end modern CPU will break a sweat. Even the function call to `abs()` will get optimized away as an intrinsic by most compilers (including GCC and VC++). – Jonathan Grynspan Oct 12 '11 at 02:49
  • the size of the array is 512x512 and I have to use a 2D array. The interface specification is fixed, I just have to fill in the implementation: `int diagonal_diff(int x[512][512], int y[512][512])` – Christoper Hans Oct 12 '11 at 03:12
  • the function is tested on a SPARC and my current timing is 6ms, I need to be under 2ms – Christoper Hans Oct 12 '11 at 03:16
  • Are aggressive compiler optimizations -- e.g. `-O6` -- being used? – Oct 12 '11 at 03:27
  • I can only use `-O`, which is what is written in the specification – Christoper Hans Oct 12 '11 at 03:31

3 Answers

7

EDIT: Why is a block-oriented approach faster? We take advantage of the CPU's data cache by ensuring that each block fits entirely into the cache, so we can iterate over it by row or by column without evicting data we still need.

For example, if you have a 32-byte cache line and a 4-byte int, an 8x8 int matrix fits into 8 cache lines. Assuming you have a big enough data cache, you can iterate over that matrix either by row or by column and be guaranteed that you do not thrash the cache. Another way to think about it: if your matrix fits in the cache, you can traverse it any way you want.

If you have a matrix that is much bigger, say 512x512, then you need to tune your traversal so that you don't thrash the cache. For example, if you traverse the matrix in the opposite order of its layout, you will miss the cache on almost every element you visit.
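To see that effect in isolation, here is a minimal timing sketch (not part of the assignment; it just sums a 512x512 matrix once in each order, and the exact numbers will vary by machine and compiler flags):

#include <stdio.h>
#include <time.h>

#define N 512
static int m[N][N];

int main(void)
{
    long sum = 0;
    clock_t t0, t1, t2;

    t0 = clock();
    for (int i = 0; i < N; i++)     /* row order: matches the array layout */
        for (int j = 0; j < N; j++)
            sum += m[i][j];
    t1 = clock();
    for (int j = 0; j < N; j++)     /* column order: jumps N ints per access */
        for (int i = 0; i < N; i++)
            sum += m[i][j];
    t2 = clock();

    /* printing sum keeps the compiler from discarding the loops */
    printf("row order: %ld ticks, column order: %ld ticks (sum=%ld)\n",
           (long)(t1 - t0), (long)(t2 - t1), sum);
    return 0;
}

On most machines the column-order pass is several times slower, purely because of cache misses.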

A block-oriented approach ensures that you only take a cache miss for data you will actually use, and that you are done with each cache line before the CPU has to flush it. In other words, a block-oriented approach tuned to the cache line size ensures you don't thrash the cache.

So, if you are trying to optimize for the cache line size of the machine you are running on, you can iterate over the matrix in block form and ensure you only visit each matrix element once:

#include <stdlib.h> /* abs() */

int sum_diagonal_difference(int array[512][512], int block_size)
{
    int i, j, block_i, block_j, result = 0;

    // sum the blocks above the diagonal; each block and its transposed
    // partner are small enough to stay in cache while we walk them
    for (block_i = 0; block_i < 512; block_i += block_size)
        for (block_j = block_i + block_size; block_j < 512; block_j += block_size)
            for (i = 0; i < block_size; i++)
                for (j = 0; j < block_size; j++)
                    result += abs(array[block_i + i][block_j + j] - array[block_j + j][block_i + i]);

    // the blocks below the diagonal are mirror images, so double the sum
    result += result;

    // sum the upper triangle of each block on the diagonal, doubling each
    // pair to match the double counting above
    for (int block_offset = 0; block_offset < 512; block_offset += block_size)
    {
        for (i = 0; i < block_size; ++i)
        {
            for (j = i + 1; j < block_size; ++j)
            {
                int value = abs(array[block_offset + i][block_offset + j] - array[block_offset + j][block_offset + i]);
                result += value + value;
            }
        }
    }

    return result;
}

You should experiment with various values for block_size. On my machine, 8 led to the biggest speedup (2.5x) compared to a block_size of 1 (and ~5x compared to the original iteration over the entire matrix). The block_size should ideally be cache_line_size_in_bytes/sizeof(int).
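Since your interface is fixed to `int diagonal_diff(int x[512][512])`, a thin wrapper is enough to plug this in. The block size of 8 here is only a guess based on a 32-byte cache line with 4-byte ints; measure on the actual SPARC and adjust. Note also that, as discussed in the comments below, this function counts each pair twice, so halve the result if your specification counts each pair once:

/* Thin wrapper matching the fixed prototype from the question.
   BLOCK_SIZE = 8 assumes a 32-byte cache line and 4-byte ints;
   tune it for the real machine. */
enum { BLOCK_SIZE = 8 };

int diagonal_diff(int x[512][512])
{
    return sum_diagonal_difference(x, BLOCK_SIZE);
}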

MSN
  • 53,214
  • 7
  • 75
  • 105
  • On my particular machine, this runs 50% faster than the non-cache aware (Blastfurnace's) version with `blocksize = 8`, which is the fastest I can get. Also comes in at just under half a millisecond to execute. – Mike Bailey Oct 12 '11 at 05:35
  • this method works! thanks a lot! there were a few result errors, caused by `result += result;` and `result += value + value;`. I changed it to use a single count rather than the doubled one @MSN used, and it works perfectly. – Christoper Hans Oct 12 '11 at 09:16
  • In my tests, block_size = 16 is the fastest I can get. The method runs ~80% faster than the original one. – Christoper Hans Oct 12 '11 at 09:28
  • @ChristoperHans, are you saying there is overflow (yes) or that the result is incorrect? I tested this locally with the naive version you posted and they came out with the same value. – MSN Oct 12 '11 at 15:37
  • Mathematically, MSN's solution produces the correct result by counting each pair twice. Blastfurnace's solution actually produces a result that's exactly half of what the naive solution should output. – Mike Bailey Oct 13 '11 at 02:45
  • I tested it with this array, and the halved result (with the changes I made) is correct. There is no overflow in the method. – Christoper Hans Oct 13 '11 at 03:53
  • I still don't understand why the method I posted before is slower than @MSN's, even though we do the same amount of array computation (the `abs()` thingy). How did you make the method faster even though it accesses both rows and columns of the array? IMO, accessing both rows and columns is what slows my method down. – Christoper Hans Oct 13 '11 at 04:47
  • @ChristoperHans, it's faster because it doesn't thrash the cache. Traversing a matrix in the opposite order of its layout, especially one this big relative to the cache line size, is guaranteed to thrash the cache, i.e., you almost never take advantage of cache locality because data accesses are too far apart. Plus, the SPARC L1 data cache is probably 64k, whereas the entire 512x512 matrix is 1MB. – MSN Oct 13 '11 at 05:11
  • I still wonder why it doesn't thrash the cache. I know the algorithm uses cache locality efficiently, but what about when it accesses array[511][0] and array[0][511]? Wouldn't that thrash the cache too? Or is the algorithm still faster because, in the long run, it doesn't thrash as much as mine does? – Christoper Hans Oct 13 '11 at 11:37
  • @ChristoperHans, obviously it doesn't. Try this: take an 8x8 matrix and write out the memory accesses. Now simulate the cache behavior with 2-entry cache lines and 4 cache lines available. The naive approach will flush the cache much more than a 2x2 block-oriented approach, and cache flushes lead to cache misses, which are very expensive. – MSN Oct 13 '11 at 16:37
3

If you have a good vector/matrix library like Intel MKL, also try the vectorized approach.

Very simple in MATLAB: `result = sum(sum(abs(x - x')));`
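For reference, here is what that one-liner computes, spelled out in plain C (a hypothetical `diagonal_diff_full`, just to pin down the semantics: every off-diagonal pair contributes twice, so on the question's 4x4 sample this gives 32, double the hand-computed 16; MATLAB and MKL get their speed from optimized vector kernels, not from this loop order):

#include <stdlib.h>

/* Plain-C spelling of sum(sum(abs(x - x'))). Each off-diagonal pair
   is counted twice, once as (i,j) and once as (j,i). */
int diagonal_diff_full(int x[512][512])
{
    int result = 0;
    for (int i = 0; i < 512; i++)
        for (int j = 0; j < 512; j++)
            result += abs(x[i][j] - x[j][i]);
    return result;
}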

I reproduced Hans's method and MSN's method in MATLAB too, and the results are:

Elapsed time is 0.211480 seconds.  (Hans)

Elapsed time is 0.009172 seconds.  (MSN)

Elapsed time is 0.002193 seconds.  (Mine)
Desmond
  • 31
  • 1
1

With one minor change you can have your loops only operate on the desired indices. I just changed the j loop initialization.

int i, j, result = 0;
for (i = 0; i < 4; ++i) {
    for (j = i + 1; j < 4; ++j) {   /* visit each pair above the diagonal once */
        result += abs(array[i][j] - array[j][i]);
    }
}
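For a quick sanity check, this is how it lines up with the sample data from the question (a hypothetical test harness, not part of the assignment interface):

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    int array[4][4] = {
        {3, 4, 5, 9},
        {2, 8, 9, 4},
        {6, 9, 7, 3},
        {5, 8, 8, 2}
    };
    int i, j, result = 0;

    for (i = 0; i < 4; ++i)
        for (j = i + 1; j < 4; ++j)
            result += abs(array[i][j] - array[j][i]);

    printf("%d\n", result); /* prints 16, matching the hand computation */
    return 0;
}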
Blastfurnace
  • 18,411
  • 56
  • 55
  • 70