
If I run the following program, and then run it again after swapping i and j in sum += arr[i][j], the execution times are very different: 9.8 seconds compared to 2.7 seconds before the swap. I just cannot understand why. Can someone give me an idea of what causes this?

#include <iostream>
#include <ctime>
using namespace std;

int main()
{
    long sum = 0;
    int size = 1024;
    clock_t start, end;
    double msecs;
    start = clock();

    // Allocate a size x size array, one row at a time (zero-initialized,
    // so the reads below are well-defined).
    int **arr = new int*[size];
    for (int i = 0; i < size; i++)
    {
        arr[i] = new int[size]();
    }

    for (int kk = 0; kk < 1000; kk++)
    {
        sum = 0;
        for (int i = 0; i < size; i++)
        {
            for (int j = 0; j < size; j++)
            {
                sum += arr[i][j]; // the line where i and j get swapped
            }
        }
    }

    end = clock();
    msecs = ((double) (end - start)) * 1000 / CLOCKS_PER_SEC;
    cout << msecs << endl << endl;

    // Free the rows and the row-pointer array.
    for (int i = 0; i < size; i++)
        delete[] arr[i];
    delete[] arr;

    return 0;
}
Ghias
    Google for "cache hits performance" (not related to your question, it is sort of an artistic group that is very good at, well, performing). – SJuan76 Apr 02 '14 at 22:48
  • Also Google for "Data Driven Development" and "optimize array" – Thomas Matthews Apr 02 '14 at 22:49
  • Another very important keyword to search for on this topic is "prefetch". x86 and x64 do that automatically; on some dumber hardware you have to do the prefetching yourself with assembly or with intrinsics. Algorithms that linearly scan one or two memory regions simply outperform algorithms that access memory in a seemingly random pattern from the viewpoint of the CPU. This is often why classic linked lists perform so badly, especially without customized allocators, compared to vector-based data structures; sometimes a hybrid of the two is the winner, depending on the operations you need. – pasztorpisti Apr 02 '14 at 23:15
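
As an aside to the prefetch comment above, a minimal sketch of manual prefetching with x86 SSE intrinsics. This is illustrative only: the function name, the 16-element lookahead distance, and the _MM_HINT_T0 hint are assumptions, and on modern x86 the hardware prefetcher already does this for a linear scan.

#include <xmmintrin.h> // _mm_prefetch

long sum_linear(const int *data, int n)
{
    long sum = 0;
    for (int i = 0; i < n; i++)
    {
        // Ask the CPU to start pulling the cache line ~16 ints ahead.
        if (i + 16 < n)
            _mm_prefetch(reinterpret_cast<const char *>(&data[i + 16]), _MM_HINT_T0);
        sum += data[i];
    }
    return sum;
}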

1 Answer


That is due to spatial locality. When your program needs some data from memory, the processor does not fetch just that one value; it reads the entire cache line containing it (typically 64 bytes), so the neighboring data comes along for free. In the next iteration, when you need the adjacent element, it is already in the cache.

In the other case, your program can't take advantage of spatial locality, because consecutive iterations do not read neighboring data.

Say your data is laid out in memory like this:

  0  1  2  3  4  5  6  7  8  9 
 10 11 12 13 14 15 16 17 18 19
 20 21 22 23 24 25 26 27 28 29

When your program needs to read, say, the data labeled 0, it fetches the entire line:

0 1 2 3 4 5 6 7 8 9

So when you then need the data labeled 1, it is already in the cache and your program runs faster.

On the contrary, if you read the data column-wise this doesn't help you: each access lands on a different cache line, so every read is a cache miss and the processor has to go back to main memory.

In short, memory reads are costly; fetching a whole line at a time is the processor's way of optimizing reads to save time.
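
For contrast, here are the two inner loops side by side (a sketch using the question's variable names; the swapped version is the one the asker describes but doesn't show):

// Fast: row-major traversal; consecutive j values walk along one cache line.
for (int i = 0; i < size; i++)
    for (int j = 0; j < size; j++)
        sum += arr[i][j];

// Slow: swapping the indices jumps a whole row (size * sizeof(int) bytes,
// and here a separate heap allocation) between consecutive reads,
// so nearly every access misses the cache.
for (int i = 0; i < size; i++)
    for (int j = 0; j < size; j++)
        sum += arr[j][i];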

brokenfoot
  • So what I understand is that, for example, arr[0][0], arr[0][1], arr[0][2] are loaded into the cache, but when the program suddenly asks for arr[1][0] it's not in the cache and hence has to be loaded from main memory, and this is what makes the program slow. Is there anything I'm missing? – Ghias Apr 02 '14 at 23:25
  • Yes, this is exactly what happens. – brokenfoot Apr 02 '14 at 23:30
  • A good reason to use a 1-D array on which you "simulate" a 2-D array with the right index arithmetic (see the sketch below). "Multi-dimensional arrays" where each row is separately dynamically allocated, yet the program never needs rows of different lengths, seem to be used way too much IMHO. – M.M Apr 03 '14 at 04:31
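
Following up on M.M's comment, a minimal sketch of simulating a 2-D array on one contiguous allocation. The row-major index formula i * size + j is the standard one; everything else (names, sizes) just mirrors the question:

#include <vector>

int main()
{
    const int size = 1024;
    std::vector<int> arr(size * size, 0); // one contiguous, zero-initialized block

    long sum = 0;
    for (int i = 0; i < size; i++)
        for (int j = 0; j < size; j++)
            sum += arr[i * size + j]; // element (i, j) in row-major order

    return 0;
}

Besides the cache-friendly layout, this also gives you a single allocation to free and no pointer chasing through a row table.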