No performance gain with transpose of large 2d Matrix using Loop tiling

Question

Transposing global 2D Square Matrix/Array of size 1 gb with tiling approach(Cache Aware) has no performance gain in single threaded execution over Normal transpose method. Not discussing the transpose speed up using AVX,SSE(SIMD) or any other cache oblivious transpose algorithm(http://supertech.csail.mit.edu/papers/FrigoLePr12.pdf)

#include <stdio.h>
#include <sys/time.h>
#define SIZE 16384
float a[SIZE][SIZE], b[SIZE][SIZE];

void testNormalTranspose() {
int i, j, k, l;
b[0][9999] = 1.0;
for (i=0; i<SIZE; i++)
    for (j=0; j<SIZE; j++)
      a[i][j] = b[j][i];
}

void testTiledTranspose(){
    int i, j, k, l;
    b[0][9999] = 1.0;
    int blocksize = 16;
    for (i=0; i<SIZE; i+= blocksize) {
        for (j=0; j<SIZE; j+=blocksize) {
            for (int ii = i;ii <i + blocksize; ++ii) {
                for (int jj = j; jj < j + blocksize; ++jj) {
                    a[ii][jj] = b[jj][ii];
                }

            }
        }   
    }  
}

int main()
{
    struct timeval t1, t2;
    /*
      gettimeofday(&t1, NULL);
      testNormalTranspose();
      gettimeofday(&t2, NULL);
      printf("Time for the Normal transpose  is %ld milliseconds\n",
             (t2.tv_sec - t1.tv_sec)*1000 + 
             (t2.tv_usec - t1.tv_usec) / 1000);
    */
      gettimeofday(&t1, NULL);
      testTiledTranspose();
      gettimeofday(&t2, NULL);
      printf("Time for the Tiled transpose  is %ld milliseconds\n",
             (t2.tv_sec - t1.tv_sec)*1000 + 
             (t2.tv_usec - t1.tv_usec) / 1000);
      printf("%f\n", a[9999][0]);
}

If you do not like to talk about cache cohercency, what ist your question and why shold one approach be fast er than the other. — schorsch312, Sep 02 '17 at 08:28
Tiling provides spatial locality. How doesn't it help improve performance in the above method :testTiledTranspose — Dhiraj, Sep 02 '17 at 08:45
Cannot reproduce failure. All tests I do provide significant performance improvement (2.5..3.2 times better). Something else is going on. — Dúthomhas, Sep 02 '17 at 09:46
U can't run both the code together since the last run will push some of matrix in cache hence whoever runs second ll have speed up. First run NormalTranspose and comment TiledTranspose then vice versa — Dhiraj, Sep 02 '17 at 09:48
1. No gain on Ubuntu gcc with 8192 but sees 10/30% improvement for smaller size(8192 to 2048).The code doesnt compile on windows(cygwin,mingw) where its impossible to allocate 2gb in global bss. But with smaller array size, there is significant performance gain for tiling approach on windows. Trying with latest Linux subsystem(Ubuntu) on Windows10 anniversary update also showed the gain with tiling. — Dhiraj, Sep 02 '17 at 21:03

Andriy Berestovskyy · Answer 1 · 2017-09-08T17:04:56.787

Loop tiling helps in case the data is being reused. If you use an element SIZE times, you better use it SIZE times and only then proceed to a next element.

Unfortunately, transposing 2D matrix you are not reusing any elements neither of matrix a, nor b. Even more, since in the loop you mix rows and cols access (i.e. a[i][j] = b[j][i]), you will never get unit-stride memory access on both a and b arrays at the same time, but only on one of them.

So, in this case tiling is not that efficient, but still you might have some performance improvements even with "random" memory access if:

the element you are accessing now is on the same cache line with an element you were accessing previously AND
that cache line is still available.

So, to see any improvements the memory footprint of this "random" accesses must fit into the cache of your system. Basically this means you have to carefully chose the blocksize and 16 you have chosen in the example might work better on one system and worse on the other.

Here are the results from my computer for different power of 2 block sizes and SIZE 4096:

---------------------------------------------------------------
Benchmark                        Time           CPU Iterations
---------------------------------------------------------------
transpose_2d              32052765 ns   32051761 ns         21
tiled_transpose_2d/2      22246701 ns   22245867 ns         31
tiled_transpose_2d/4      16912984 ns   16912487 ns         41
tiled_transpose_2d/8      16284471 ns   16283974 ns         43
tiled_transpose_2d/16     16604652 ns   16604149 ns         42
tiled_transpose_2d/32     23661431 ns   23660226 ns         29
tiled_transpose_2d/64     32260575 ns   32259564 ns         22
tiled_transpose_2d/128    32107778 ns   32106793 ns         22
fixed_tile_transpose_2d   16735583 ns   16729876 ns         41

As you can see the version with blocksize 8 works the best for me and it almost double the performance.

Here are the results for SIZE 4131 and power of 3 block sizes:

---------------------------------------------------------------
Benchmark                        Time           CPU Iterations
---------------------------------------------------------------
transpose_2d              29875351 ns   29874381 ns         23
tiled_transpose_2d/3      30077471 ns   30076517 ns         23
tiled_transpose_2d/9      20420423 ns   20419499 ns         35
tiled_transpose_2d/27     13470242 ns   13468992 ns         51
tiled_transpose_2d/81     11318953 ns   11318646 ns         61
tiled_transpose_2d/243    10229250 ns   10228884 ns         65
fixed_tile_transpose_2d   10217339 ns   10217066 ns         67

Regarding 16384 size issue. I can not reproduce it, i.e. I still see the same gain for big matrix. Just please note, that 16384 * 16384 * sizeof(float) makes 4GB, which might expose some system issues...

Your explanation is correct, but you might want to try non powers-of-two tile sizes / matrix sizes, as matrix transpose is known to be one of the worst case for cache line conflicts. — Nonyme, Sep 08 '17 at 16:36
@Nonyme I just tried power of 3 -- the results are quite the same: tiling helps, but you must find the "sweet spot". Updated the answer... — Andriy Berestovskyy, Sep 08 '17 at 17:05
Thanks for those additional tests. It might be worth mentioning that the memory access might get predicted and cache prefetching would therefore already take care of most of the problem. — Nonyme, Sep 08 '17 at 17:39
@Nonyme yes, prefetchers are quite good nowadays... Please also note that in your example matrix `SIZE` must be a multiply of `blocksize`. Remainder handling is required otherwise... — Andriy Berestovskyy, Sep 08 '17 at 17:56

No performance gain with transpose of large 2d Matrix using Loop tiling

1 Answers1