
I'm doing dense matrix multiplication on 1024x1024 matrices. I do this with loop blocking/tiling, using 64x64 tiles. I have created a highly optimized 64x64 matrix multiplication function (see the end of my question for the code).

gemm64(float *a, float *b, float *c, int stride).  

Here is the code which loops over the tiles. A 1024x1024 matrix has 16x16 tiles.

for(int i=0; i<16; i++) {
    for(int j=0; j<16; j++) {
        for(int k=0; k<16; k++) {
            gemm64(&a[64*(i*1024 + k)], &b[64*(k*1024 + j)], &c[64*(i*1024 + j)], 1024);
        }
    }
}

However, as a guess, I tried rearranging the tiles of the b matrix into a new matrix b2 (see the end of this question for the code) so that the stride of each tile is 64 instead of 1024. This effectively makes b2 an array of contiguous 64x64 matrices with stride=64.

float *b2 = (float*)_mm_malloc(1024*1024*sizeof(float), 64);
reorder(b, b2, 1024);
for(int i=0; i<16; i++) {
    for(int j=0; j<16; j++) {
        for(int k=0; k<16; k++) {
            gemm64_v2(&a[64*(i*1024 + k)], &b2[64*64*(k*16 + j)], &c[64*(i*1024 + j)], 64);
        }
    }
}
_mm_free(b2);

Notice how the offset for b changed from &b[64*(k*1024 + j)] to &b2[64*64*(k*16 + j)], and that the stride passed to the kernel changed from 1024 to 64.
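To convince myself the index arithmetic above is right, here is a small standalone check (illustrative only, not part of my benchmark; N, T and NT are just names for 1024, 64 and 16) that the packed index into b2 addresses the same element as the strided index into b:

// Standalone check (illustrative): the packed index into b2 used above
// addresses the same element as the strided index into b.
#include <cassert>
#include <vector>

int main() {
    const int N = 1024, T = 64, NT = N / T;   // matrix size, tile size, tiles per side
    std::vector<float> b(N * N), b2(N * N);
    for (int x = 0; x < N * N; x++) b[x] = (float)x;

    // Same packing as the reorder() function at the end of the question.
    for (int i = 0; i < NT; i++)
        for (int j = 0; j < NT; j++)
            for (int i2 = 0; i2 < T; i2++)
                for (int j2 = 0; j2 < T; j2++)
                    b2[T*T*(i*NT + j) + T*i2 + j2] = b[N*(T*i + i2) + T*j + j2];

    // Element (i2, j2) of tile (k, j): strided address in b vs. packed address in b2.
    for (int k = 0; k < NT; k++)
        for (int j = 0; j < NT; j++)
            for (int i2 = 0; i2 < T; i2++)
                for (int j2 = 0; j2 < T; j2++)
                    assert(b[T*(k*N + j) + N*i2 + j2] ==
                           b2[T*T*(k*NT + j) + T*i2 + j2]);
    return 0;
}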

The performance of my code jumps from less than 20% to 70% of the peak flops on my Sandy Bridge system!

Why does rearranging the tiles in the b matrix in this way make such a huge difference?

The arrays a, b, b2, and c are 64-byte aligned.

extern "C" void gemm64(float *a, float*b, float*c, int stride) {
    for(int i=0; i<64; i++) {
        row_m64x64(&a[1024*i], b, &c[1024*i], stride);
    }
}

// For one row of the tile: c[0..63] += sum over i of a[i] * b[i][0..63],
// where consecutive rows of b are 'stride' floats apart.
void row_m64x64(const float *a, const float *b, float *c, int stride) {
    __m256 tmp0, tmp1, tmp2, tmp3, tmp4, tmp5, tmp6, tmp7;
    tmp0 = _mm256_loadu_ps(&c[ 0]);
    tmp1 = _mm256_loadu_ps(&c[ 8]);
    tmp2 = _mm256_loadu_ps(&c[16]);
    tmp3 = _mm256_loadu_ps(&c[24]);
    tmp4 = _mm256_loadu_ps(&c[32]);
    tmp5 = _mm256_loadu_ps(&c[40]);
    tmp6 = _mm256_loadu_ps(&c[48]);
    tmp7 = _mm256_loadu_ps(&c[56]);

    for(int i=0; i<64; i++) {
        __m256 areg0 = _mm256_broadcast_ss(&a[i]);

        __m256 breg0 = _mm256_loadu_ps(&b[stride*i +  0]);
        tmp0 = _mm256_add_ps(_mm256_mul_ps(areg0,breg0), tmp0);    
        __m256 breg1 = _mm256_loadu_ps(&b[stride*i +  8]);
        tmp1 = _mm256_add_ps(_mm256_mul_ps(areg0,breg1), tmp1);
        __m256 breg2 = _mm256_loadu_ps(&b[stride*i + 16]);
        tmp2 = _mm256_add_ps(_mm256_mul_ps(areg0,breg2), tmp2);    
        __m256 breg3 = _mm256_loadu_ps(&b[stride*i + 24]);
        tmp3 = _mm256_add_ps(_mm256_mul_ps(areg0,breg3), tmp3);   
        __m256 breg4 = _mm256_loadu_ps(&b[stride*i + 32]);
        tmp4 = _mm256_add_ps(_mm256_mul_ps(areg0,breg4), tmp4);    
        __m256 breg5 = _mm256_loadu_ps(&b[stride*i + 40]);
        tmp5 = _mm256_add_ps(_mm256_mul_ps(areg0,breg5), tmp5);    
        __m256 breg6 = _mm256_loadu_ps(&b[stride*i + 48]);
        tmp6 = _mm256_add_ps(_mm256_mul_ps(areg0,breg6), tmp6);    
        __m256 breg7 = _mm256_loadu_ps(&b[stride*i + 56]);
        tmp7 = _mm256_add_ps(_mm256_mul_ps(areg0,breg7), tmp7);    
    }
    _mm256_storeu_ps(&c[ 0], tmp0);
    _mm256_storeu_ps(&c[ 8], tmp1);
    _mm256_storeu_ps(&c[16], tmp2);
    _mm256_storeu_ps(&c[24], tmp3);
    _mm256_storeu_ps(&c[32], tmp4);
    _mm256_storeu_ps(&c[40], tmp5);
    _mm256_storeu_ps(&c[48], tmp6);
    _mm256_storeu_ps(&c[56], tmp7);
}

The code to reorder the b matrix.

void reorder(float *b, float *b2, int stride) {
    //int k = 0;
    for(int i=0; i<16; i++) {
        for(int j=0; j<16; j++) {
            for(int i2=0; i2<64; i2++) {
                for(int j2=0; j2<64; j2++) {
                    //b2[k++] = b[1024*(64*i+i2) + 64*j + j2];
                    b2[64*64*(i*16 + j) + 64*i2+j2] = b[1024*(64*i+i2) + 64*j + j2];
                }
            }
        }
    }
}
Z boson
  • This is most likely a false sharing effect, e.g. at stride=1024 the successive matrices evict each other from the cache. – Krzysztof Kosiński Feb 11 '14 at 14:20
  • That was my first guess but I'm not sure how to prove it. I changed the stride to 1030 to try and avoid the critical stride and it did not make much of a difference. – Z boson Feb 11 '14 at 14:22
  • 3
    @KrzysztofKosiński this effect is called cache aliasing. False sharing is a different story. – Marat Dukhan Feb 11 '14 at 15:02
  • @Zboson using 64x64 matrix multiplication as a basic primitive is suboptimal. The associativity of L1 cache is at best 8, so you may not keep more than eight 1x64 rows in cache. – Marat Dukhan Feb 11 '14 at 15:06
  • @MaratDukhan, what tile size would you suggest? I chose this value purely from too many experiments. I get 70% (though not consistently) of the peak flops which I think is quite good. – Z boson Feb 11 '14 at 15:10
  • 3
    @Zboson the size depends on several parameters of microarchitecture (cache associativity, register width, register count), and there are additional variations depending on storage format (row-major vs column-major). I suggest that you read the paper from Goto and van de Geijn for details. You may look at `bli_kernel.h` in https://github.com/flame/blis/tree/master/config for good values for modern microarchitectures. – Marat Dukhan Feb 11 '14 at 20:30
  • As Marat says, false aliasing is a major risk with large strides. With a stride of 1024, every row of your block belongs to the same L1 set (because the low 12 bits of their addresses are identical). This means that even if you ignore the other source and destination matrices, you cannot possibly keep your whole tile in L1, which is critical for getting good performance. – Stephen Canon Feb 20 '14 at 13:15
  • @StephenCanon, I need to look into this again. But I tried setting the stride to (1024+16)=1040 (16 floats to stay 64-byte aligned) to avoid false aliasing and it did not make a difference. I followed the section on the critical stride in Agner Fog's C++ optimizing manual. But maybe I need to think more carefully about it. – Z boson Feb 20 '14 at 13:29
  • Setting the stride to 1040 wouldn’t fix the problem because your tile rows are 64 floats wide. The smallest offset that might improve things is 1024 + 64 = 1088. That said, copying tiles into contiguous memory is very much the right thing to do (and the usual approach), so I wouldn’t spend too much time fooling around with weird strides instead. – Stephen Canon Feb 20 '14 at 13:40
  • @StephenCanon, well I'm trying to understand the theory. Until recently I did everything by intuition but I don't really understand how everything works. The cache is the last part I feel I don't understand well enough. – Z boson Feb 20 '14 at 14:05
  • @MaratDukhan, if I cannot keep more than 8 rows of my 64x64 tiles in the cache how come I'm getting 70% of the peak flops? Wouldn't I be doing a lot worse if I only kept 8/64 rows in the cache? – Z boson Feb 20 '14 at 14:07
  • @Zboson, on which matrix size do you get 70%? You can not keep 8 rows in cache for all matrix strides (in particular, when stride is a large power of 2), but for many matrix strides you can cache more than 8 rows. There are also additional details to consider, e.g. L2 and L3 caches, prefetching, Turbo Boost (increases peak FLOPS). – Marat Dukhan Feb 21 '14 at 02:40
  • @MaratDukhan, I get 70% for 1024x1024 matrices. My peak flops is based on the frequency from Turbo Boost. I used CPU-Z (or powertop on Linux) to check the frequency when all cores are being used (because the frequency is different for one core or four cores being used). I can also look the numbers up in the documentation (assuming the system is not overclocked). – Z boson Feb 21 '14 at 07:57
  • @Zboson, apparently your implementation is very good, but you likely could gain additional performance through better caching. What is the performance of Intel MKL, OpenBLAS and BLIS on the same test case? – Marat Dukhan Feb 21 '14 at 08:18
  • @MaratDukhan, last time I checked MKL it was over 80% but less than 90% (I have not tested it in a while). Some people claim it gets better than 90% but I have not observed that yet. – Z boson Feb 21 '14 at 08:21
  • @MaratDukhan. I answered my own question. Based on the Goto paper I think the answer to my question is related to the TLB. If you are interested see my answer and let me know what you think. – Z boson Mar 19 '14 at 15:34
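
To put numbers on the set-conflict point raised in the comments above, here is a small sketch (illustrative only; it assumes a Sandy Bridge-like L1D of 32 KiB, 8-way, with 64-byte lines, i.e. 64 sets) that computes which L1 set the first cache line of each tile row maps to:

// Illustrative only: map each tile row's first cache line to an L1 set,
// assuming 32 KiB, 8-way, 64-byte lines => 64 sets.
#include <cstdio>

int main() {
    const int LINE = 64, SETS = 64, WAYS = 8;          // assumed L1D geometry
    const long strides_bytes[2] = {1024 * 4, 64 * 4};  // row stride in bytes: original vs packed

    for (long stride : strides_bytes) {
        int rows_in_set[SETS] = {0};
        for (int row = 0; row < 64; row++) {
            long addr = (long)row * stride;            // offset of the row's first float
            rows_in_set[(addr / LINE) % SETS]++;       // L1 set index of that cache line
        }
        int worst = 0;
        for (int s = 0; s < SETS; s++)
            if (rows_in_set[s] > worst) worst = rows_in_set[s];
        printf("row stride %5ld bytes: up to %2d of the 64 tile rows map to the same set "
               "(only %d ways per set)\n", stride, worst, WAYS);
    }
    return 0;
}

With the 1024-float row stride all 64 rows of a tile map to the same set, so at most 8 of them can be resident at once; with the packed 64-float stride the rows spread evenly over the sets.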

2 Answers


I think the problem lies in the innermost loop, and has to do with prefetching. You are doing the following:

  1. Load 256 consecutive bytes (4 cache lines) from c
  2. Load 256 consecutive bytes from b 64 times (a total of 256 cache lines)
  3. Save 256 consecutive bytes to c

In step 2, when stride is 64, you are effectively reading a long consecutive block of memory. When stride is 1024, you are reading memory in discontinuous steps, 256 bytes at a time.

As a result, when stride is 64, the prefetcher reads successive blocks into the cache ahead of time, and you get at most one cache miss per row. When stride is 1024, the prefetcher is confused by the discontinuous reads (you alternate between reading successive cache lines and widely separated cache lines) and there are 64 cache misses per row.
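
A small sketch to illustrate this (not code from the question's kernel; it simply replays the byte offsets of the b loads in one call to row_m64x64 and measures how contiguous they are):

// Illustrative: byte offsets of the b reads in one call to row_m64x64.
#include <cstdio>

int main() {
    const long strides[2] = {1024, 64};              // floats per row of b: original vs packed
    for (long stride : strides) {
        long last_end = 0, total = 0, jumps = 0;
        for (int i = 0; i < 64; i++) {               // the 64 iterations of the inner loop
            long begin = stride * i * 4;             // byte offset of b[stride*i + 0]
            long end   = begin + 64 * 4;             // 64 floats = 256 bytes = 4 cache lines
            if (i > 0 && begin != last_end) jumps++; // discontinuity between iterations
            last_end = end;
            total += end - begin;
        }
        printf("stride %4ld: %6ld bytes read, spread over %6ld bytes, %2ld jumps\n",
               stride, total, last_end, jumps);
    }
    return 0;
}

With stride 64 the b reads form a single contiguous 16 KiB stream; with stride 1024 the same 16 KiB is scattered in 256-byte chunks over roughly 252 KiB, with a jump between every iteration.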

Krzysztof Kosiński
  • I'm glad you understand exactly what I'm doing. I was worried the text would not be clear. I don't know how prefetching works but I assumed it would be smart enough to look at the load functions and read in cache lines based on the stride and not just consecutive cache lines. – Z boson Feb 11 '14 at 15:04
  • The problem here is that the stride is not constant. You start reading with a stride of 1 cache line, and after reading 4 lines in that fashion you read a line that's far ahead. Probably if you reordered the reads so that first all the stride*i+0 and stride*i+8 locations are read, then stride*i+16 and stride*i+24, and so on, you would have better performance, since the stride would be constant and there would be 4 misses / row. – Krzysztof Kosiński Feb 11 '14 at 15:08
  • I see what you mean. But that creates another problem. The current method only has to broadcast (read a) once per row. The method you suggest would have to reread from the a matrix four times as much as the current method. I did something like that in the past and the performance was worse. – Z boson Feb 11 '14 at 15:51
  • I tried your suggestion of reading 16-wide columns (instead of 64-wide columns) and it was no better than the default case without reordering. I even unrolled the loop twice so that I was doing four parallel sums (at least three are needed) but it made no difference. – Z boson Feb 12 '14 at 15:28
  • Hm, maybe the prefetcher doesn't look at cache lines but at loads/stores directly... in that case my suggestion won't help much. You can also try reading every stride*i+n location in a separate loop, rather than reading stride*i+n and stride*i+n+8 in each loop. – Krzysztof Kosiński Feb 13 '14 at 18:15
  • I answered my own question. I now think the answer is related to the TLB and not the prefetcher. – Z boson Mar 19 '14 at 15:33

I think I may have found my answer. I think it's related to the Translation Look-aside Buffer (TLB).

In the paper Anatomy of High-Performance Matrix Multiplication by Goto and van de Geijn, they write:

"The most significant difference between a cache miss and a TLB miss is that a cache miss does not necessarily stall the CPU... A TLB miss, by contrast, causes the CPU to stall until the TLB has been updated with the new address. In other words, prefetching can mask a cache miss but not a TLB miss."

Shortly after that, in section 4.2.3, titled "Packing", they write:

"The fundamental problem now is that A is typically a submatrix of a larger matrix, and therefore is not contiguous in memory. This in turn means that addressing it requires many more than the minimal number of TLB entries. The solution is to pack A in a contiguous work array"
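
To put rough numbers on that for this case, here is a sketch (illustrative only; it assumes 4 KiB pages and a page-aligned tile base) that counts the distinct pages a single 64x64 tile of b touches in the strided versus the packed layout:

// Illustrative: distinct 4 KiB pages touched by one 64x64 tile of b.
#include <cstdio>
#include <set>

int main() {
    const long PAGE = 4096;                      // assumed 4 KiB pages
    const long strides[2] = {1024, 64};          // floats per row: original b vs packed b2
    for (long stride : strides) {
        std::set<long> pages;
        for (int i2 = 0; i2 < 64; i2++)          // tile rows
            for (int j2 = 0; j2 < 64; j2++)      // tile columns
                pages.insert((stride * i2 + j2) * 4 / PAGE);  // page of element (i2, j2)
        printf("stride %4ld: one 64x64 tile touches %zu distinct 4 KiB pages\n",
               stride, pages.size());
    }
    return 0;
}

At stride 1024 every row of the tile lands on its own page, so one tile of b alone needs 64 TLB entries, whereas a packed tile is 16 KiB contiguous and needs only 4. If I remember the numbers correctly, Sandy Bridge's first-level DTLB holds 64 entries for 4 KiB pages, so the strided case can exhaust it with a single tile.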

Z boson
  • It was true at the time the paper was written, but nowadays Intel Ivy Bridge and Haswell (and possibly recent AMD processors as well) can handle a TLB miss without blocking execution. – Marat Dukhan Mar 19 '14 at 16:24
  • @MaratDukhan, okay, good to know. Then I'm still not 100% sure why the packing makes such a difference (I only did it because I thought it might be the reason). – Z boson Mar 19 '14 at 16:31