Where do the L1 data cache misses come from in blocked matrix multiplication on ARM?

Question

I am trying to optimize integer matrix multiplication by dividing the matrices into smaller blocks to get a better cache hit rate on a Raspberry Pi 3B+ (a Cortex-A53 core with a 32 KB, 4-way set-associative L1 data cache and 64-byte cache lines).

Here is the code:

#define L1_D_CACHE_SZ (32 * 1024)
size_t cache_tune_g = 32;

void mat_mul(int *A, int *B, int *C, size_t M, size_t N, size_t strideA, size_t strideB, size_t strideC) {

  for(int i = 0; i < M; i++) {
    int *Ai = A + (N + strideA) * i;
    for(int j = 0; j < M; j++) {
        int sum = 0;
        int *Bj = B + j;

        for (int k = 0; k < N; k++) {
            int *Aik = Ai + k;
            int *Bjk = Bj + (M + strideB) * k;
            sum += (*Aik) * (*Bjk);
        }

        int *Cij = C + (M + strideC) * i + j;
        *Cij = (*Cij) + sum;
    }
  }
}

// If B 'fits' into the L1 data cache, do the multiplication directly;
// otherwise divide A and B into 4 sub-matrices each and recurse.
void mat_mul_opt(int *A, int *B, int *C, size_t M, size_t N, size_t strideA, size_t strideB, size_t strideC) {
  int B_size = sizeof(int) * M * N;
  if (B_size < L1_D_CACHE_SZ/cache_tune_g) {
    mat_mul(A, B, C, M, N, strideA, strideB, strideC);
  } else {
    size_t M_sub = M / 2;
    size_t N_sub = N / 2;
    size_t strideA_sub = N_sub + strideA;
    size_t strideB_sub = M_sub + strideB;
    size_t strideC_sub = M_sub + strideC;

    int *A1 = A;
    int *A2 = A + N_sub;
    int *A3 = A + (N + strideA) * M_sub;
    int *A4 = A3 + N_sub;

    int *B1 = B;
    int *B2 = B + M_sub;
    int *B3 = B + (M + strideB) * N_sub;
    int *B4 = B3 + M_sub;

    int *C1 = C;
    int *C2 = C + M_sub;
    int *C3 = C + (M + strideC) * M_sub;
    int *C4 = C3 + M_sub;

    // Because the result in C is accumulated, the order here matters.
    mat_mul_opt(A1, B1, C1, M_sub, N_sub, strideA_sub, strideB_sub, strideC_sub);
    mat_mul_opt(A2, B3, C1, M_sub, N_sub, strideA_sub, strideB_sub, strideC_sub);

    mat_mul_opt(A1, B2, C2, M_sub, N_sub, strideA_sub, strideB_sub, strideC_sub);
    mat_mul_opt(A2, B4, C2, M_sub, N_sub, strideA_sub, strideB_sub, strideC_sub);

    mat_mul_opt(A3, B1, C3, M_sub, N_sub, strideA_sub, strideB_sub, strideC_sub);
    mat_mul_opt(A4, B3, C3, M_sub, N_sub, strideA_sub, strideB_sub, strideC_sub);

    mat_mul_opt(A3, B2, C4, M_sub, N_sub, strideA_sub, strideB_sub, strideC_sub);
    mat_mul_opt(A4, B4, C4, M_sub, N_sub, strideA_sub, strideB_sub, strideC_sub);
  }
}

And here is the perf result:

 1,244,238,488      cache-references:u                                            (87.41%)
   193,808,545      cache-misses:u            #   15.576 % of all cache refs      (87.42%)
   192,979,016      L1-dcache-load-misses:u                                       (75.14%)
 6,651,396,875      cycles:u                                                      (87.59%)
 3,499,761,427      instructions:u            #    0.53  insn per cycle           (87.62%)
   539,801,098      branches:u                                                    (87.62%)                                            
     1,632,374      armv7_cortex_a7/l2d_cache_refill/:u                                     (87.48%)

   4.847838433 seconds time elapsed

In my test, A is 1024x512 and B is 512x1024. This results in 262144 calls to the mat_mul function, and M x N is 16x8 in those final calls to mat_mul.

My own estimate of the number of cache misses is far lower than what perf reports. Here it is:

Matrix A is 16x8 and B is 8x16, so each row of B (16 * sizeof(int) = 64 bytes) fits into one L1 cache line. Both A and B should also fit into the L1 cache (16 * 8 * 2 * sizeof(int) = 1024 bytes; I assume a 32 KB L1D cache, and even taking the 4-way associativity into account, 1024 bytes should fit). So one call to mat_mul with A (16x8) and B (8x16) should incur 16 + 8 = 24 L1 cache misses, which gives 262,144 * 24 = 6,291,456 cache misses for the whole computation.

But perf reports 192,979,016 cache misses, about 30 times more than I expected.

So my question is: what is wrong with my calculation? Or where do the extra cache misses come from?

I also used perf to record where the L1 D-cache misses come from; the result is shown below. 99% of the misses are in mat_mul, and 80% of the misses in mat_mul are attributed to the innermost loop's line: sum += (*Aik) * (*Bjk);.

  1.21 │ 9c:┌─→ldr    r0, [r3], #4                                                                                                                                           
  2.84 │    │  ldr    ip, [r1], fp                                                                                                                                           
       │    │  cmp    lr, r3                                                                                                                                                 
 80.42 │    │  mla    r2, ip, r0, r2                                                                                                                                         
       │    └──bne    9c

Thanks!

[The L1 cache is configurable](https://developer.arm.com/docs/ddi0500/g/level-1-memory-system/about-the-l1-memory-system), so you should not assume it is 32 KiB. — Eric Postpischil, Oct 07 '18 at 11:42
Yes, but even in the smallest configuration (8 KB), the 1 KB of matrix data still fits in it, so the assumption should not matter much here. — zwy, Oct 07 '18 at 11:46
If the cache is 32 KiB and 4-way associative, then addresses that are multiples of 32,768/4 = 8192 bytes apart map to the same cache set. Since B is 512x1024 4-byte `int`, then each row of B is 4096 bytes. Then, in an 8x16 tile of B, rows 0, 2, 4, and 6 of that tile are spaced at intervals of 8192 bytes, so they all map to the same cache set. That is four lines, so the set is full. That means any other use of the same set by the A and C tiles will evict B data, or vice-versa. — Eric Postpischil, Oct 07 '18 at 11:48
@EricPostpischil, thanks! Your point makes sense. I have not confirmed that the L1 is 32 KB yet, but the idea is similar for the other L1 cache size configurations (8 KB, 16 KB, 32 KB, 64 KB). The problem is my input data size. I tested with a different input, 512x1008, and the cache miss rate dropped from 15% to 0.6%. — zwy, Oct 07 '18 at 14:43
