4

I am writing my own GEMM subroutine for fun. I have managed to achieve a version tiled over the L1 cache with a 256-bit AVX kernel. I am aware there are some loop invariants I could be hoisting out of the code below.

template<size_t N, size_t P, size_t M>
void intrinsic_multiply(double* A, double* B, void* C) {
    // Aligned packing buffers for one PACK_SIZE x PACK_SIZE tile of A and of B.
    alignas(32) double _A[PACK_SIZE * PACK_SIZE];
    alignas(32) double _B[PACK_SIZE * PACK_SIZE];
    double* _C = (double*)(C);

    // Remainders where the dimensions are not multiples of PACK_SIZE.
    constexpr size_t N_rem = N%PACK_SIZE;
    constexpr size_t M_rem = M%PACK_SIZE;
    constexpr size_t P_rem = P%PACK_SIZE;

    // Dimensions rounded up to the next multiple of PACK_SIZE.
    constexpr size_t N_spill = N-N_rem+(PACK_SIZE)*(size_t)(N_rem!=0);
    constexpr size_t M_spill = M-M_rem+(PACK_SIZE)*(size_t)(M_rem!=0);
    constexpr size_t P_spill = P-P_rem+(PACK_SIZE)*(size_t)(P_rem!=0);

    for(size_t i = 0; i < N_spill; i += PACK_SIZE) {
        for(size_t j = 0; j < M_spill; j += PACK_SIZE) {
            for(size_t k = 0; k < P_spill; k += PACK_SIZE) {
                // Pack the current tiles of A and B, zero-padding the edge tiles.
                pad_cols<N>(A + i + k * N, _A, ((size_t)(i+PACK_SIZE>N))*N_rem, ((size_t)(k+PACK_SIZE>P))*P_rem);
                pad_cols<P>(B + k + j * P, _B, ((size_t)(k+PACK_SIZE>P))*P_rem, ((size_t)(j+PACK_SIZE>M))*M_rem);
                // Accumulate the packed tile product into the corresponding tile of C.
                macro_kernal_intrinsic<N_spill>(_A, _B, _C + i + (j * N_spill));
            }
        }
    }
}
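
For reference, a hypothetical call site (my addition for illustration, not part of the original routine): the sizes match the instantiation visible in the nm output further down, the matrices are stored column-major, and C is zero-initialized on the assumption that the macro kernel accumulates into it. It assumes the routine above and its helpers (PACK_SIZE, pad_cols, macro_kernal_intrinsic) are in scope.

#include <vector>

int main() {
    std::vector<double> A(2048 * 3136, 1.0);   // N x P, column-major
    std::vector<double> B(3136 * 2240, 1.0);   // P x M, column-major
    std::vector<double> C(2048 * 2240, 0.0);   // N x M, zero-initialized accumulator
    intrinsic_multiply<2048, 3136, 2240>(A.data(), B.data(), C.data());
}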

I am having difficulty implementing multilevel cache tiling because the cache sizes are not multiples of each other. An estimate of the stride size for each cache level is calculated as follows, with the CACHE_SIZE values given in bytes.

static constexpr size_t L3_CACHESIZE = 6291456;
static size_t L3_STRIDE_SIZE = (size_t)floor((sqrt(L3_CACHESIZE/sizeof(double))));

static constexpr size_t L2_CACHESIZE = 262144;
static size_t L2_STRIDE_SIZE = (size_t)floor((sqrt(L2_CACHESIZE/sizeof(double))));

static constexpr size_t L1_CACHESIZE = 32768;
static size_t L1_STRIDE_SIZE = (size_t)floor((sqrt(L1_CACHESIZE/sizeof(double))));

static constexpr size_t PACK_SIZE = 64;

This gives an L1 stride size of 64 and an L2 stride size of 181. Clearly these are not multiples of each other. I see a few options:

  1. Only fit four 64x64 L1 blocks (two per dimension) into each L2 iteration, i.e. indices 0 to 63 and 64 to 127, effectively shrinking the L2 stride to 128 (see the sketch after this list). This seems like I'm under-utilizing my L2 cache.
  2. Use (nearly) the whole L2 cache and zero-pad the last partial L1 block with the missing 64*3-180 = 12 elements on the final iteration. This would introduce a lot of redundant operations but only reduces the L2 stride size by 1.
  3. Prefetch a fractional block for the next iteration.
  4. There is a canonical method that I am unaware of.
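
For concreteness, option 1 as a loop nest would look roughly like the sketch below. This is my own illustration, not the code above: two_level_tiled, PACK and L2_EDGE are made-up names, the innermost scalar loops stand in for the packed AVX kernel, and C is assumed to be zero-initialized.

#include <algorithm>
#include <cstddef>

constexpr std::size_t PACK = 64;                      // L1 tile edge (PACK_SIZE above)
constexpr std::size_t L2_EDGE = (181 / PACK) * PACK;  // L2 stride 181 rounded down to 128

// Column-major C += A * B, with A of size N x P and B of size P x M, tiled twice:
// an outer level sized for L2 and an inner level sized for L1.
void two_level_tiled(const double* A, const double* B, double* C,
                     std::size_t N, std::size_t P, std::size_t M) {
    for (std::size_t i2 = 0; i2 < N; i2 += L2_EDGE)
    for (std::size_t j2 = 0; j2 < M; j2 += L2_EDGE)
    for (std::size_t k2 = 0; k2 < P; k2 += L2_EDGE)
        // Each full L2_EDGE x L2_EDGE tile decomposes into (128 / 64)^2 = 4 L1 tiles,
        // covering indices 0 to 63 and 64 to 127 in each dimension of the tile.
        for (std::size_t i1 = i2; i1 < std::min(i2 + L2_EDGE, N); i1 += PACK)
        for (std::size_t j1 = j2; j1 < std::min(j2 + L2_EDGE, M); j1 += PACK)
        for (std::size_t k1 = k2; k1 < std::min(k2 + L2_EDGE, P); k1 += PACK)
            // Stand-in for the packed AVX kernel: plain scalar accumulation.
            for (std::size_t i = i1; i < std::min(i1 + PACK, N); ++i)
            for (std::size_t j = j1; j < std::min(j1 + PACK, M); ++j)
            for (std::size_t k = k1; k < std::min(k1 + PACK, P); ++k)
                C[i + j * N] += A[i + k * N] * B[k + j * P];
}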

What's the best way to deal with block sizes that are not multiples of each other in practice?

Edit - In response to MSalters:

I have run

nm -AP -t d --demangle ./src/example-GeMM | grep "intrinsic"

which gives

./src/example-GeMM: void intrinsic_multiply<2048ul, 3136ul, 2240ul>(double*, double*, void*) W 17081 434
./src/example-GeMM: void macro_kernal_intrinsic<2048ul>(double*, double*, double*) W 20409 234
./src/example-GeMM: void micro_kernal_intrinsic<2048ul>(double*, double*, double*) W 21343 454

meaning the relevant code sections occupy (234+454+434)/262144 < 1% of the L2 cache, hence I am still motivated to dedicate a larger share of my L2 cache to data.

Ivor Denham-Dyson
    `_A` is a reserved name, that's Undefined Behavior right there. – MSalters Feb 02 '21 at 14:06
  • You won't be able to go past a certain limit with "classical" (row or column major) orderings. All "faster" matrix multiplication algorithms will reorder the matrices into some more "convenient" ordering. – oakad Feb 10 '21 at 06:07

2 Answers

0

My assumption would be that L1 is split, so the 32 KB is just the L1d cache. L2 will be unified, which means you shouldn't use all of it just for your data.

Also, I wouldn't worry too much about loop hoisting yet. With all the constexpr, the optimizer has a fighting chance to discover the hoists. One issue is that C appears to be the out-variable. Since that's a void*, it can alias a size_t. Had you written double* C, it could not alias a size_t. This is an example of how type safety is not just good for safety, but also for speed.
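
A minimal sketch of that suggested signature change (my illustration, not code from the question; the body would be unchanged apart from dropping the `double* _C = (double*)(C);` cast):

#include <cstddef>

// C taken as double* instead of void*: no cast is needed inside, and under the
// aliasing rules the compiler may assume it cannot alias a size_t.
template<std::size_t N, std::size_t P, std::size_t M>
void intrinsic_multiply(double* A, double* B, double* C);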

MSalters
  • I'll change `_A`. Yes, L1 is split from a 64 KB cache. Thanks for the tip about aliasing. L3 is the shared cache; each core has its own L1 and L2 cache. Hopefully someone can still offer some clarity on dealing with the different cache sizes. – Ivor Denham-Dyson Feb 02 '21 at 14:21
    @IvorDenham-Dyson: I think you missed my point when I said L2 was unified. That doesn't mean that it's shared between cores. It means it's shared between code and data. Hence, your idea of "under-utilizing" L2 is fine, because it leaves room in L2 for code. – MSalters Feb 02 '21 at 15:11
  • Excellent point, didn't know L2 was generally unified. Please see my edit. – Ivor Denham-Dyson Feb 02 '21 at 15:51
0

Please see the following paper (and similar ones on scholar.google.com); I think it fits your question well. (BTW, I suppose you already use `_mm256_mul_pd`.) You will find the PDF on the internet (I don't want to link it here).

A Coordinated Tiling and Batching Framework for Efficient GEMM on GPUs, X Li, Y Liang, S Yan, L Jia, Y Li - Proceedings of the 24th Symposium on …, 2019 - dl.acm.org

"The fundamental problems are tiling, batching,and their synergistic interaction. Tiling means to tile each GEMM into many tiles. We allow different GEMMs to have different tiling strategies instead of sharing a uniform tiling strategy. How to unify different tiling strategies into a single kernel is a challenge."

"Given a GEMM with size M×N×K, the C matrix is partitioned into multiple tiles with size BY×BX. Each tile of C needs to access a whole row section of A matrix with size BY×K and a whole column section of B matrix with size K×BX in Figure 1(a). However, the whole row band of A and column band of B are too large to be accommodated in theshared memory and register file. To use the on-chip memory,the workload along K-dimension has to be partitioned into many segments as shown in Figure 1(b). Each segment of the row section of A is called an A tile with size BY×BK,and each segment of the column section of B is called a B tile with size BK×BX. The final result can be obtained by accumulating the partial result of each segment along K-dimension."

COMMENT: I like the idea of taking different solutions into account, or of letting an optimizer find the best one: rather than working out by hand which tile size performs best, let the computer find out (although it is clearly more effort to program a variable tile size). This is research territory. I just know that you should store and access the data in RAM in a regular order, or you can get cache misses, which can be a huge performance penalty. Another thought: some other processes might like to live in the cache too. I am not sure how you would take that into account; it is not mentioned.

sidcoder