I am writing my own GeMM subroutine for fun. I have managed to achieve a tiled version over the L1 cache with a 256-bit AVX kernel. I am aware there are some loop invariants I could be hoisting out of the code below.
template<size_t N, size_t P, size_t M>
void intrinsic_multiply(double* A, double* B, void* C) {
    // Aligned packing buffers for one PACK_SIZE x PACK_SIZE tile of A and one of B.
    alignas(32) double _A[PACK_SIZE * PACK_SIZE];
    alignas(32) double _B[PACK_SIZE * PACK_SIZE];
    double* _C = (double*)(C);
    // Remainders, and dimensions rounded up to the next multiple of PACK_SIZE.
    constexpr size_t N_rem = N % PACK_SIZE;
    constexpr size_t M_rem = M % PACK_SIZE;
    constexpr size_t P_rem = P % PACK_SIZE;
    constexpr size_t N_spill = N - N_rem + PACK_SIZE * (size_t)(N_rem != 0);
    constexpr size_t M_spill = M - M_rem + PACK_SIZE * (size_t)(M_rem != 0);
    constexpr size_t P_spill = P - P_rem + PACK_SIZE * (size_t)(P_rem != 0);
    for(size_t i = 0; i < N_spill; i += PACK_SIZE) {
        for(size_t j = 0; j < M_spill; j += PACK_SIZE) {
            for(size_t k = 0; k < P_spill; k += PACK_SIZE) {
                // Pack the current tiles of A and B, padding where a tile hangs over the matrix edge.
                pad_cols<N>(A + i + k * N, _A, ((size_t)(i + PACK_SIZE > N)) * N_rem, ((size_t)(k + PACK_SIZE > P)) * P_rem);
                pad_cols<P>(B + k + j * P, _B, ((size_t)(k + PACK_SIZE > P)) * P_rem, ((size_t)(j + PACK_SIZE > M)) * M_rem);
                // C is laid out with leading dimension N_spill.
                macro_kernal_intrinsic<N_spill>(_A, _B, _C + i + (j * N_spill));
            }
        }
    }
}
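For reference, a minimal sketch of how this gets called, using the same dimensions that appear in the nm output further down (the allocation scheme is only illustrative, and C is zeroed on the assumption that the macro kernel accumulates into it):

#include <algorithm>
#include <cstddef>
#include <cstdlib>

int main() {
    // Same dimensions as the instantiation in the nm output below; all three
    // happen to be multiples of PACK_SIZE, so no spill padding is needed in C.
    constexpr std::size_t N = 2048, P = 3136, M = 2240;
    // Column-major buffers: A is N x P, B is P x M, C is N x M.
    double* A = static_cast<double*>(std::aligned_alloc(32, sizeof(double) * N * P));
    double* B = static_cast<double*>(std::aligned_alloc(32, sizeof(double) * P * M));
    double* C = static_cast<double*>(std::aligned_alloc(32, sizeof(double) * N * M));
    std::fill_n(C, N * M, 0.0);   // assumption: the macro kernel accumulates into C
    // ... fill A and B ...
    intrinsic_multiply<N, P, M>(A, B, C);
    std::free(A);
    std::free(B);
    std::free(C);
}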
I am having difficulty extending this to multilevel cache tiling because the block sizes derived from the cache sizes are not multiples of each other. I estimate the stride size for each cache level as follows, with the CACHE_SIZE constants given in bytes.
static constexpr size_t L3_CACHESIZE = 6291456;   // 6 MiB
static size_t L3_STRIDE_SIZE = (size_t)floor(sqrt(L3_CACHESIZE / sizeof(double)));
static constexpr size_t L2_CACHESIZE = 262144;    // 256 KiB
static size_t L2_STRIDE_SIZE = (size_t)floor(sqrt(L2_CACHESIZE / sizeof(double)));
static constexpr size_t L1_CACHESIZE = 32768;     // 32 KiB
static size_t L1_STRIDE_SIZE = (size_t)floor(sqrt(L1_CACHESIZE / sizeof(double)));
static constexpr size_t PACK_SIZE = 64;           // L1 block edge, in doubles
This gives an L1 stride size of 64 and an L2 stride size of 181. Clearly these are not multiples of each other. I see a few options:
- Only fit two L1 blocks per dimension (a 2x2 tile of 4 L1 blocks) into each L2 iteration, covering elements 0–63 and 64–127. This seems like I'm under-utilizing my L2 cache (a sketch of this option follows the list).
- Use the whole L2 cache: round each L2 tile up to three L1 blocks and zero-pad the extra 64*3-180 = 12 elements per dimension on the last iteration. This would introduce a lot of redundant operations but only reduces the usable L2 stride size by 1 (181 to 180).
- Prefetch a fractional block for the next iteration.
- There is a canonical method I am unaware of.
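For concreteness, here is a rough sketch of what option 1 would look like: round the L2 stride down to a whole number of PACK_SIZE blocks (181 -> 128) and clamp the L1 loops at the edge of each L2 tile. The packing and kernel calls are elided, round_up and loop_structure_sketch are just illustrative names, and the constants mirror the ones defined above.

#include <algorithm>
#include <cstddef>

// These mirror the constants above: a 64-double L1 block and the L2 stride of 181.
static constexpr std::size_t PACK_SIZE = 64;
static constexpr std::size_t L2_STRIDE_SIZE = 181;
// Round the L2 block down to a whole number of L1 blocks: 181 -> 128.
static constexpr std::size_t L2_BLOCK = (L2_STRIDE_SIZE / PACK_SIZE) * PACK_SIZE;

// Round a dimension up to the next multiple of PACK_SIZE (same role as N_spill above).
constexpr std::size_t round_up(std::size_t x) {
    return ((x + PACK_SIZE - 1) / PACK_SIZE) * PACK_SIZE;
}

template<std::size_t N, std::size_t P, std::size_t M>
void loop_structure_sketch() {
    constexpr std::size_t N_spill = round_up(N);
    constexpr std::size_t M_spill = round_up(M);
    constexpr std::size_t P_spill = round_up(P);
    // L2-level tiles.
    for (std::size_t ii = 0; ii < N_spill; ii += L2_BLOCK)
    for (std::size_t jj = 0; jj < M_spill; jj += L2_BLOCK)
    for (std::size_t kk = 0; kk < P_spill; kk += L2_BLOCK)
    // L1-level tiles, clamped so they never run past their L2 tile or the padded matrix.
    for (std::size_t i = ii; i < std::min(ii + L2_BLOCK, N_spill); i += PACK_SIZE)
    for (std::size_t j = jj; j < std::min(jj + L2_BLOCK, M_spill); j += PACK_SIZE)
    for (std::size_t k = kk; k < std::min(kk + L2_BLOCK, P_spill); k += PACK_SIZE) {
        // Pack the PACK_SIZE x PACK_SIZE panels and call macro_kernal_intrinsic
        // here, exactly as in the L1-only version above (edge tiles still
        // padded by pad_cols).
    }
}

The min() clamps mean the L2 block does not have to divide the matrix, and the L1 block does not have to divide the L2 block; the cost is that each L2 tile leaves 181 - 128 = 53 elements of potential stride unused per dimension, which is exactly the under-utilization I am worried about.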
What's the best way to deal with block sizes that are not multiples of each other in practice?
Edit - In response to MSalters:
I have run
nm -AP -t d --demangle ./src/example-GeMM | grep "intrinsic"
which gives
./src/example-GeMM: void intrinsic_multiply<2048ul, 3136ul, 2240ul>(double*, double*, void*) W 17081 434
./src/example-GeMM: void macro_kernal_intrinsic<2048ul>(double*, double*, double*) W 20409 234
./src/example-GeMM: void micro_kernal_intrinsic<2048ul>(double*, double*, double*) W 21343 454
meaning the relevant code sections occupy (234 + 454 + 434) / 262144 < 1% of the L2 cache. Hence I am still motivated to utilize a greater segment of my L2 cache for data.