
As far as I know, giving the compiler more information (restrict, static on functions, __builtin_expect(), etc.) should make a program faster, or at least no slower. Here, however, it had the opposite effect.

This is a function that changes the storage order of the data in a matrix (a packing step for matrix multiplication). The src matrix is m * n, and the dst matrix is MAX_M * MAX_N. The case 2) line is commented out for now.

// pack.c

#define MAX_M 5000
#define MAX_N 5000
#define EPC 8  // number of Elements Per Cache line
               // also AVX-512 SIMD register can hold up to 8 double-precision floating points.

void pack(int m, int n, const double *restrict src, double *restrict dst) {
    int upper_n = (n + EPC - 1) / EPC;
    int remainder_n = n % EPC;
    for (int i = 0; i < m; ++i) {
        for (int j = 0; j < upper_n; ++j) {
            int len = j < upper_n - 1 || remainder_n == 0 ? EPC : remainder_n; // case 1)
            // int len = EPC;                                                  // case 2)
            for (int k = 0; k < len; ++k) {
                dst[i * EPC + j * EPC * MAX_M + k] = src[i * n + j * EPC + k];
            }
        }
    }
}

I used the code below to measure the performance of the pack function. It runs pack(5000, 5000, A, B) 50 times and reports the average execution time. A and B are both aligned to 64 bytes, and each holds 5000 * 5000 doubles.

// main.c

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define MAX_M 5000
#define MAX_N 5000

#define ITERATION 50

void pack(int m, int n, const double *restrict src, double *restrict dst);

int main(int argc, char **argv) {
    int m = 5000;
    int n = 5000;

    double *A;
    double *B;

    posix_memalign((void **)&A, 64, sizeof(double) * m * n);
    posix_memalign((void **)&B, 64, sizeof(double) * MAX_M * MAX_N);

    for (int i = 0; i < m * n; ++i) A[i] = i;

    double total_duration = 0;
    for (int i = 0; i < ITERATION; ++i) {
        double start_time = omp_get_wtime();
        pack(m, n, A, B);
        double end_time = omp_get_wtime();
        double duration = end_time - start_time;

        total_duration += duration;
    }
    printf("avg duration: %.8lf s\n", total_duration / ITERATION);

    free(A);
    free(B);

    return 0;
}

The benchmark only calls pack with n=5000. That means remainder_n in pack is always 0 and len is always 8, so I switched the pack function from case 1) to case 2).
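The claim that len is always 8 for n=5000 can be checked directly. The sketch below reuses the case 1) expression from pack.c; `chunk_len` and `count_short_chunks` are hypothetical helper names introduced here for illustration, not part of the original code.

```c
#include <assert.h>

#define EPC 8  /* elements per cache line, same value as in pack.c */

/* Length of chunk j in a row of n elements, using the case 1) expression. */
static int chunk_len(int n, int j) {
    int upper_n = (n + EPC - 1) / EPC;
    int remainder_n = n % EPC;
    assert(j >= 0 && j < upper_n);  /* j must be a valid chunk index */
    return (j < upper_n - 1 || remainder_n == 0) ? EPC : remainder_n;
}

/* Number of chunks in one row whose length differs from EPC. */
static int count_short_chunks(int n) {
    int upper_n = (n + EPC - 1) / EPC;
    int short_chunks = 0;
    for (int j = 0; j < upper_n; ++j) {
        if (chunk_len(n, j) != EPC) ++short_chunks;
    }
    return short_chunks;
}
```

Since 5000 = 625 * 8, `count_short_chunks(5000)` returns 0, so case 1) and case 2) pick the same len for every chunk in this benchmark. For an n that is not a multiple of 8 (for example 5001), exactly one chunk per row is shorter, and case 2) would read past the end of each row.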

Then something weird happens: performance gets worse. Case 2) is slower than case 1). I gave the compiler more information (len is always 8), yet it produced slower code.

avg duration: 0.05746786 s    <- case 1)
avg duration: 0.06110375 s    <- case 2)

Is it possible for giving the compiler more information to make a program slower? Or is this just an issue with this particular compiler?

The target machine is an Intel Xeon Phi 7250 (Intel Knights Landing). The compile command is icc -o perf_test main.c pack.c -qopenmp -march=knl -O3. The assembly of the pack function is like this, except that mine uses movslq where the link uses movsxd.


I experimented by modifying parts of the code, and found that "case 1) is faster than case 2)" only holds in one specific configuration.

Case 2) becomes faster than case 1) if I

  • change the compiler from icc to gcc
  • move the pack function into main.c
  • remove the restrict keyword from the pack function
  • remove the -march=knl flag

Case 1) becomes as slow as case 2) if I

  • replace remainder_n in case 1) with any integer literal or constant
    int len = j < upper_n - 1 || remainder_n == 0 ? EPC : 0;
    
    or
    
    int len = j < upper_n - 1 || remainder_n == 0 ? EPC : EPC;
    

In other words, case 2) is slower than case 1) only when none of the changes above are applied. I don't know why the compiler generates a slower program under exactly these conditions.
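The timing comparison is only meaningful if both variants do the same work. A minimal sketch of such a check follows, assuming a small test size instead of 5000 x 5000. The names `pack_case1`, `pack_case2` and `packs_agree`, and the `ld` parameter standing in for MAX_M, are hypothetical and only there so the check fits in small buffers.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define EPC 8  /* same value as in pack.c */

/* Case 1): the last chunk of each row may be shorter than EPC. */
static void pack_case1(int m, int n, int ld,
                       const double *restrict src, double *restrict dst) {
    int upper_n = (n + EPC - 1) / EPC;
    int remainder_n = n % EPC;
    for (int i = 0; i < m; ++i)
        for (int j = 0; j < upper_n; ++j) {
            int len = j < upper_n - 1 || remainder_n == 0 ? EPC : remainder_n;
            for (int k = 0; k < len; ++k)
                dst[i * EPC + j * EPC * ld + k] = src[i * n + j * EPC + k];
        }
}

/* Case 2): every chunk has length EPC; only valid when n % EPC == 0. */
static void pack_case2(int m, int n, int ld,
                       const double *restrict src, double *restrict dst) {
    int upper_n = (n + EPC - 1) / EPC;
    for (int i = 0; i < m; ++i)
        for (int j = 0; j < upper_n; ++j)
            for (int k = 0; k < EPC; ++k)
                dst[i * EPC + j * EPC * ld + k] = src[i * n + j * EPC + k];
}

/* Returns 1 if both variants fill dst identically, 0 if they differ,
   and -1 if an allocation fails. Requires n to be a multiple of EPC. */
static int packs_agree(int m, int n) {
    assert(n % EPC == 0);
    int ld = m;  /* stands in for MAX_M; must be >= m */
    size_t src_len = (size_t)m * (size_t)n;
    size_t dst_len = (size_t)(n / EPC) * EPC * (size_t)ld;
    double *src = malloc(src_len * sizeof *src);
    double *d1 = calloc(dst_len, sizeof *d1);
    double *d2 = calloc(dst_len, sizeof *d2);
    int result = -1;
    if (src && d1 && d2) {
        for (size_t i = 0; i < src_len; ++i) src[i] = (double)i;
        pack_case1(m, n, ld, src, d1);
        pack_case2(m, n, ld, src, d2);
        result = memcmp(d1, d2, dst_len * sizeof *d1) == 0;
    }
    free(src);
    free(d1);
    free(d2);
    return result;
}
```

Calling `packs_agree(4, 16)` compares the two outputs byte for byte. This only confirms that the variants are functionally equivalent when n is a multiple of EPC; it says nothing about why ICC generates different code for them.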

  • `movslq` is the AT&T mnemonic for `movsxd` (https://www.felixcloutier.com/x86/movsx:movsxd), sign-extension from 32 to 64-bit. (Typical if the compiler can't prove an `int` is non-negative when using it to index an array.) – Peter Cordes Dec 18 '22 at 04:31
  • This is just a manual transpose? Normally matrix libraries have optimized code for this, probably including the case where the storage geometry doesn't match the matrix, i.e. there's padding at the end of each row or not, if that's why you're calling it `pack`. Also, optimized matmul functions can normally handle the input matrices being transposed or not, doing it on the fly or auto-vectorizing in a way that suits the layout, so you normally don't want a separate pass over the whole data just transposing not doing any math work the whole time. – Peter Cordes Dec 18 '22 at 04:38
  • What kind of answer are you looking for? Why ICC makes different asm, or why the asm difference creates a performance difference on KNL? If the latter, you should show the asm difference in the question. – Peter Cordes Dec 18 '22 at 04:52
  • @PeterCordes Thanks for commenting. My question is closer to the latter: "why does the asm difference create a performance difference on KNL?" and "why does ICC produce slower code even though I gave it more information (`len` is always 8)?" As far as I know, the more information the programmer gives, the more optimization opportunities the compiler has. It seems nonsensical that providing more information would result in worse performance here. – enochjung Dec 18 '22 at 05:24
  • I'm studying how to write a fast matrix multiplication routine. The `pack` code exists only for that purpose. Thank you for your understanding. – enochjung Dec 18 '22 at 05:34
  • I would first of all replace all `int` (all which are used for indexing) by `long` (or `ptrdiff_t`), this will prevent a lot of sign-extension code and can help the compiler to optimize the code better. I would also recommend to compare with other compilers, e.g. clang is typically quite good in optimizing shuffles. – chtz Dec 18 '22 at 12:54
  • Note: `for (int k = 0; k < len; ++k) dst[i * EPC + j * EPC * MAX_M + k] = src[i * n + j * EPC + k];` replaceable with speedier `memcpy(&dst[i * EPC + j * EPC * MAX_M], &src[i * n + j * EPC], sizeof dst[0] * len)`. Consider adding some test code to verify not writing out of array bounds. If pointers are not `restrict`, use `memmove()`. – chux - Reinstate Monica Dec 18 '22 at 13:04
  • *Is it possible that giving information to compiler makes program slower?* - It shouldn't, but in practice compilers are imperfect. They're not "smart", they're just complex pieces of machinery that spit out machine code. Often they rely on heuristics, not an exact performance model of the target machine. So yeah, it's hopefully rare that this happens, but totally believable. Sometimes just a quirk of code alignment will bite you, other times a compiler just makes worse code. Keep in mind this is ICC, which Intel is replacing with LLVM-based ICX, probably for reasons like this. – Peter Cordes Dec 18 '22 at 23:03
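Two of the suggestions in the comments (`ptrdiff_t` indices and `memcpy` for the inner copy) can be combined into one variant. This is a rough sketch only: the name `pack_memcpy` and the `ld` parameter (standing in for MAX_M so that small buffers can be used in a test) are assumptions made here, and whether this is actually faster with ICC on KNL has not been measured.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define EPC 8  /* same value as in pack.c */

/* Variant of pack() with ptrdiff_t indices (no int sign-extension when
   indexing) and memcpy in place of the innermost copy loop.
   ld is the row count of the destination layout (MAX_M in pack.c). */
static void pack_memcpy(ptrdiff_t m, ptrdiff_t n, ptrdiff_t ld,
                        const double *restrict src, double *restrict dst) {
    assert(ld >= m);  /* otherwise neighbouring column blocks would overlap */
    ptrdiff_t upper_n = (n + EPC - 1) / EPC;
    ptrdiff_t remainder_n = n % EPC;
    for (ptrdiff_t i = 0; i < m; ++i) {
        for (ptrdiff_t j = 0; j < upper_n; ++j) {
            /* Same chunk length as case 1) in the question. */
            ptrdiff_t len = (j < upper_n - 1 || remainder_n == 0) ? EPC : remainder_n;
            memcpy(&dst[i * EPC + j * EPC * ld],
                   &src[i * n + j * EPC],
                   sizeof dst[0] * (size_t)len);
        }
    }
}
```

Because src and dst are declared `restrict`, `memcpy` is appropriate; if the buffers could overlap, `memmove` would be needed instead, as the comment points out. Elements of dst beyond the short last chunk are left untouched, which matches the behaviour of case 1).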
