As far as I know, giving the compiler more information (for example restrict, static on functions, __builtin_expect(), etc.) should make a program faster, or at least no slower. However, here it had the opposite effect.
This function changes the storage order of the data in a matrix (a packing step for matrix multiplication). The src matrix has size m * n, and the dst matrix has size MAX_M * MAX_N. The line marked case 2) is disabled for now.
// pack.c
#define MAX_M 5000
#define MAX_N 5000
#define EPC 8 // number of Elements Per Cache line
// also AVX-512 SIMD register can hold up to 8 double-precision floating points.
void pack(int m, int n, const double *restrict src, double *restrict dst) {
    int upper_n = (n + EPC - 1) / EPC;
    int remainder_n = n % EPC;
    for (int i = 0; i < m; ++i) {
        for (int j = 0; j < upper_n; ++j) {
            int len = j < upper_n - 1 || remainder_n == 0 ? EPC : remainder_n; // case 1)
            // int len = EPC; // case 2)
            for (int k = 0; k < len; ++k) {
                dst[i * EPC + j * EPC * MAX_M + k] = src[i * n + j * EPC + k];
            }
        }
    }
}
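To make the layout concrete, here is a minimal sketch of the index mapping that pack performs. The helper name dst_index is hypothetical (it is not part of pack.c); it only restates the index expression from the inner loop, assuming the same MAX_M and EPC values.

```c
#include <stddef.h>

#define MAX_M 5000
#define EPC 8 /* elements per cache line, as in pack.c */

/* Hypothetical helper: index in dst where pack() stores src[row * n + col]. */
static size_t dst_index(int row, int col) {
    size_t panel = (size_t)(col / EPC); /* j in pack(): which 8-column panel */
    size_t lane = (size_t)(col % EPC);  /* k in pack(): position inside the panel row */
    return (size_t)row * EPC + panel * EPC * MAX_M + lane;
}
```

With these values, element (4999, 4999) of a 5000 * 5000 source lands at index 24999999, which is the last slot of the 5000 * 5000 dst buffer.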
I used the code below to measure the performance of the pack function. It runs pack(5000, 5000, A, B) 50 times and reports the average execution time. A and B are 64-byte aligned, and each holds 5000 * 5000 doubles.
// main.c
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
#define MAX_M 5000
#define MAX_N 5000
#define ITERATION 50
void pack(int m, int n, const double *restrict src, double *restrict dst);
int main(int argc, char **argv) {
    int m = 5000;
    int n = 5000;
    double *A;
    double *B;
    posix_memalign((void **)&A, 64, sizeof(double) * m * n);
    posix_memalign((void **)&B, 64, sizeof(double) * MAX_M * MAX_N);
    for (int i = 0; i < m * n; ++i) A[i] = i;
    double total_duration = 0;
    for (int i = 0; i < ITERATION; ++i) {
        double start_time = omp_get_wtime();
        pack(m, n, A, B);
        double end_time = omp_get_wtime();
        double duration = end_time - start_time;
        total_duration += duration;
    }
    printf("avg duration: %.8lf s\n", total_duration / ITERATION);
    free(A);
    free(B);
    return 0;
}
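Since the benchmark reports only an average, a small summary of the individual run times can help show how much the runs vary. The following is a sketch, not part of the original benchmark: the struct timing_summary and the function summarize are hypothetical names, and it assumes each run's duration has been stored in an array.

```c
#include <stddef.h>

/* Hypothetical summary of per-run timings, in seconds. */
struct timing_summary {
    double min;
    double max;
    double avg;
};

/* Summarize `count` samples; count must be at least 1. */
static struct timing_summary summarize(const double *samples, size_t count) {
    struct timing_summary s = { samples[0], samples[0], 0.0 };
    double total = 0.0;
    for (size_t i = 0; i < count; ++i) {
        if (samples[i] < s.min) s.min = samples[i];
        if (samples[i] > s.max) s.max = samples[i];
        total += samples[i];
    }
    s.avg = total / (double)count;
    return s;
}
```

In main.c this would mean writing duration into an array of ITERATION doubles inside the timing loop and calling summarize on it afterwards.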
The benchmark only calls pack with n = 5000. That means remainder_n in pack is always 0 and len is always 8, so I used case 2) instead of case 1) in the pack function.
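The claim that len is always 8 for n = 5000 can be checked directly. This is a minimal sketch; len_case1 and all_chunks_full are hypothetical helper names that reuse the case 1) expression from pack.

```c
#define EPC 8

/* Case 1) expression from pack(), for chunk j of a row with n columns. */
static int len_case1(int n, int j) {
    int upper_n = (n + EPC - 1) / EPC;
    int remainder_n = n % EPC;
    return (j < upper_n - 1 || remainder_n == 0) ? EPC : remainder_n;
}

/* Returns 1 if every chunk of a row has length EPC, i.e. case 2) computes the same len. */
static int all_chunks_full(int n) {
    int upper_n = (n + EPC - 1) / EPC;
    for (int j = 0; j < upper_n; ++j) {
        if (len_case1(n, j) != EPC) return 0;
    }
    return 1;
}
```

For n = 5000 there are 625 chunks and all of them are full; for an n that is not a multiple of 8, such as 5001, the last chunk is shorter and case 2) would no longer be equivalent.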
Then something weird happens: performance gets worse. Case 2) is slower than case 1). I gave the compiler more information (len is always 8), but it produced slower code.
avg duration: 0.05746786 s <- case 1)
avg duration: 0.06110375 s <- case 2)
Can giving the compiler more information really make a program slower, or is this just an issue with this compiler?
The target machine is an Intel Xeon Phi 7250 (Intel Knights Landing). The compile command is icc -o perf_test main.c pack.c -qopenmp -march=knl -O3. The assembly of the pack function is like this, except that mine uses movslq where the link uses movsxd.
I tested this by modifying the code in several ways, and found that "case 1) is faster than case 2)" is a special case.
Case 2) becomes faster than case 1) if I
- change the compiler from icc to gcc
- move the pack function into the main.c file
- remove the restrict keyword from the pack function
- remove the -march=knl flag
Case 1) becomes as slow as case 2) if I
- change case 1)'s remainder_n to any integer literal, e.g. int len = j < upper_n - 1 || remainder_n == 0 ? EPC : 0; or int len = j < upper_n - 1 || remainder_n == 0 ? EPC : EPC;
In other words, case 2) is slower than case 1) only when none of the changes above are applied. I don't know why the compiler produces slower code under exactly these conditions.
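For comparison, one more variant that could be tried is sketched below: it keeps a constant trip count for full chunks while still handling a remainder, by copying the tail once per row. This is only a sketch and has not been measured on the machine above, so it is not a claim about which version icc compiles better. The names pack_ref and pack_split and the max_m parameter (standing in for the MAX_M macro so small buffers can be used) are hypothetical.

```c
#define EPC 8

/* Reference: same loop nest as case 1), with the panel stride passed in as max_m. */
static void pack_ref(int m, int n, int max_m,
                     const double *restrict src, double *restrict dst) {
    int upper_n = (n + EPC - 1) / EPC;
    int remainder_n = n % EPC;
    for (int i = 0; i < m; ++i) {
        for (int j = 0; j < upper_n; ++j) {
            int len = (j < upper_n - 1 || remainder_n == 0) ? EPC : remainder_n;
            for (int k = 0; k < len; ++k) {
                dst[i * EPC + j * EPC * max_m + k] = src[i * n + j * EPC + k];
            }
        }
    }
}

/* Variant: full chunks use a constant trip count; the tail is copied once per row. */
static void pack_split(int m, int n, int max_m,
                       const double *restrict src, double *restrict dst) {
    int full_n = n / EPC;      /* number of complete 8-element chunks */
    int remainder_n = n % EPC; /* elements left over in the last chunk */
    for (int i = 0; i < m; ++i) {
        for (int j = 0; j < full_n; ++j) {
            for (int k = 0; k < EPC; ++k) {
                dst[i * EPC + j * EPC * max_m + k] = src[i * n + j * EPC + k];
            }
        }
        for (int k = 0; k < remainder_n; ++k) {
            dst[i * EPC + full_n * EPC * max_m + k] = src[i * n + full_n * EPC + k];
        }
    }
}
```

Both functions write the same elements to the same positions, so either can be dropped into the benchmark to see whether the code generated for it differs.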