I am using the PAPI library to read hardware counters. I have noticed that where I call PAPI_library_init(PAPI_VER_CURRENT) affects the results I get. My initialization and the array reads look like this:
int retval;
/*
retval = PAPI_library_init(PAPI_VER_CURRENT);
if (retval != PAPI_VER_CURRENT) {
    fprintf(stderr, "PAPI library init error!\n");
    exit(1);
}
*/
for (int i = 0; i < arr_size; i++) {
    array[i].value = 1;
    //_mm_clflush(&array[i]); flushing does not make a difference.
}
_mm_mfence();
for (int i = 0; i < arr_size; i++) {
    temp = array[i].value;
}
_mm_mfence();
retval = PAPI_library_init(PAPI_VER_CURRENT);
if (retval != PAPI_VER_CURRENT) {
    fprintf(stderr, "PAPI library init error!\n");
    exit(1);
}
I believe the second loop, which reads the array, is needed because of the coherence protocol, but it should not matter much here. After this, I add the MEM_LOAD_RETIRED native events to the event set I want to read, and I call PAPI_read around this third loop (once before and once after the loop, printing the difference at the end):
for (int i = 0; i < arr_size; i++) {
    temp = array[i].value;
}
where arr_size is 1000 and each array element is 64 bytes (one cache line). I have disabled all the prefetchers. I compile with gcc using -O3 and link with -lpapi. With this code, for the third loop I get:
L1_HIT: 64, L1_MISS: 1011, L2_HIT: 15, L2_MISS: 996.
However, if I uncomment the PAPI_library_init call before the array initialization and comment out the one after it, the results I get are:
L1_HIT: 73, L1_MISS: 1004, L2_HIT: 990, L2_MISS: 14.
I am testing this on a Skylake server; the cache sizes are:
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 22528K
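To put the array size next to these cache sizes, here is a small sketch of the footprint arithmetic. The element layout is an assumption (an `int value` padded to 64 bytes; the names `element_t`, `footprint_bytes`, and `fits_in` are hypothetical), but the numbers follow from arr_size = 1000 and 64-byte elements: 1000 × 64 = 64000 bytes, which is larger than the 32K L1d and much smaller than the 1024K L2.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_LINE 64

/* Assumed element layout: one int padded out to a full cache line. */
typedef struct {
    int value;
    char pad[CACHE_LINE - sizeof(int)];
} element_t;

static_assert(sizeof(element_t) == CACHE_LINE,
              "element must occupy exactly one cache line");

/* Total bytes touched when reading n elements. */
static size_t footprint_bytes(size_t n)
{
    return n * sizeof(element_t);
}

/* Nonzero if n elements fit in a cache of cache_bytes bytes. */
static int fits_in(size_t n, size_t cache_bytes)
{
    return footprint_bytes(n) <= cache_bytes;
}
```

With these helpers, `fits_in(1000, 32 * 1024)` is false and `fits_in(1000, 1024 * 1024)` is true, so if the array is still cached when the third loop runs, I would expect most of its loads to miss L1 and hit L2.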
Now I am a bit confused about why PAPI initialization would influence these results. It is the L2 hit and miss counts that change. All I need is the third loop, and I believe the first two loops should not affect the counters, since they run before I start reading them.
So any hint would be helpful, as all the documentation says is this: "PAPI_library_init() initializes the PAPI library. It must be called before any low level PAPI functions can be used. If your application is making use of threads PAPI_thread_init (3) must also be called prior to making any calls to the library other than PAPI_library_init()."