I am using PAPI to count L1 cache accesses. The native events mostly give the expected results, but there is one case where MEM_LOAD_RETIRED.L1_MISS is far from what I expect. I have a 64-byte object and a volatile array of 100,000 such objects, as shown in the code:

typedef struct _object{
  int value;
  char pad[60];
} object;

#define arr_size 100000
volatile object  array [arr_size];

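void *loop(void *arg){  /* signature required by pthread_create */
  /* Threads are set in NUMA2 */
  (void)arg;  /* unused */
  int temp;
  for(int i=0; i < arr_size; i++){
    temp = array[i].value;  /* volatile read, so it is kept at -O3 */
  }
  (void)temp;
  return NULL;
}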

I am testing on a Skylake processor with two NUMA nodes. I have disabled the prefetchers and compile with gcc -O3. The scenario is the following: from the main thread, which is pinned to NUMA1, I initialize the array and flush its cache lines. Then I create 5 threads that read the same array from NUMA2 by calling the loop function. After all of them have terminated, I loop over the array from the main thread, reading each element, and monitor the L1 cache access counters:

int main(int argc, char* argv[]){
    /* Main thread is set in NUMA1 */
    /* Array is initialized and flushed from the cache */
    /* 5 threads are created with pthread_create, that call loop function,
       and waited to finish by calling pthread_join */

    int tmp;
    /* Hardware counters are counted for this loop */
    for(int i=0; i < arr_size; i++){
        tmp = array[i].value;
    }
}

I am reading these 5 native event counters:

MEM_INST_RETIRED.ALL_LOADS: 100095
L1D.REPLACEMENT: 100246 
MEM_LOAD_RETIRED.L1_HIT: 113 
MEM_LOAD_RETIRED.L1_MISS: 56 
MEM_LOAD_RETIRED.FB_HIT: 55 

I expected L1_MISS to be around 100,000, because the elements are not in the main thread's cache and each read in main should cause a miss. Also, ALL_LOADS is not equal to the sum of the three counters L1_HIT + L1_MISS + FB_HIT. L1D.REPLACEMENT seems to make sense here, since it counts L1D data line replacements, but I'm not convinced by it because it also counts prefetches when prefetching is enabled.

I don't understand why MEM_LOAD_RETIRED.L1_MISS does not see the events caused by the reads in main, and only in this specific scenario. For example, if the threads on NUMA2 modify each array element instead of reading it, then for the same loop I get L1_MISS: 99818. Any suggestions would be helpful. I have tried to provide the main skeleton of the code. If any of the commented-out parts are important, I can add them as well.

Ana Khorguani