
I am using the PAPI library to read hardware counters. I have noticed that where I call PAPI_library_init(PAPI_VER_CURRENT) in the program influences the results I get. My initialization and the array accesses look like this:

int retval;

/*
retval = PAPI_library_init(PAPI_VER_CURRENT);

if (retval != PAPI_VER_CURRENT) {
    fprintf(stderr, "PAPI library init error!\n");
    exit(1);
}
*/

/* First loop: initialize the array. */
for (int i = 0; i < arr_size; i++) {
    array[i].value = 1;
    //_mm_clflush(&array[i]); // flushing does not make a difference
}
_mm_mfence();

/* Second loop: read the array. */
for (int i = 0; i < arr_size; i++) {
    temp = array[i].value;
}
_mm_mfence();

retval = PAPI_library_init(PAPI_VER_CURRENT);

if (retval != PAPI_VER_CURRENT) {
    fprintf(stderr, "PAPI library init error!\n");
    exit(1);
}

The second loop, which reads the array, is there for the coherence protocol, I believe, but it should not matter much here. After this, I add the MEM_LOAD_RETIRED native events to the event set I want to read, and I call PAPI_read around this third loop (I read the counters before and after the loop and print the difference at the end):

for (int i = 0; i < arr_size; i++) {
    temp = array[i].value;
}

where arr_size is 1000 and each element of the array is 64 bytes (one cache line). I have disabled all the prefetchers. I compile with gcc using -O3 and link with -lpapi. With this code, for the third loop I get:

L1_HIT: 64, L1_MISS: 1011, L2_HIT: 15, L2_MISS: 996.

However, if I uncomment PAPI_library_init before the array initialization and comment it out after, the results I get are:

L1_HIT: 73, L1_MISS: 1004, L2_HIT: 990, L2_MISS: 14.
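
The declaration of `array` is not shown here. A minimal sketch of a layout that would match the description (64-byte elements, `volatile` so the loads survive `-O3`) might look like the following. The names `struct line`, `init_pass`, and `read_pass` are hypothetical and not taken from the original code.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stdalign.h> /* alignas */
#include <stddef.h>

#define CACHE_LINE 64 /* assumed line size, matching the 64-byte elements */

/* One element per cache line: alignas pads sizeof(struct line) to 64. */
struct line {
    alignas(CACHE_LINE) volatile long value; /* volatile keeps every access */
};

static_assert(sizeof(struct line) == CACHE_LINE, "element must span one line");

/* First loop from the question: write each element once. */
void init_pass(struct line *a, size_t n) {
    for (size_t i = 0; i < n; i++)
        a[i].value = 1;
}

/* Read loop: load each element; the sum only makes the result checkable. */
long read_pass(struct line *a, size_t n) {
    long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += a[i].value;
    return sum;
}
```

With this layout, consecutive elements never share a cache line, so each iteration of the read loop touches exactly one line.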

I am testing this on a Skylake server. The cache sizes are:

L1d cache:             32K
L1i cache:             32K
L2 cache:              1024K
L3 cache:              22528K
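
For reference, the array footprint can be compared against these capacities with a small calculation. This is only a sketch: it considers capacity alone and ignores associativity and any other data in the caches, and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LINE_BYTES 64u /* element size stated in the question */

/* Capacities in bytes, from the cache sizes listed above. */
static const size_t L1D_BYTES = 32u * 1024u;
static const size_t L2_BYTES  = 1024u * 1024u;
static const size_t L3_BYTES  = 22528u * 1024u;

/* Bytes touched by one pass over `elems` cache-line-sized elements. */
size_t footprint_bytes(size_t elems) {
    assert(elems <= SIZE_MAX / LINE_BYTES); /* guard against overflow */
    return elems * LINE_BYTES;
}

/* Smallest level whose capacity covers the array: 1, 2, 3, or 0 if none. */
int smallest_fitting_level(size_t elems) {
    size_t bytes = footprint_bytes(elems);
    if (bytes <= L1D_BYTES) return 1;
    if (bytes <= L2_BYTES)  return 2;
    if (bytes <= L3_BYTES)  return 3;
    return 0;
}
```

By capacity alone, the 1000-element array (64,000 bytes) exceeds the L1d but fits comfortably in the L2, which is consistent with the run where almost all loads hit in L2.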

Now I am a bit confused about why PAPI initialization would influence these results. It is the L2 hits and misses that change. All I measure is the third loop, so the effect of the first two loops on the counters should not be included, I believe.

So any hint would be helpful, as all the documentation says is this: "PAPI_library_init() initializes the PAPI library. It must be called before any low level PAPI functions can be used. If your application is making use of threads PAPI_thread_init (3) must also be called prior to making any calls to the library other than PAPI_library_init()."

Ana Khorguani
  • Can you check without `_mm_clflush(&array[i]);`? Can you check smaller array sizes, such as 500 and 300 elements instead of 1000? Did you qualify the array declaration with `volatile` so that the compiler won't optimize away the loads at `-O3`? – Hadi Brais Mar 07 '19 at 18:20
  • @HadiBrais Yes, the array is volatile, so the reads into temp won't be optimized away. I will check without the flush. I am sure I will get the same result, but I will try. – Ana Khorguani Mar 07 '19 at 18:30
  • @HadiBrais Yes, I get the same behavior without clflush. Just to make sure: in the first loop, when I initialize the array, it writes to the element for the first time and then evicts this cache line, right? I did not suspect this before, but I tested it today and, surprisingly, this is what I observed. I read about on-demand zeroing in another post, which, as I understood it, was the reason for the RFO case, right? So is it somehow related to cache line eviction too? – Ana Khorguani Mar 07 '19 at 18:37
  • For 500, initializing PAPI after gives: L1_HIT: 62, L1_MISS: 513, L2_HIT: 16, L2_MISS: 497, and initializing before gives: L1_HIT: 67, L1_MISS: 510, L2_HIT: 377, L2_MISS: 133. It seems to be the same behavior. It's the same for 300. Initializing after: L1_HIT: 83, L1_MISS: 304, L2_HIT: 6, L2_MISS: 298. Initializing before the array: L1_HIT: 82, L1_MISS: 302, L2_HIT: 117, L2_MISS: 185. – Ana Khorguani Mar 07 '19 at 18:48
  • It is as if [PAPI_library_init](https://github.com/pyrovski/papi/blob/fcdcc615e5f310e2f67419c3619895414770ca28/src/papi.c#L495) is causing all the L2 lines to be evicted. Looking at the source code, I don't see why this would happen. What about MEM_LOAD_RETIRED.L3_MISS and MEM_LOAD_RETIRED.L3_HIT? – Hadi Brais Mar 07 '19 at 18:58
  • Yes your understanding of the RFO case looks correct. – Hadi Brais Mar 07 '19 at 18:59
  • @HadiBrais OK, thank you. But is it normal that on the first access to the array, it gets the value and the cache line is evicted without flushing? I don't see a good enough reason for it. – Ana Khorguani Mar 07 '19 at 19:04
  • Possible explanations include: either PAPI_library_init has a large working set that barely fits in the L2 (this is not obvious from the source code), or the L2 cache is evicting lines using some sort of dead block prediction mechanism. – Hadi Brais Mar 07 '19 at 19:09
  • Hm, OK. I tried the L3 events now, and it seems that the array ends up in L3. The result with PAPI initialized after the array is: L2_HIT: 19, L2_MISS: 991, L3_HIT: 990, L3_MISS: 1. – Ana Khorguani Mar 07 '19 at 19:13
  • But how big can PAPI initialization be? I tried with 5000 elements, which seems close to the maximum the L2 can hold in this case. I got: L2_HIT: 4403, L2_MISS: 550, L3_HIT: 547, L3_MISS: 3 if initialized beforehand, and if initialized after, it still evicts everything: L2_HIT: 19, L2_MISS: 4976, L3_HIT: 4976, L3_MISS: 0. – Ana Khorguani Mar 07 '19 at 19:17
  • Yeah it looks like PAPI_library_init is interfering with the contents of the L2 cache. We can find a definitive answer to the question by counting L2 and L1 fills that are caused by PAPI_library_init. Can you use PAPI itself to count events for PAPI_library_init or would it just crash and burn? – Hadi Brais Mar 07 '19 at 19:23
  • :) I have no idea. Well, I can't use PAPI_read before initialization; it gives errors. I will try calling the initialization a second time. I'm not sure it will work, but I will see. – Ana Khorguani Mar 07 '19 at 19:29
  • I think it works somehow. I put a second PAPI_library_init(PAPI_VER_CURRENT) call between the counter reads and this is what I got: L1_HIT: 1536, L1_MISS: 61, L2_HIT: 30, L2_MISS: 31. – Ana Khorguani Mar 07 '19 at 19:32
  • I think you need to first access an array that is larger than the L2 size to evict all the lines needed by `PAPI_library_init`. Then call `PAPI_library_init` for the second time and measure `L2_LINES_IN.ALL`. This will tell us how many lines `PAPI_library_init` needs. Now I'm thinking that the instruction footprint of `PAPI_library_init` may be large. – Hadi Brais Mar 07 '19 at 19:50
  • So I have an array size of 1 million. I call PAPI_library_init, then go through the array twice: the first time I initialize it, the second time I read each element, so it should be in cache. Then I call PAPI_library_init again between the counter reads. This is still the result: L1_HIT: 1530, L1_MISS: 58, L2_HIT: 14, L2_MISS: 44. Or do you mean something else? – Ana Khorguani Mar 07 '19 at 19:55
  • This means that PAPI_library_init accesses a few data cache lines, but we also need to consider the impact of the instruction cache lines. So I think we should measure `L2_LINES_IN.ALL` which counts both L2 fills for data and instruction lines. – Hadi Brais Mar 07 '19 at 19:58
  • This is weird, but maybe it is because I call PAPI_start and PAPI_read before reading the counters, which might bring in some of the cache lines used by initialization as well? But it seems to me that the ALL_LOAD instructions these two functions cause number somewhere around 90. – Ana Khorguani Mar 07 '19 at 19:58
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/189618/discussion-between-hadi-brais-and-ana-khorguani). – Hadi Brais Mar 07 '19 at 19:59
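
The eviction step suggested in the comments (touching an array larger than the L2 before calling `PAPI_library_init` a second time) could be sketched roughly as below. This is only an illustration: `sweep_lines` is a hypothetical helper, the buffer size is an assumption, and a single sweep is not guaranteed to displace every line, since that depends on the replacement policy.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define LINE_BYTES 64u /* cache line size assumed from the question */

/* Write one byte in every cache line of buf; returns the number of lines touched. */
size_t sweep_lines(volatile unsigned char *buf, size_t bytes) {
    assert(buf != NULL);
    size_t touched = 0;
    for (size_t off = 0; off < bytes; off += LINE_BYTES) {
        buf[off] = 1; /* volatile store, so the compiler keeps the access */
        touched++;
    }
    return touched;
}
```

Calling `sweep_lines` on a 4 MiB heap buffer touches 65536 lines, four times the capacity of the 1024K L2 listed in the question. The second `PAPI_library_init` call and the counter reads would follow the sweep.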
