
I am trying to profile an application using perf, and for now I am interested only in the traffic to/from DRAM. I was not able to work out from the results what throughput this application is getting from DRAM.

This is how I invoked the perf command:

perf stat -av -e LLC-misses,cache-misses,L1-dcache-load-misses <application>

I am using -a since this application communicates with another daemon process that is already running.

The result I obtain is the following:

LLC-misses: 0 288628898 288606144
cache-misses: 373507 287154835 287143402
L1-dcache-load-misses: 3831372 286357135 286357135

 Performance counter stats for './mclient -d tpch-sf1 /home/lottarini/Desktop/DPU/queries/tpch-monetdb/02.sql':

                 0 LLC-misses                                                   [99.99%]
           373,507 cache-misses                                                 [100.00%]
         3,831,372 L1-dcache-load-misses                                       

       0.035855129 seconds time elapsed

My understanding is that cache-misses is the number of memory references that missed throughout the whole cache hierarchy. This is consistent with the fact that I get many more L1 misses than cache-misses.

First of all why doesn't the tool output a confidence value for the L1 misses?

Why is the number of cache-misses different from the LLC-misses value? If something misses in the whole cache hierarchy it has to miss in the LLC.

Moreover, if I wanted to extract the amount of data that was transferred due to these misses, how can I compute that? Is there a perf event option that I can specify, or do I need to multiply these numbers by the size of the block of memory [who knows] that is transferred on a miss?

igon
  • Can you tell which CPU architecture you are running this on? The perf output totally depends on how Linux kernel is configured for that architecture. – Milind Dumbare Nov 03 '14 at 08:52
  • Proc: http://ark.intel.com/products/52213/Intel-Core-i7-2600-Processor-8M-Cache-up-to-3_80-GHz uname -a: Linux c1 3.2.0-70-generic #105-Ubuntu SMP Wed Sep 24 19:49:16 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux – igon Nov 03 '14 at 15:44

1 Answer


The events listed by perf list are not all the events that can be monitored on a system.

You can install libpfm to get a list of all the event counters available on your system with the command showevtinfo. In the case of a Sandy Bridge machine there are three sets of counters that showevtinfo displays:

  1. perf_events generic PMU: These correspond to the events listed by perf list.
  2. ix86arch (Intel X86 architectural PMU): These are performance counters available on all Intel x86 architectures.
  3. snb (Intel Sandy Bridge): These are the counters specific to the Sandy Bridge microarchitecture.

After identifying an interesting counter you can pass it to perf stat with -e. For the specific case of LLC misses I found three counters that seem relevant, one from each of the three different sets:

  1. cache-misses, which is in the standard list from perf list.
  2. L3_LAT_CACHE:MISS
  3. LLC_MISSES

What is nice about showevtinfo is that it includes a description for every machine-specific counter. Moreover, if you are profiling on an Intel machine, you can find the whole list of available counters in the Intel Software Developer's Manual. You can use the check_events program that comes with libpfm to translate the name of a counter into a code that can be passed to perf, e.g.:

Requested Event: LAST_LEVEL_CACHE_MISSES
Actual    Event: snb::L3_LAT_CACHE:MISS:k=1:u=1:e=0:i=0:c=0:t=0
PMU            : Intel Sandy Bridge
IDX            : 142606383
Codes          : 0x53012e

And then use the code at the end:

sudo perf stat -r 10 -a -e cache-misses,r53012e,r53412e <command>

10,553,469 cache-misses                                                  ( +-  1.60% ) [100.00%]
10,556,094 r53012e                                                       ( +-  1.60% ) [100.00%]
10,557,004 r53412e                                                       ( +-  1.60% )

This confirms that all these counters do in fact refer to the same thing.
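The agreement is no surprise once you decode the raw codes: on x86, perf's rNNN values follow Intel's PERFEVTSEL register layout, with the event select in bits 0-7 and the unit mask in bits 8-15, so r53012e and r53412e differ only in the umask byte. A quick shell check (the decoding below is my own illustration, not part of the original measurement):

```shell
# Decode the event-select and unit-mask bytes of the two raw codes.
CODE1=0x53012e   # snb::L3_LAT_CACHE:MISS, as reported by check_events
CODE2=0x53412e   # the second raw code passed to perf stat above

printf 'event=0x%02x umask=0x%02x\n' $((CODE1 & 0xff)) $(((CODE1 >> 8) & 0xff))
# prints event=0x2e umask=0x01
printf 'event=0x%02x umask=0x%02x\n' $((CODE2 & 0xff)) $(((CODE2 >> 8) & 0xff))
# prints event=0x2e umask=0x41
```

Event 0x2E with umask 0x41 is Intel's architectural "longest latency cache miss" encoding, which is why both codes count essentially the same events.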

Finally, you can multiply these values by the size of a cache line to estimate the amount of data transferred.
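As a concrete sketch, using the cache-misses count and elapsed time from the perf output in the question, and assuming each miss transfers one 64-byte cache line (the line size on Sandy Bridge; `getconf LEVEL3_CACHE_LINESIZE` reports it on Linux). Note this is only an estimate: it does not account for prefetcher traffic or write-backs.

```shell
# Estimate DRAM traffic and throughput from an LLC/cache-miss count.
MISSES=373507          # cache-misses from the perf output in the question
LINE_SIZE=64           # bytes per cache line (assumed; verify with getconf)
ELAPSED=0.035855129    # "seconds time elapsed" from the same run

BYTES=$((MISSES * LINE_SIZE))
echo "bytes transferred: $BYTES"
# prints bytes transferred: 23904448

awk -v b="$BYTES" -v t="$ELAPSED" \
    'BEGIN { printf "throughput: %.1f MB/s\n", b / t / 1e6 }'
```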

igon