
I'm trying to count the number of cache hits at the different cache levels (L1, L2 and L3) for a program on an Intel Haswell processor.

I wrote a program to count the number of L2 and L3 cache hits by monitoring the corresponding events. To do that, I checked the Intel Software Developer's Manual (SDM) and used the all-request event and the miss event for the L2 and L3 caches. However, I didn't find equivalent events for the L1 cache. Maybe I missed something?
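Something along these lines (a minimal sketch; the event names below are perf's Haswell spellings of the all-request and miss events):

```sh
# Count L2 and L3 (LLC) requests and misses; hits = references - misses.
# l2_rqsts.* and longest_lat_cache.* are perf's Haswell event names.
perf stat -e l2_rqsts.references,l2_rqsts.miss \
          -e longest_lat_cache.reference,longest_lat_cache.miss \
          ./my_program
```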

My question is:

Which event number and UMASK value should I use to count L1 cache hit events?

Clarifications:

1) The final goal I want to achieve is to upper-bound a program's execution time when all of its cache hits become cache misses. If I can count the number of cache hit requests, I can treat them as cache misses and calculate the increased execution time.

2) I checked the event MEM_LOAD_UOPS_RETIRED.L1_HIT in the Intel SDM; it says "Retired load uops with L1 cache hits as data sources." I'm not sure whether one uop takes one cycle. Is there any reference on how to convert uops to cycles? (See the sketch after these clarifications for how I would count this event.)

3) It would be better to count both loads and stores, though I can tolerate not counting store requests.
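Regarding clarification 2: if MEM_LOAD_UOPS_RETIRED.L1_HIT is the right event, my reading of the SDM's Haswell event tables is event 0xD1 with umask 0x01 (please correct me if I misread them). A sketch of how I would count it with perf's raw-event syntax:

```sh
# Count retired load uops whose data source was the L1D cache.
# 0xD1/0x01 is my reading of the SDM's Haswell tables for
# MEM_LOAD_UOPS_RETIRED.L1_HIT - please double-check.
perf stat -e cpu/event=0xd1,umask=0x01,name=l1_hit/ ./my_program
```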

Thank you so much for your help!

Mike
  • There's `mem_load_retired.l1_hit`. (Use [the ocperf.py wrapper](https://github.com/andikleen/pmu-tools) so you don't need the event/umask numbers, or get the numbers from it; both options are sketched after these comments.) Alternatively, use `L1-dcache-loads - L1-dcache-load-misses`. (There are far fewer HW events that count stores, mostly because the CPU doesn't have to wait for them and they don't commit until after the store instruction retires. **Did you want hits for loads+stores, or are you ok with just loads?**) – Peter Cordes Mar 01 '18 at 05:31
  • @PeterCordes Thank you so much for your help! I modified my question to answer your question and clarify my questions. I hope you could guide me. Thanks! – Mike Mar 07 '18 at 15:27
  • How the heck are you planning to *calculate* the increased execution time? You're going to need to know which loads were dependent on other loads, to figure out how many cache misses can be in flight at once (memory parallelism). And exactly how well the out-of-order core can hide the latency of cache misses. If there is independent work that doesn't depend on a load, it can be executed and ready to retire once the load completes. – Peter Cordes Mar 07 '18 at 20:57
  • As Peter says, you probably can't make the calculation you suggest in any general, reasonable way. – BeeOnRope Mar 07 '18 at 21:01
  • @PeterCordes Yes, I understood what you said. I totally agree that precisely calculating the increased execution time is impossible. I intended to **upper bound** a program's execution time if all of its cache hit requests become cache miss. So I just assume when an extra cache miss occurs, the following instructions cannot be retired until the cache miss request is served. Does that make more sense? Thank you very much! – Mike Mar 08 '18 at 03:02
  • @BeeOnRope Thank you very much for your comment! How about **upper-bounding** the increased execution time? It should be doable if I know the number of extra cache misses. I agree it could be pessimistic, but it provides some hints on how slow the program will be if the cache is disabled. – Mike Mar 08 '18 at 03:04
  • That won't give you an upper bound. Skylake's retirement throughput is something like 8 uops per clock (or maybe even 16 per clock if both threads have instructions to retire). Note that out-of-order execution + in-order retirement tends to lead to bursty retirement when an old instruction finally finishes executing, and retirement faster than the issue bottleneck can maybe free up resources earlier. Anyway, to get anything like an upper bound, **you need to account for which dependency chains involve the loaded data, because they can't *execute* until it completes.** So latency... – Peter Cordes Mar 08 '18 at 03:36
  • Basically you need to simulate the whole out-of-order machinery. For an upper bound you could maybe simulate as if it was an in-order machine that can't start executing a later instruction until a cache-miss load completes, but that would be an extremely loose upper bound, to the point of being useless. The whole idea sounds very unrealistic anyway, because store/reload to the stack is very common, and won't in practice miss. And store-forwarding still works even when the line isn't present in L1D cache. – Peter Cordes Mar 08 '18 at 03:40
  • *Is there any reference about how to transfer uops to cycles?* See http://agner.org/optimize/, and other performance links in https://stackoverflow.com/tags/x86/info, for more details on the microarchitectures of modern x86 CPUs. It is *not* simple; surrounding code and out-of-order execution cannot be ignored. Skylake's ROB is 224 uops, and its scheduler is 97 uops. So it can have 97 uops waiting for inputs / execution resources in flight at once. – Peter Cordes Mar 08 '18 at 03:43
  • @Mike - you could do something like multiply the number of hypothetical additional misses (if all hits were misses) by the latency to DRAM (typically 50 to 100ns on modern hardware) for some sort of upper bound (a back-of-envelope version is sketched after these comments). I'm not even sure this is a strict upper bound (since due to scheduling effects, long misses might reduce the _concurrency_ as well as the latency), but I suspect it is an upper bound for most real world code. The problem is that this upper bound is terribly loose: I suspect a lot of code could perform an order of magnitude better (e.g., see Peter's comments). – BeeOnRope Mar 09 '18 at 05:09
  • Perhaps if you explained your motivation or use-case for this calculated value some better advice could be given. – BeeOnRope Mar 09 '18 at 05:09
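A minimal sketch combining the suggestions above (assumes a Linux Haswell machine with perf, plus the pmu-tools ocperf.py wrapper from the first comment; on Haswell the event is spelled mem_load_uops_retired.l1_hit, and the numbers in the bound calculation are made up for illustration):

```sh
# 1) Count L1 load hits, per the first comment. ocperf.py resolves the
#    symbolic event name to the raw event/umask encoding for this CPU:
ocperf.py stat -e mem_load_uops_retired.l1_hit ./my_program

# ... or derive hits from perf's generic events: hits = loads - misses.
perf stat -e L1-dcache-loads,L1-dcache-load-misses ./my_program

# 2) Back-of-envelope bound from the last comments: hypothetical extra
#    misses multiplied by DRAM latency. Very loose, as the comments note.
HITS=10000000        # hypothetical: L1 hits that would become misses
DRAM_NS=100          # typical DRAM latency is ~50-100 ns
echo "rough added time: $((HITS * DRAM_NS)) ns"
```

As the comments stress, treating this product as a true upper bound ignores memory-level parallelism and out-of-order execution, so it is at best a rough estimate.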

0 Answers