We've been given an assignment to compile some code (which we're supposed to treat as a black box) with different Intel compiler optimization flags (-O1 and -O3) as well as vectorization-related flags (-xhost and -no-vec), and to observe the changes in:
- Execution Time
- Floating Point Operations (FPOs)
- L2 and L3 Cache Miss Rate
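For concreteness, the builds we compare look like the comment block in the sketch below. The kernel itself is only an illustrative stand-in (the real assignment code is a black box to us), but it has the shape of loop the compiler would optimize and vectorize:

```c
/* Illustrative stand-in only -- the real assignment code is a black box.
 * Builds compared (Intel C compiler):
 *   icc -O0 -no-vec main.c -o run_baseline
 *   icc -O1         main.c -o run_o1
 *   icc -O3         main.c -o run_o3
 *   icc -O3 -xhost  main.c -o run_xhost
 */
#include <stdio.h>

#define N (1 << 22)

static double a[N], b[N], c[N];

int main(void)
{
    for (int i = 0; i < N; i++) {   /* init: streaming writes */
        a[i] = i * 0.5;
        b[i] = i * 0.25;
    }

    double sum = 0.0;
    for (int i = 0; i < N; i++) {   /* FP-heavy loop the compiler can vectorize */
        c[i] = a[i] * b[i] + c[i];
        sum += c[i];
    }

    printf("checksum: %f\n", sum);  /* keep the result live so the loops aren't optimized away */
    return 0;
}
```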
After applying these optimizations we saw a drop in execution time, which was expected given all the transformations the compiler makes for the sake of efficiency. However, we also saw a drop in the number of FPOs; we understand that fewer operations is a good thing, but we're not sure why it happened. We also saw (and cannot explain) an increase in the L2 cache miss rate, growing as the optimization level increased, yet no significant increase in L2 cache accesses and almost no change at the L3 level.
Using no vectorization or optimization at all produced the best result in terms of L2 cache miss rate. Could you give us some insight into why this happens, along with documentation, literature, or other resources we can use to deepen our knowledge of this topic?
Thank you.
edit: The compiler options used are:
- -O0 -no-vec (highest execution time, lowest L2 cache misses)
- -O1 (lower execution time, higher L2 cache misses)
- -O3 (even lower execution time, even higher L2 cache misses)
- -xhost (execution time on the same order as -O3, highest L2 cache misses)
Update:
While there is a modest decrease in overall L2 cache accesses (from roughly 1.48 billion to 1.05 billion), there is a much larger increase in actual misses (from roughly 207 million to 547 million).
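For reference, the counters below were read from hardware performance counters. The sketch that follows shows one way to collect such counts, using PAPI preset events; this is purely illustrative and not necessarily the exact harness we used, and event availability depends on the CPU (check `papi_avail`):

```c
/* Minimal PAPI harness for the L2/L3 counters reported below.
 * Link with -lpapi; preset events may be unavailable on some CPUs. */
#include <stdio.h>
#include <stdlib.h>
#include <papi.h>

/* Placeholder workload standing in for the black-box assignment code. */
static double work(void)
{
    static double a[1 << 22];
    double s = 0.0;
    for (int i = 0; i < (1 << 22); i++) { a[i] = i * 0.5; s += a[i]; }
    return s;
}

int main(void)
{
    int events[4]  = { PAPI_L2_TCM, PAPI_L2_TCA, PAPI_L3_TCM, PAPI_L3_TCA };
    long long c[4] = { 0 };
    int es = PAPI_NULL;

    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) {
        fprintf(stderr, "PAPI init failed\n");
        return EXIT_FAILURE;
    }
    PAPI_create_eventset(&es);
    if (PAPI_add_events(es, events, 4) != PAPI_OK) {
        fprintf(stderr, "some preset events are unavailable on this CPU\n");
        return EXIT_FAILURE;
    }

    PAPI_start(es);
    double checksum = work();          /* measured region */
    PAPI_stop(es, c);

    printf("checksum: %f\n", checksum);
    printf("L2 miss rate: %f\n", (double)c[0] / (double)c[1]);
    printf("L3 miss rate: %f\n", (double)c[2] / (double)c[3]);
    return 0;
}
```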
With -O0 -no-vec
Wall clock time in usecs: 13,957,075
- L2 cache misses: 207,460,564
- L2 cache accesses: 1,476,540,355
- L2 cache miss rate: 0.140504
- L3 cache misses: 24,841,999
- L3 cache accesses: 207,460,564
- L3 cache miss rate: 0.119743
With -xhost
Wall clock time in usecs: 4,465,243
- L2 cache misses: 547,305,377
- L2 cache accesses: 1,051,949,467
- L2 cache miss rate: 0.520277
- L3 cache misses: 86,919,153
- L3 cache accesses: 547,305,377
- L3 cache miss rate: 0.158813
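(The miss rates above are simply misses divided by accesses, and the L3 access count is the L2 miss count. A quick sanity check on the raw numbers:)

```c
#include <stdio.h>

int main(void)
{
    /* raw counters copied from the two runs above */
    printf("-O0 -no-vec L2 miss rate: %f\n", 207460564.0 / 1476540355.0); /* ~0.140504 */
    printf("-O0 -no-vec L3 miss rate: %f\n", 24841999.0  / 207460564.0);  /* ~0.119743 */
    printf("-xhost      L2 miss rate: %f\n", 547305377.0 / 1051949467.0); /* ~0.520277 */
    printf("-xhost      L3 miss rate: %f\n", 86919153.0  / 547305377.0);  /* ~0.158813 */
    return 0;
}
```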