We've been given an assignment to compile some code (which we're supposed to treat as a black box) with different Intel compiler optimization flags (-O1 and -O3) as well as vectorization-related flags (-xhost and -no-vec), and to observe the changes in:
- Execution Time
- Floating Point Operations (FPOs)
- L2 and L3 Cache Miss Rate
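For concreteness, the builds we compare look like the comment block in the sketch below. The kernel itself is only an illustrative stand-in (the real assignment code is a black box to us), but it has the shape of loop the compiler would optimize and vectorize:

```c
/* Illustrative stand-in only -- the real assignment code is a black box.
 * Builds compared (Intel C compiler):
 *   icc -O0 -no-vec main.c -o run_baseline
 *   icc -O1         main.c -o run_o1
 *   icc -O3         main.c -o run_o3
 *   icc -O3 -xhost  main.c -o run_xhost
 */
#include <stdio.h>

#define N (1 << 22)

static double a[N], b[N], c[N];

int main(void)
{
    for (int i = 0; i < N; i++) {   /* init: streaming writes */
        a[i] = i * 0.5;
        b[i] = i * 0.25;
    }

    double sum = 0.0;
    for (int i = 0; i < N; i++) {   /* FP-heavy loop the compiler can vectorize */
        c[i] = a[i] * b[i] + c[i];
        sum += c[i];
    }

    printf("checksum: %f\n", sum);  /* keep the result live so the loops aren't optimized away */
    return 0;
}
```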
After applying these optimizations we saw a drop in execution time, which was expected given all the transformations the compiler makes for the sake of efficiency. However, we also saw a drop in the number of FPOs; we understand that fewer operations is a good thing, but we're not sure why it happened. We also saw (and cannot explain) an increase in the L2 cache miss rate, growing as the optimization level increased, yet no significant increase in L2 cache accesses and almost no change at the L3 level.
Using no vectorization or optimization at all produced the best result in terms of L2 cache miss rate. Could you give us some insight into why this happens, along with documentation, literature, or other resources we can use to deepen our knowledge of this topic?
Thank you.
edit: The compiler options used are:
- -O0 -no-vec (highest execution time, lowest L2 cache misses)
- -O1 (lower execution time, higher L2 cache misses)
- -O3 (even lower execution time, even higher L2 cache misses)
- -xhost (execution time on the same order as -O3, highest L2 cache misses)
Update:
While there is a modest decrease in overall L2 cache accesses (from roughly 1.48 billion to 1.05 billion), there is a much larger increase in actual misses (from roughly 207 million to 547 million).
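For reference, the counters below were read from hardware performance counters. The sketch that follows shows one way to collect such counts, using PAPI preset events; this is purely illustrative and not necessarily the exact harness we used, and event availability depends on the CPU (check `papi_avail`):

```c
/* Minimal PAPI harness for the L2/L3 counters reported below.
 * Link with -lpapi; preset events may be unavailable on some CPUs. */
#include <stdio.h>
#include <stdlib.h>
#include <papi.h>

/* Placeholder workload standing in for the black-box assignment code. */
static double work(void)
{
    static double a[1 << 22];
    double s = 0.0;
    for (int i = 0; i < (1 << 22); i++) { a[i] = i * 0.5; s += a[i]; }
    return s;
}

int main(void)
{
    int events[4]  = { PAPI_L2_TCM, PAPI_L2_TCA, PAPI_L3_TCM, PAPI_L3_TCA };
    long long c[4] = { 0 };
    int es = PAPI_NULL;

    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) {
        fprintf(stderr, "PAPI init failed\n");
        return EXIT_FAILURE;
    }
    PAPI_create_eventset(&es);
    if (PAPI_add_events(es, events, 4) != PAPI_OK) {
        fprintf(stderr, "some preset events are unavailable on this CPU\n");
        return EXIT_FAILURE;
    }

    PAPI_start(es);
    double checksum = work();          /* measured region */
    PAPI_stop(es, c);

    printf("checksum: %f\n", checksum);
    printf("L2 miss rate: %f\n", (double)c[0] / (double)c[1]);
    printf("L3 miss rate: %f\n", (double)c[2] / (double)c[3]);
    return 0;
}
```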
With -O0 -no-vec
Wall clock time in usecs: 13,957,075
- L2 cache misses: 207,460,564
- L2 cache accesses: 1,476,540,355
- L2 cache miss rate: 0.140504
- L3 cache misses: 24,841,999
- L3 cache accesses: 207,460,564
- L3 cache miss rate: 0.119743
With -xhost
Wall clock time in usecs: 4,465,243
- L2 cache misses: 547,305,377
- L2 cache accesses: 1,051,949,467
- L2 cache miss rate: 0.520277
- L3 cache misses: 86,919,153
- L3 cache accesses: 547,305,377
- L3 cache miss rate: 0.158813
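(The miss rates above are simply misses divided by accesses, and the L3 access count is the L2 miss count. A quick sanity check on the raw numbers:)

```c
#include <stdio.h>

int main(void)
{
    /* raw counters copied from the two runs above */
    printf("-O0 -no-vec L2 miss rate: %f\n", 207460564.0 / 1476540355.0); /* ~0.140504 */
    printf("-O0 -no-vec L3 miss rate: %f\n", 24841999.0  / 207460564.0);  /* ~0.119743 */
    printf("-xhost      L2 miss rate: %f\n", 547305377.0 / 1051949467.0); /* ~0.520277 */
    printf("-xhost      L3 miss rate: %f\n", 86919153.0  / 547305377.0);  /* ~0.158813 */
    return 0;
}
```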