Tools to help profile cache misses

Question

What tools does the community use to help identify whether cache misses are even a problem and, if they are, where they are occurring in the code?

The first question is:

How do I identify how much time is being spent waiting for data from main memory as a result of cache misses? Will a sampling profiler like OProfile attribute time to functions waiting on this data? For instance, they will not attribute time to functions waiting on data from disk reads, so one has to wonder if the same is true of waiting for data from memory.

The second question is: If I identify that cache misses are indeed a bottleneck, how do I identify what parts of the code are requesting the uncached memory? Should I use OProfile with LLC_MISSES as the event? Are there other tools that I don't know about? I prefer to stay away from proprietary solutions unless there is a compelling reason to use them, as I don't want to be locked into a certain toolchain in the future.

Thanks for your help!

You should let us know what technology you are using and what you are actually profiling. — Udo Held, Dec 14 '11 at 21:54
I'm afraid I'm not sure what you mean by "what technology you are using". I use gcc to compile on the x86 architecture. I run Fedora 15. I normally use OProfile to profile. The code is a large numerical computation -- it is not clear what the memory access pattern is. Therefore it is not clear whether or not cache misses are even an issue. I can use Cachegrind to profile cache misses, but how do I know if they are even worth optimizing away? For example -- say function foo generates 100000 misses. How do I know how much of the runtime that contributes? — joe2748, Dec 15 '11 at 02:01

score 0 · Answer 1 · answered Apr 28 '12 at 05:56

"Will a sampling profiler like OProfile attribute time to functions waiting on this data? For instance, they will not attribute time to functions waiting on data from disk reads, so one has to wonder if the same is true of waiting for data from memory."

A: Yes. On a single-threaded CPU, a sampling profiler like OProfile or VTune will attribute time to functions that are waiting on cache misses.

This works because the hardware thread that is taking the cache miss is still running. No existing x86 CPU does SoEMT (Switch-on-Event Multithreading). It doesn't work for OS / disk waits because the process waiting on disk is switched out.

It actually still works on a multithreaded CPU, such as one with Intel Hyper-Threading, but results are sometimes cleaner if Hyper-Threading is disabled. Similarly for AMD cluster threading on Bulldozer: it should work, but maybe disable it just in case.
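If you want to convince yourself of this on your own machine, here is a minimal sketch (not from the original code, all function names are hypothetical). It sums the same array twice, once in address order and once with a large stride. Both functions do the same arithmetic, so any difference in the samples a profiler attributes to them should come mostly from memory stalls. The assumption is that the array is much larger than the last-level cache and that the stride is large enough to defeat the hardware prefetcher; neither is guaranteed on every CPU.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Visit the array in address order; consecutive loads share cache lines. */
uint64_t sum_sequential(const uint32_t *a, size_t n)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += a[i];
    return sum;
}

/* Visit the same elements exactly once each, but jump `stride` elements
 * between loads, so a large array with a large stride tends to miss. */
uint64_t sum_strided(const uint32_t *a, size_t n, size_t stride)
{
    assert(stride > 0);
    uint64_t sum = 0;
    for (size_t start = 0; start < stride; start++)
        for (size_t i = start; i < n; i += stride)
            sum += a[i];
    return sum;
}

/* Hypothetical driver: fill an array, run both traversals,
 * and return 1 if they agree on the sum (they always should). */
int run_demo(size_t n, size_t stride)
{
    uint32_t *a = malloc(n * sizeof *a);
    if (a == NULL)
        return 0;
    for (size_t i = 0; i < n; i++)
        a[i] = (uint32_t)(i & 0xff);
    int same = sum_sequential(a, n) == sum_strided(a, n, stride);
    free(a);
    return same;
}
```

To try it, call run_demo from a main with something like n = 64 * 1024 * 1024 and stride = 4096, build with gcc -O2 -g, and run it under your usual OProfile setup. If the reasoning above holds, sum_strided should collect noticeably more samples than sum_sequential even though both execute the same number of additions.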
