
I'm attempting to use VsPerfCmd.exe to profile branch mispredictions and last-level cache (LLC) misses in an instrumented native application.

The setup works as it says on the tin, but the results I'm getting don't seem sensible. For instance, a function that touches a 24 MB data set on every call is reported to cause only ~700 cache misses across ~2000 calls. Let me put this into perspective: the function linearly traverses two arrays of 1024*1024 12-byte elements each. For every element, it randomly decides whether it needs information from the element 1024 indices before or after it. That means that, in order not to generate any cache misses, the CPU would always have to keep at least three sections of 1024*12 bytes from each of these arrays in cache. Furthermore, after every iteration the process yields the CPU using sleep() for about 8 milliseconds. I can't imagine any hardware prefetcher doing that good a job.
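For reference, here is a minimal sketch of the access pattern described above (the element layout, RNG, and the use of the Windows Sleep() call are placeholders of mine, not the actual code):

    #include <cstddef>
    #include <cstdint>
    #include <random>
    #include <vector>
    #include <windows.h> // Sleep()

    // Placeholder 12-byte element; the real struct differs.
    struct Element { std::int32_t a, b, c; };

    constexpr std::size_t kCount = 1024 * 1024; // elements per array
    constexpr std::size_t kJump  = 1024;        // random lookback/lookahead distance

    // One pass: linearly traverse both arrays; for each element, randomly
    // read the element 1024 indices before or after the current one.
    void traverse(const std::vector<Element>& x, const std::vector<Element>& y,
                  std::mt19937& rng)
    {
        std::bernoulli_distribution backwards(0.5);
        std::int64_t sink = 0;
        for (std::size_t i = kJump; i + kJump < kCount; ++i) {
            std::size_t j = backwards(rng) ? i - kJump : i + kJump;
            sink += x[i].a + x[j].b + y[i].a + y[j].b;
        }
        volatile std::int64_t keep = sink; // keep the reads from being optimized away
        (void)keep;
        Sleep(8); // yield the CPU for ~8 ms after each pass
    }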

How could this much data not generate more last-level cache misses than VsPerfCmd reports? My i7 has 8 MB of shared L3 cache, so the 24 MB working set cannot even fit; the number seems highly unlikely. Can anyone share an opinion on what might be going on here? Of course "VsPerfCmd.exe sucks" would be a valid answer, but if someone is going to say that, I'd at least like to hear about a similar experience as a basis for the assertion.

Koarl

2 Answers


Answering my own question: after trying to verify the VsPerfCmd results using Intel VTune Amplifier XE™ (this is not advertising, I just like typing out product names in full because it amuses me how silly they can be), I can definitely say that they are garbage.

That's just a rough comparison, as I haven't found out how to get the number of times a function was called from VTune, but approximately 900 calls resulted in about 1,040,000 last-level cache misses, according to VTune. Contrasting that with the ~2000 calls profiled with VsPerfCmd and the reported ~700 LLC misses, it's safe to assume that the VTune results are much more reasonable.

Of course I can't say anything more specific than "VsPerfCmd was very likely wrong" - the whys and hows of this phenomenon remain unclear. Should anyone who knows more feel an urge to elaborate on this, shoot me a comment!

Koarl

First off, the hardware LLC miss counter (let's call it that) does not actually count LLC misses in your particular application. What it does is count all LLC misses and compare the running total with a preset threshold called the SAV (sample after value), which is usually on the order of thousands or even millions. When the count reaches the SAV, an interrupt is raised and the instruction pointer (IP) at that moment is saved in the trace alongside the counter value and a timestamp (sampling like this is what keeps the trace a reasonable size). If that IP points to an instruction in your module, then all of those cache misses are attributed to your module/function/instruction. So the resulting picture you see is not exact, but rather statistically representative. I've not worked with VsPerfCmd, but what could help is to check the SAV it sets for LLC misses. If it's orders of magnitude larger than what VTune sets, then it could be that you're really comparing 700,000 LLC misses with 1,040,000 LLC misses, which would make much more sense.
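To make that concrete, here is a back-of-the-envelope sketch of how sampled counts scale back up to event counts (the SAV value of 1000 below is made up purely for illustration; I don't know what VsPerfCmd actually uses):

    #include <cstdint>
    #include <cstdio>

    // Each recorded sample stands for SAV actual hardware events, so the
    // estimated event count for a function is its sample count times the SAV.
    std::uint64_t estimate_events(std::uint64_t samples, std::uint64_t sav)
    {
        return samples * sav;
    }

    int main()
    {
        // Hypothetical numbers: if the "~700" that VsPerfCmd reports were
        // raw sample counts taken with SAV = 1000, they would correspond to
        // ~700,000 LLC misses - the same order of magnitude as VTune's
        // ~1,040,000.
        std::printf("%llu\n",
                    static_cast<unsigned long long>(estimate_events(700, 1000)));
    }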

And then there's the subject of your application's workload and working set. 3 x 1024 x 12 B is only 36 KB, which is nothing for an 8 MB LLC. If the algorithm jumps back and forth uniformly, rather than always forward or always backward, then only a small portion of those 24 MB is being used frequently at any time, which means the hottest data will most likely fit in the LLC as well. Additionally, the CPU only sees memory in chunks called cache lines, which are 64 bytes long. So whenever your algorithm jumps forward or backwards to access the next 12 bytes, the 52 neighboring bytes are loaded into L1 along with them; if the next step after the jump is *(ptr++), it will not result in a cache miss (see the quick arithmetic sketch below).

Yielding the CPU for 8 milliseconds should not affect cache performance unless you suspect that the thread scheduled for the next quantum is doing something else that is also memory-intensive, which would cause your data's cache lines to be evicted. Otherwise, if it's just some OS thread progressing in the background touching a few bytes, massive cache eviction should not be happening.
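Going back to the cache-line point, a quick arithmetic sketch for the arrays described in the question (assuming the usual 64-byte lines):

    #include <cstdio>

    int main()
    {
        const unsigned long long kElems    = 1024ULL * 1024ULL; // elements per array
        const unsigned long long kElemSize = 12;                // bytes per element
        const unsigned long long kLineSize = 64;                // bytes per cache line

        const unsigned long long bytesPerArray = kElems * kElemSize;       // 12 MB
        const unsigned long long linesPerArray = bytesPerArray / kLineSize;

        // A 64-byte line holds 64/12 ~ 5.3 elements, so a linear scan only
        // pulls in a new line once every ~5 element accesses; the rest hit
        // data that was fetched along with a neighbor.
        std::printf("bytes/array: %llu, lines/array: %llu, elements/line: %.2f\n",
                    bytesPerArray, linesPerArray,
                    static_cast<double>(kLineSize) / kElemSize);
    }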

Anton Pegushin
  • Thank you very much for your insightful suggestions! I'll definitely check the VsPerfCmd documentation for the SAV threshold. – Koarl Apr 26 '12 at 18:48
  • EDIT: As for cache lines and such, I know about that; it's just that after consulting a colleague who works on optimizing console games, I couldn't imagine the algorithm performing that well. Of course, the things he pointed out were said with relatively simple console hardware in mind, and the more sophisticated caching mechanisms on PC CPUs might well make these results possible. – Koarl Apr 26 '12 at 18:54