Yes, the PMU unit in your CPU can probably do what you want through the various uncore counters - in particular they can count the various offcore responses for non-local memory access. This blog post is a reasonable starting point.
The main problem is that often the perf
tool, which is tied to the specific kernel version, will lag behind in its support of modern processors1, especially when it comes to uncore and NUMA related events2.
To work around that, you can use Andi Kleen's pmu-tools, which provides an ocperf
wrapper script that uses whatever underlying perf
you have on your system but with up-to-date event ids downloaded directly from Intel. That will usually give you access to the uncore events you need.
Of course, even when you get that working, these events are often very tough to interpret, especially because the mental model you have of demand-memory requests is complicated by a ton of factors such as prefetch behavior, request-for-ownership, accesses that "hit" in a line-buffer in the process of being filled, etc, etc.
1 Both because adding new processors/events as some lag, but especially because the tool is tied to the kernel, and you likely aren't on a bleeding edge kernel, so even though mainline perf
might have support, you are stuck with the perf
version associated with your kernel.
2 Probably because most kernel developers, like developers in general, aren't working on NUMA systems.