
I am in need of a simple and portable way to explicitly prefetch data. I do not want to use a feature specific to any particular compiler or platform, just something generic enough to work across different platforms and compilers.

One very naive solution that comes to mind is to just move a byte/int from the memory location into a register; that "should" bring the containing memory into the CPU cache to fill a line, or at least that is what I logically assume. But maybe it won't be that easy? One possibility is that the compiler optimizes away the operation if the data is not accessed in that particular scope, so no prefetching will occur.
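The "dummy read" idea above can be sketched like this, using a `volatile` access to keep the compiler from eliding the load (the helper name `touch` is my own). Note this only sidesteps the optimizer: it is still a real load, so unlike a true prefetch it can fault on a bad address, and whether it actually warms the cache is an assumption about the hardware, not a guarantee.

```c
#include <stdint.h>

/* Sketch of the question's naive approach: force an actual memory read
 * so the compiler cannot optimize it away, hoping the line lands in cache. */
static inline void touch(const void *addr)
{
    /* The volatile qualifier makes this read observable behaviour,
     * so the compiler must emit the load even if the value is unused. */
    (void)*(volatile const char *)addr;
}
```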

dtech
  • You'll need to watch out for the compiler optimising away your memory reads as it'll think they are not doing anything. – Ian Cook Feb 13 '14 at 17:28
  • There is no portable way of doing this in c or c++ because there is no guarantee that you are compiling to native code on a register machine. e.g. interpreted c++ - http://root.cern.ch/drupal/content/cling and compiling to the jvm - http://nestedvm.ibex.org – user1937198 Feb 13 '14 at 17:33
  • `_mm_prefetch` should be fairly portable. Contrary to what MSDN page says SSE intrinsics are not Microsoft specific and are available on at least a few of the most popular compilers (GCC, Clang, Intel, MSVC). – user2802841 Feb 13 '14 at 18:12

1 Answer


Generally speaking, prefetching and memory loads are not exactly the same operations. There are a few fundamental differences:

  1. Prefetching an invalid address does not generate a fault, whereas attempting to read, write or execute an invalid address does (if the CPU has an MPU/MMU, of course).
  2. Prefetching can be done in anticipation of reading and/or writing, whereas reading a byte into a register is just that: a read.
  3. You can (theoretically) specify memory locality when prefetching.
  4. CPU might have special instructions for prefetching that are not the same as memory load instructions.

So just stick with `__builtin_prefetch` and let the compiler do the hard work.
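If portability across compilers is the concern, the intrinsic can be hidden behind a thin wrapper that degrades to a no-op where it is unavailable; since a prefetch is only a hint, the no-op fallback is still correct. This is a sketch under my own macro names, not a standard API:

```c
/* Portable-ish prefetch hints: use the GCC/Clang/ICC intrinsic where
 * available, otherwise compile to nothing. A prefetch is advisory only,
 * so doing nothing is a valid implementation. */
#if defined(__GNUC__) || defined(__clang__)
#  define PREFETCH_READ(p)  __builtin_prefetch((p), 0, 3) /* read, high locality */
#  define PREFETCH_WRITE(p) __builtin_prefetch((p), 1, 3) /* write, high locality */
#else
#  define PREFETCH_READ(p)  ((void)0)
#  define PREFETCH_WRITE(p) ((void)0)
#endif
```

On MSVC one could extend the `#if` chain with `_mm_prefetch` from `<xmmintrin.h>`, as one of the comments above suggests.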

Also, keep in mind that optimizing compilers may generate prefetch instructions automatically. I guess if they do, then you'd have to make sure you do not interfere with that.

Another interesting thing is that, in general, explicit prefetching often does not improve performance and may even slightly degrade it. See this LWN article for details and an explanation of why explicit prefetching was removed from the Linux kernel.

Hope it helps. Good Luck!

  • I want to prefetch the next node while processing the current, I don't think compilers look that much "further ahead". The nodes are not sequential in memory, so I do not expect the CPU hardware prefetchers to do any good as well. – dtech Feb 13 '14 at 17:46
  • @ddriver: Do not guess, make a change and profile. As kernel developers have proven, manual prefetching degrades performance in general case (just read the article). So it is a harmful false optimization. –  Feb 13 '14 at 17:52
  • @ddriver Has doing it non-portable way (`__builtin_prefetch`, etc.) resulted in any speed up? There is little point thinking about most portable way otherwise. In my limited experience, every time I used manual prefetching it ended up being slower or at most similar. – user2802841 Feb 13 '14 at 17:53
  • I get a tangible boost using `__builtin_prefetch` in that particular scenario, however I do agree, OVERusing anything is usually detrimental. – dtech Feb 13 '14 at 17:55
  • @ddriver, don't forget that the CPU runs ahead over the code (at least as far as its queues can hold), so once the next nodes' address is known it's likely to be fetched by an actual load, so prefetching it would be redundant. Have you tried to prefetch several nodes ahead? – Leeor Feb 15 '14 at 21:29
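The pattern discussed in the comments, prefetching the next node while processing the current one, might look like the sketch below. The function and field names are my own, and the `__builtin_prefetch` call is the GCC/Clang-specific intrinsic rather than anything portable; as the comments note, whether this helps at all must be verified by profiling:

```c
#include <stddef.h>

struct node {
    int value;
    struct node *next;
};

/* Walk a linked list, hinting at the next node while summing the current
 * one. Summation stands in for whatever real per-node work is done. */
static long sum_list(const struct node *n)
{
    long total = 0;
    while (n != NULL) {
        if (n->next != NULL)
            __builtin_prefetch(n->next, 0, 3); /* read, high temporal locality */
        total += n->value;                      /* "process" the current node */
        n = n->next;
    }
    return total;
}
```

Prefetching only one node ahead may be redundant if the CPU's out-of-order window already reaches the next load, which is Leeor's point above; prefetching several nodes ahead requires keeping extra pointers around.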