streaming loads and non USWC memory

Question

I just read a rather interesting article, Copying Accelerated Video Decode Frame Buffers, which explains how to copy from USWC memory as fast as possible using streaming loads.

My question is: why would this technique not also speed up normal copies from non-USWC memory?

A streaming load would read an entire cache line in one go, whereas a regular load only loads 16 bytes at a time. What am I missing? Copying from a fill buffer to the "cache buffer", which will then be written to cache, can't have much overhead, can it?

+1 for suggestive title (A steaming load is best dumped raw) — sehe, May 16 '11 at 07:41
The description in your last paragraph is completely backwards. Streaming load/store means completely **bypassing** the cache, whereas regular load/store (`MOVDQA`) are performed with the help of the cache. Also keep in mind a single cache line is typically wider than the SIMD register length on each architecture. — rwong, Mar 23 '15 at 16:06

score 7 · Accepted Answer · answered May 16 '11 at 12:22

From http://software.intel.com/en-us/articles/increasing-memory-throughput-with-intel-streaming-simd-extensions-4-intel-sse4-streaming-load/

"The streaming load instruction is intended to accelerate data transfers from the USWC memory type. For other memory types such as cacheable (WB) or Uncacheable (UC), the instruction behaves as a typical 16-byte MOVDQA load instruction. However, future processors may use the streaming load instruction for other memory types (such as WB) as a hint that the intended cache line should be streamed from memory directly to the core while minimizing cache pollution."

That is, "normal" memory is WB, and hence there is no advantage to using non-temporal loads over normal ones on current processors. Also, for normal cacheable memory, the first load of a cache line will pull the entire cache line into L1, similar to how the first non-temporal load will pull an entire cache line into the special "non-temporal buffer".
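As a rough illustration of the technique being discussed, here is a minimal C sketch of a copy loop built on the SSE4.1 streaming load intrinsic `_mm_stream_load_si128` (MOVNTDQA). The function names `copy_streaming_load` and `copy_sse41` are made up for this example, the 64-byte alignment and size requirements are assumptions of the sketch, and it assumes GCC or Clang (for the `target` attribute and `__builtin_cpu_supports`). It is not the code from the Intel article.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>

/* Hypothetical helper: copies `bytes` bytes in 64-byte chunks.
 * Each chunk is read with four 16-byte streaming loads (MOVNTDQA)
 * that together cover one 64-byte cache line, then written with
 * ordinary aligned stores. */
__attribute__((target("sse4.1")))
static void copy_sse41(uint8_t *dst, const uint8_t *src, size_t bytes)
{
    for (size_t i = 0; i < bytes; i += 64) {
        /* Cast away const: some headers declare the intrinsic
         * with a non-const pointer parameter. */
        __m128i *s = (__m128i *)(src + i);
        __m128i *d = (__m128i *)(dst + i);

        /* Issue all loads from the same line back to back. */
        __m128i a = _mm_stream_load_si128(s + 0);
        __m128i b = _mm_stream_load_si128(s + 1);
        __m128i c = _mm_stream_load_si128(s + 2);
        __m128i e = _mm_stream_load_si128(s + 3);

        _mm_store_si128(d + 0, a);
        _mm_store_si128(d + 1, b);
        _mm_store_si128(d + 2, c);
        _mm_store_si128(d + 3, e);
    }
}
#endif

/* Hypothetical entry point. Assumption of this sketch: both
 * pointers are 64-byte aligned and `bytes` is a multiple of 64. */
void copy_streaming_load(void *dst, const void *src, size_t bytes)
{
    assert(((uintptr_t)dst & 63) == 0);
    assert(((uintptr_t)src & 63) == 0);
    assert(bytes % 64 == 0);

#if defined(__x86_64__) || defined(__i386__)
    if (__builtin_cpu_supports("sse4.1")) {
        copy_sse41((uint8_t *)dst, (const uint8_t *)src, bytes);
        return;
    }
#endif
    /* Fallback when SSE4.1 is unavailable. */
    memcpy(dst, src, bytes);
}
```

Going by the quote above, when `src` points at ordinary WB memory each `_mm_stream_load_si128` behaves like a plain 16-byte MOVDQA load, so this loop would be expected to perform about the same as a regular SSE copy. Any gain should only show up when `src` is mapped as USWC.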

As the quote above says, future processors may use the non-temporal load as a hint to not pollute the cache. That might be a good idea in some cases, but maybe not the right choice for a general-purpose memcpy() implementation?

Right, `memcpy` output is often used right away, so you might get a faster `memcpy`, but the code right after it could be slowed by all the cache misses. (see http://svn.0x00ff00ff.com/mirror/package/avisynth/x86/FilterSDK/IsMovntqFaster.htm) — Peter Cordes, Apr 30 '15 at 21:31
