From http://software.intel.com/en-us/articles/increasing-memory-throughput-with-intel-streaming-simd-extensions-4-intel-sse4-streaming-load/
"The streaming load instruction is intended to accelerate data transfers from the USWC memory type. For other memory types such as cacheable (WB) or Uncacheable (UC), the instruction behaves as a typical 16-byte MOVDQA load instruction. However, future processors may use the streaming load instruction for other memory types (such as WB) as a hint that the intended cache line should be streamed from memory directly to the core while minimizing cache pollution."
That is, "normal" memory is WB, so on current processors a non-temporal load has no advantage over a normal one there. (The quote covers loads only; non-temporal stores are a separate matter, since they do bypass the cache on WB memory.) Also, for normal cacheable memory, the first load from a cache line pulls the entire line into L1, much as the first non-temporal load from USWC memory pulls an entire line into the special "non-temporal buffer".
As the quote above says, future processors may treat the non-temporal load as a hint not to pollute the cache. That might be a good idea in some cases, but perhaps not the right choice for a general-purpose memcpy() implementation.
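For reference, here is roughly what issuing the streaming load from C looks like. This is only a minimal sketch, not a tuned memcpy(): the helper name copy_with_streaming_loads is made up for illustration, it assumes GCC or Clang (for the target attribute), 16-byte-aligned pointers, a length that is a multiple of 16, and a CPU that supports SSE4.1. On WB memory the load in the loop should behave like an ordinary MOVDQA, per the quote above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__x86_64__) || defined(__i386__)
#include <smmintrin.h>   /* SSE4.1: _mm_stream_load_si128 (MOVNTDQA) */
#define HAVE_X86_SIMD 1
#endif

/* Hypothetical helper, for illustration only.
 * Copies n bytes using streaming (non-temporal) loads and ordinary stores.
 * Assumes dst and src are 16-byte aligned and n is a multiple of 16. */
#ifdef HAVE_X86_SIMD
__attribute__((target("sse4.1")))   /* GCC/Clang-specific */
#endif
void copy_with_streaming_loads(void *dst, const void *src, size_t n)
{
    assert(((uintptr_t)dst % 16) == 0);
    assert(((uintptr_t)src % 16) == 0);
    assert(n % 16 == 0);
#ifdef HAVE_X86_SIMD
    __m128i *d = (__m128i *)dst;
    __m128i *s = (__m128i *)src;    /* the intrinsic takes a non-const pointer */
    for (size_t i = 0; i < n / 16; i++) {
        /* MOVNTDQA: streaming on USWC memory; on WB or UC memory it
         * acts as a plain aligned 16-byte load. */
        __m128i v = _mm_stream_load_si128(s + i);
        _mm_store_si128(d + i, v);  /* ordinary aligned store */
    }
#else
    memcpy(dst, src, n);            /* fallback on non-x86 targets */
#endif
}
```

Calling copy_with_streaming_loads(dst, src, 64) on two aligned 64-byte buffers in ordinary heap or stack memory gives the same result as memcpy(); any benefit would only be expected when src is mapped as USWC (a graphics aperture, for example).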