Off-chip memcpy?

Question

I was profiling a program today at work that does a lot of buffered network activity, and this program spent most of its time in memcpy, just moving data back and forth between library-managed network buffers and its own internal buffers.

This got me thinking: why doesn't Intel have a "memcpy" instruction that allows the RAM itself (or the off-CPU memory hardware) to move the data around without it ever touching the CPU? As it is, every word must be brought all the way down to the CPU and then pushed back out again, when the whole thing could be done asynchronously by the memory itself.

Is there some architectural reason that this would not be practical? Obviously sometimes the copies would be between physical memory and virtual memory, but those cases are dwindling with the cost of RAM these days. And sometimes the processor would end up waiting for the copy to finish so it could use the result, but surely not always.

http://software.intel.com/en-us/articles/increasing-memory-throughput-with-intel-streaming-simd-extensions-4-intel-sse4-streaming-load/ — Mitch Wheat, Aug 13 '11 at 01:43
http://www.jaist.ac.jp/iscenter-new/mpc/altix/altixdata/opt/intel/vtune/doc/users_guide/mergedProjects/analyzer_ec/mergedProjects/reference_olh/mergedProjects/instructions/instruct32_hh/vc195.htm — Mitch Wheat, Aug 13 '11 at 01:45
Sounds a lot like DMA used to copy hard drive contents into memory, except with memory copying to memory. Interesting idea. You could clear up the CPU for other processes, but you might have to have some heavy threading for your single app to take advantage of this. Also, this seems like a bad architecture if you're just copying memory all the time. Probably easier to just send a copy of the pointer to the start of the memory block. — Kibbee, Aug 13 '11 at 01:48
http://communities.intel.com/servlet/JiveServlet/previewBody/6872-102-1-10048/efficient_memory_copy_operations_on_SCC.pdf&ei=1tdFTumBDIbRrQeKxcjuAw&usg=AFQjCNFi-Pfo3VrGXM87FZxaC01hNKJckw — Mitch Wheat, Aug 13 '11 at 01:52
@Kibbee Buffered IO is a pretty standard pattern for networking. If I have a simple protocol that is something like "4 bytes indicating length, followed by a blob, followed by the next 4-byte length indicator, etc.", then reading blindly into a large buffer (say 128K), reading the length from the buffer, and then COPYING data out of the buffer into another buffer is often FASTER than executing 2 reads. That's because read is a system call and has to go into the OS kernel to read from the device. It's a slow process. Imagine how much faster it would be if the copy was off-chip. — SoapBox, Aug 13 '11 at 02:04
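The buffered pattern described in the comment above can be sketched roughly as follows. This is only an illustration: the helper name extract_frame is hypothetical, and the big-endian byte order of the 4-byte length is an assumption, since the comment does not specify one. The memcpy call is the copy that the question would like to offload.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: pull one frame out of buf[0..len).
 * Assumed wire format: 4-byte big-endian length, then the payload.
 * Returns the number of bytes consumed (header + payload), or 0 if the
 * buffer does not yet hold a complete frame or out is too small. */
static size_t extract_frame(const uint8_t *buf, size_t len,
                            uint8_t *out, size_t out_cap, size_t *out_len)
{
    assert(buf != NULL && out != NULL && out_len != NULL);

    if (len < 4)
        return 0;                       /* header not complete yet */

    uint32_t n = ((uint32_t)buf[0] << 24) | ((uint32_t)buf[1] << 16) |
                 ((uint32_t)buf[2] << 8)  |  (uint32_t)buf[3];

    if (n > len - 4 || n > out_cap)
        return 0;                       /* payload incomplete or too big */

    memcpy(out, buf + 4, n);            /* the copy the profiler sees */
    *out_len = n;
    return 4 + (size_t)n;
}
```

A caller would fill a large buffer with a single read, then call extract_frame in a loop, advancing by the returned count until it returns 0.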

srking · Accepted Answer · 2011-08-14T06:08:55.443

That's a big issue that includes network stack efficiency, but I'll stick to your specific question about the instruction. What you propose is an asynchronous, non-blocking copy instruction rather than the synchronous, blocking memcpy available now using "rep movs".

Some architectural and practical problems:

1) The non-blocking memcpy must consume some physical resource, like a copy engine, with a lifetime potentially different from that of the corresponding operating system process. This is quite nasty for the OS. Let's say that thread A kicks off the memcpy right before a context switch to thread B. Thread B also wants to do a memcpy and is much higher priority than A. Must it wait for thread A's memcpy to finish? What if A's memcpy was 1000GB long? Providing more copy engines in the core defers but does not solve the problem. Basically this breaks the traditional role of the OS time quantum and scheduling.

2) In order to be general like most instructions, any code can issue the memcpy instruction at any time, without regard for what other processes have done or will do. The core must have some limit on the number of async memcpy operations in flight at any one time, so when the next process comes along, its memcpy may be at the end of an arbitrarily long backlog. The async copy lacks any kind of determinism, and developers would simply fall back to the old-fashioned synchronous copy.

3) Cache locality has a first-order impact on performance. A traditional copy of a buffer already in the L1 cache is incredibly fast and relatively power efficient, since at least the destination buffer remains local to the core's L1. In the case of a network copy, the copy from the kernel to a user buffer occurs just before handing the user buffer to the application. So the application enjoys L1 hits and excellent efficiency. If an async memcpy engine lived anywhere other than at the core, the copy operation would pull (snoop) lines away from the core, resulting in application cache misses. Net system efficiency would probably be much worse than today.

4) The async memcpy instruction must return some sort of token that identifies the copy, for use later to ask if the copy is done (requiring another instruction). Given the token, the core would need to perform some sort of complex context lookup regarding that particular pending or in-flight copy -- those kinds of operations are better handled by software than by core microcode. What if the OS needs to kill the process and mop up all the in-flight and pending memcpy operations? How does the OS know how many times a process used that instruction and which corresponding tokens belong to which process?

--- EDIT ---

5) Another problem: any copy engine outside the core must compete in raw copy performance with the core's bandwidth to cache, which is very high -- much higher than external memory bandwidth. For cache misses, the memory subsystem would bottleneck both sync and async memcpy equally. For any case in which at least some data is in cache, which is a good bet, the core will complete the copy faster than an external copy engine.
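To make points 2 and 4 concrete, here is a purely hypothetical software model of a token-based copy engine. No such instruction or API exists; the names copy_engine, async_copy_submit, async_copy_wait and the MAX_INFLIGHT limit are all invented for illustration, and memcpy stands in for the hardware. It only shows the shape of the interface: a bounded set of in-flight slots, a token per request, and a second call to complete it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_INFLIGHT 2              /* assumed engine limit (point 2) */

struct copy_req {
    void       *dst;
    const void *src;
    size_t      n;
    int         busy;               /* nonzero while the request is pending */
};

struct copy_engine {
    struct copy_req slots[MAX_INFLIGHT];
};

/* Queue a copy. Returns a token (slot index), or -1 if every slot is
 * taken, in which case the caller has to wait or copy synchronously. */
static int async_copy_submit(struct copy_engine *e,
                             void *dst, const void *src, size_t n)
{
    assert(e != NULL);
    for (int i = 0; i < MAX_INFLIGHT; i++) {
        if (!e->slots[i].busy) {
            e->slots[i] = (struct copy_req){ dst, src, n, 1 };
            return i;
        }
    }
    return -1;
}

/* Complete the copy named by token and free its slot. In this model the
 * copy is simply performed here; real hardware would have run it earlier. */
static void async_copy_wait(struct copy_engine *e, int token)
{
    assert(e != NULL && token >= 0 && token < MAX_INFLIGHT);
    struct copy_req *r = &e->slots[token];
    assert(r->busy);
    memcpy(r->dst, r->src, r->n);
    r->busy = 0;
}
```

Even in this toy form, the slot table is per-engine state that someone has to save, restore, and clean up when a process dies, which is the bookkeeping burden described in points 1 and 4.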

For the sake of argument, assume that this is handled not by a CPU instruction, but by a "memory coprocessor". The requests are made by a kernel driver (so there is no issue with cleanup when a process ends). Also, let's say that a request cannot cross page boundaries, so any request made by user code for larger than a page will be broken down into many serial requests. This will prevent large requests that could allow a low priority process to monopolize a resource. — Gabe, Aug 14 '11 at 05:27
@Gabe - First, a user-kernel-user transition automatically rules out usage for all but very large copies, because the fixed cost of getting to the kernel and back has to be amortized. Second, the OS must add the context spill and fill of the memory coprocessor to the context switch overhead. This further raises the fixed costs and thus further shrinks the use cases. There is no need to limit requests to 4KB boundaries, since the OS knows everything and can just freeze the coprocessor in flight. Even so, the coprocessor idea still suffers from problems (2), (3), and (5). — srking, Aug 14 '11 at 06:15
@Gabe, FYI, Intel has implemented DMA engines in chipsets: http://support.dell.com/support/edocs/network/IntelPRO/R167266/en/ioat.htm I can't seem to find a decent reference link, but it's under the banner of IOAT. — srking, Aug 14 '11 at 06:17
What I'm trying to say is that you wouldn't use an async operation for copying data structures that you're working on; you'd use it for copying large blocks of memory like network packets. In other words, you'd use it for memory operations that you want to *not* pollute the cache, so #3 is a pro rather than a con. It also means that #5 doesn't apply because you wouldn't use the async operation for copying data that you expect to already be in cache. — Gabe, Aug 17 '11 at 04:54
@Gabe - 1500 byte network packets aren't nearly large enough to make copy offload a win, but big non-polluting copies do have a place. The best opportunity for this is in the file system, where you might be dealing with 4KB+ size blocks. However, storage controllers all have their own DMA engines for copy offload. Check out the very efficient sendfile() system call for example. — srking, Aug 17 '11 at 05:23
A *single* 1500 byte packet isn't too big, but you can imagine that a scatter/gather operation that needs to move a whole bunch of data between TCP packet payloads and user buffers could use this. A copying garbage collector could also use such an operation. — Gabe, Aug 17 '11 at 06:16

Guy Sirton · Answer 2 · 2011-08-14T06:13:40.357

Memory to memory transfers used to be supported by the DMA controller in older PC architectures. Similar support exists in other architectures today (e.g. the TI DaVinci or OMAP processors).

The problem is that it eats into your memory bandwidth, which can be a bottleneck in many systems. As hinted by srking's answer, reading the data into the CPU's cache and then copying it around there can be a lot more efficient than memory-to-memory DMA. Even though the DMA may appear to work in the background, there will be bus contention with the CPU. No free lunches.

A better solution is some sort of zero-copy architecture where the buffer is shared between the application and the driver/hardware. That is, incoming network data is read directly into preallocated buffers and doesn't need to be copied, and outgoing data is read directly out of the application's buffers by the network hardware. I've seen this done in embedded/real-time network stacks.
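A minimal sketch of the zero-copy idea, assuming a fixed pool of preallocated buffers whose ownership passes between a driver and the application. All names here (pkt_pool, pool_claim, pool_publish, pool_release) and the sizes are hypothetical, and the model is single-threaded; a real stack would need synchronization and would usually hand buffer addresses to the hardware through descriptor rings.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define POOL_BUFS 4                 /* assumed pool size */
#define BUF_SIZE  2048              /* assumed per-buffer capacity */

enum owner { OWNER_FREE = 0, OWNER_DRIVER, OWNER_APP };

struct pkt_buf {
    enum owner owner;
    size_t     len;                 /* valid bytes in data */
    uint8_t    data[BUF_SIZE];
};

struct pkt_pool {
    struct pkt_buf bufs[POOL_BUFS];
};

/* Driver side: claim a free buffer to receive into, or NULL if none. */
static struct pkt_buf *pool_claim(struct pkt_pool *p)
{
    assert(p != NULL);
    for (size_t i = 0; i < POOL_BUFS; i++) {
        if (p->bufs[i].owner == OWNER_FREE) {
            p->bufs[i].owner = OWNER_DRIVER;
            p->bufs[i].len = 0;
            return &p->bufs[i];
        }
    }
    return NULL;
}

/* Driver side: reception finished; hand the same memory to the app. */
static void pool_publish(struct pkt_buf *b, size_t len)
{
    assert(b != NULL && b->owner == OWNER_DRIVER && len <= BUF_SIZE);
    b->len = len;
    b->owner = OWNER_APP;           /* only ownership moves, no bytes */
}

/* App side: payload consumed; return the buffer to the pool. */
static void pool_release(struct pkt_buf *b)
{
    assert(b != NULL && b->owner == OWNER_APP);
    b->owner = OWNER_FREE;
}
```

The application reads the payload in place through b->data, so no memcpy appears anywhere on the receive path.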

DigitalRoss · Answer 3 · 2011-08-14T21:38:38.593

Net Win?

It's not clear that implementing an asynchronous copy engine would help. The complexity of such a thing would add overhead that might cancel out the benefits, and it wouldn't be worth it just for the few programs that are memcpy()-bound.

Heavier User Context?

An implementation would either involve user context or per-core resources. One immediate issue is that because this is a potentially long-running operation it must allow interrupts and automatically resume.

And that means that if the implementation is part of the user context, it represents more state that must be saved on every context switch, or it must overlay existing state.

Overlaying existing state is exactly how the string move instructions work: they keep their parameters in the general registers. But if existing state is consumed then this state is not useful during the operation and one may as well then just use the string move instructions, which is how the memory copy functions actually work.
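As an illustration of how the string move instructions keep their parameters in general registers, here is a sketch using GCC/Clang inline assembly on x86. The function name copy_rep_movsb is hypothetical, and the code assumes the direction flag is clear, as the usual calling conventions require. On other compilers or architectures it just falls back to memcpy. Production memcpy implementations typically choose among several strategies, so treat this as one possible form rather than what any particular library does.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy n bytes with "rep movsb". The destination pointer lives in
 * (E/R)DI, the source pointer in (E/R)SI and the remaining count in
 * (E/R)CX, so an interrupted copy can resume from the register state. */
static void *copy_rep_movsb(void *dst, const void *src, size_t n)
{
    assert(dst != NULL && src != NULL);
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
    void *d = dst;
    const void *s = src;
    __asm__ volatile("rep movsb"
                     : "+D"(d), "+S"(s), "+c"(n)  /* updated in place */
                     :
                     : "memory");
    return dst;
#else
    return memcpy(dst, src, n);     /* portable fallback */
#endif
}
```

Note that all three registers are tied up for the duration of the copy, which is the "consumed state" trade-off described above.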

Or Distant Kernel Resource?

If it uses some sort of per-core state, then it has to be a kernel-managed resource. The consequent ring-crossing overhead (kernel trap and return) is quite expensive and would further limit the benefit or turn it into a penalty.

Idea! Have that super-fast CPU thing do it!

Another way to look at this is that there already is a highly tuned and very fast memory moving engine right at the center of all those rings of cache memories that must be kept coherent with the move results. That thing: the CPU. If the program needs to do it then why not apply that fast and elaborate piece of hardware to the problem?

Isn't it actually the *GPU* that is already a highly tuned and very fast memory moving engine? The CPU is a general purpose processor that just happens to be able to move bits around in memory. Graphics processors (the original 2D "graphics accelerators") were originally created for the purpose of moving bits around in memory (scrolling, drawing sprites) because general purpose CPUs are just not that good at it. — Gabe, Aug 17 '11 at 06:21
