I was looking over the Linux loopback and IP network data handling, and it seems that there is no code to cover the case where 2 CPUs on different sockets are passing data via the loopback.

I think it should be possible to detect this condition and then use hardware DMA, when available, to copy the data to the receiver and avoid NUMA contention.

My questions are:

  • Am I correct that this is not currently done in Linux?
  • Is my thinking that this is possible on the right track?
  • What kernel APIs or existing drivers should I study to help implement such a version of the loopback?
jxh
  • Why don't you use a [*Unix socket*](http://en.wikipedia.org/wiki/Unix_domain_socket)? – artless noise Apr 29 '15 at 13:54
  • @artlessnoise: Thanks for the suggestion! The source code for Unix domain sockets also shows a simple copy of the data when communicating with a different CPU. I would like to avoid the blocking nature of QPI and allow hardware-assisted DMA to perform the data transfer. – jxh Apr 29 '15 at 14:35
  • Hmm, I see. The socket needs to copy because each end of the socket is (most likely) a different process. Memory-to-memory DMA is actually not all that common to find in hardware. There is a [DMA infrastructure](https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/include/linux/dmaengine.h#n66) and it would need to be plugged into the network stack (a minimal sketch of driving that infrastructure follows after these comments). It makes more sense to me for *Unix sockets*. Another possibility is *COW*, but it depends on the use cases in each process. – artless noise Apr 29 '15 at 18:17
  • I see that most of this is implemented in [iov_iter.c](https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/lib/iov_iter.c). There is an additional problem that the memory in the process address spaces may not even be mapped (it may be in swap or paged out). A mem-to-mem DMA needs everything in physical RAM. Also, if you are actually using `memcpy()` you are fortunate, as `copy_from_user()` etc. usually have an even higher overhead. I am not sure whether x86 uses the DMA code; it might only be PPC and ARM. – artless noise Apr 29 '15 at 18:38
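
For concreteness, here is a minimal kernel-side sketch of driving that dmaengine framework for a single memory-to-memory copy. The helper name `dma_offload_copy` is made up for illustration, and real code would need `dma_mapping_error()` checks and asynchronous completion handling rather than a blocking wait:

```c
/*
 * Minimal sketch (names are illustrative): grab any memcpy-capable
 * dmaengine channel and offload one buffer-to-buffer copy to it.
 * Error handling is deliberately abbreviated.
 */
#include <linux/dmaengine.h>
#include <linux/dma-mapping.h>
#include <linux/errno.h>

static int dma_offload_copy(struct device *dev, void *dst, void *src, size_t len)
{
	dma_cap_mask_t mask;
	struct dma_chan *chan;
	struct dma_async_tx_descriptor *tx;
	dma_addr_t src_dma, dst_dma;

	dma_cap_zero(mask);
	dma_cap_set(DMA_MEMCPY, mask);          /* ask for a mem-to-mem engine */
	chan = dma_request_channel(mask, NULL, NULL);
	if (!chan)
		return -ENODEV;

	/* Both buffers must be resident and DMA-mapped, as noted above. */
	src_dma = dma_map_single(dev, src, len, DMA_TO_DEVICE);
	dst_dma = dma_map_single(dev, dst, len, DMA_FROM_DEVICE);

	tx = dmaengine_prep_dma_memcpy(chan, dst_dma, src_dma, len,
				       DMA_PREP_INTERRUPT);
	if (tx) {
		dma_cookie_t cookie = dmaengine_submit(tx);

		dma_async_issue_pending(chan);
		dma_sync_wait(chan, cookie);    /* block until the copy is done */
	}

	dma_unmap_single(dev, dst_dma, len, DMA_FROM_DEVICE);
	dma_unmap_single(dev, src_dma, len, DMA_TO_DEVICE);
	dma_release_channel(chan);
	return tx ? 0 : -EIO;
}
```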

1 Answer


There are several projects and attempts to add interfaces to memory-to-memory DMA engines, intended for use in HPC (MPI):

KNEM may use the Intel I/OAT DMA engine on some microarchitectures and for some transfer sizes:

> *I/OAT copy offload through DMA Engine.* One interesting asynchronous feature is certainly I/OAT copy offload.
>
> `icopy.flags = KNEM_FLAG_DMA;`
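
A rough user-space sketch of what such a DMA-offloaded KNEM copy looks like, based on the ioctl interface described in the KNEM documentation; the exact struct, field, and flag names (and the `knem_io.h` header) are assumptions here and should be checked against the installed KNEM version:

```c
/*
 * Sketch only: ioctl-level KNEM usage as described in the KNEM docs.
 * The receiver declares a region; the sender copies into it and asks
 * for I/OAT DMA offload. Error handling is abbreviated.
 */
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <knem_io.h>

/* Receiver: declare a region and hand its cookie to the sender. */
knem_cookie_t declare_region(int knem_fd, void *buf, size_t len)
{
	struct knem_cmd_param_iovec iov = { .base = (uintptr_t) buf, .len = len };
	struct knem_cmd_create_region create;

	memset(&create, 0, sizeof(create));
	create.iovec_array = (uintptr_t) &iov;
	create.iovec_nr = 1;
	create.protection = PROT_WRITE;
	create.flags = KNEM_FLAG_SINGLEUSE;   /* region destroyed after first use */
	ioctl(knem_fd, KNEM_CMD_CREATE_REGION, &create);
	return create.cookie;
}

/* Sender: write into the remote region, requesting the DMA engine. */
int dma_copy_to_region(int knem_fd, knem_cookie_t cookie, void *buf, size_t len)
{
	struct knem_cmd_param_iovec iov = { .base = (uintptr_t) buf, .len = len };
	struct knem_cmd_inline_copy icopy;

	memset(&icopy, 0, sizeof(icopy));
	icopy.local_iovec_array = (uintptr_t) &iov;
	icopy.local_iovec_nr = 1;
	icopy.remote_cookie = cookie;
	icopy.remote_offset = 0;
	icopy.write = 1;                      /* copy local -> remote */
	icopy.flags = KNEM_FLAG_DMA;          /* request the I/OAT engine */
	return ioctl(knem_fd, KNEM_CMD_INLINE_COPY, &icopy);
}
```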

Some authors say that the hardware DMA engine has no benefit on newer Intel microarchitectures:

http://www.ipdps.org/ipdps2010/ipdps2010-slides/CAC/slides_cac_Mor10OptMPICom.pdf

I/OAT is only useful on obsolete architectures.

CMA (Cross Memory Attach) was announced as a project similar to KNEM: http://www.open-mpi.org/community/lists/devel/2012/01/10208.php

> These system calls were designed to permit fast message passing by allowing messages to be exchanged with a single copy operation (rather than the double copy that would be required when using, for example, shared memory or pipes).
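
For illustration, a small user-space sketch of that single-copy exchange with CMA's `process_vm_readv()`; the `cma_read` helper name and the way the remote pid and address are obtained are assumptions, not part of the announcement:

```c
/*
 * Sketch: read a buffer straight out of another process's address space
 * with one copy. The remote pid and address would be exchanged out of
 * band (e.g. over a pipe); both are assumed known here.
 */
#define _GNU_SOURCE
#include <sys/types.h>
#include <sys/uio.h>

ssize_t cma_read(pid_t remote_pid, void *remote_addr, void *local_buf, size_t len)
{
	struct iovec local  = { .iov_base = local_buf,   .iov_len = len };
	struct iovec remote = { .iov_base = remote_addr, .iov_len = len };

	/* One syscall, one copy: the kernel moves the data directly from
	 * the remote process's pages into local_buf. */
	return process_vm_readv(remote_pid, &local, 1, &remote, 1, 0);
}
```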

If you can, you should avoid sockets (especially TCP sockets) for transferring data on a single machine: they carry high software overhead that is not needed there. The standard skb size limit may also be too small for I/OAT to be used effectively, so the network stack probably will not use I/OAT.

osgx