
When I use Intel I/OAT for zero-copy DMA transfers via async_memcpy (i.e., without spending CPU cycles), where is the device memory mapped in the virtual address space: to a kernel buffer (kernel space) or to a user buffer (user space)?

And does it make any sense to use I/OAT on modern x86_64 CPUs, where a CPU core can access RAM quickly without going through the chipset's north bridge?

http://www.intel.com/content/www/us/en/wireless-network/accel-technology.html

Alex

1 Answer


Given that the transfer operates on physical memory, it can be any memory that the kernel can address, including both kernel buffers and user-space buffers. It does however have to be "pinned" or "locked", so that the memory doesn't get taken away (e.g. someone calling free on the memory should not release it back to the OS for reassignment to another process, because you could get very interesting effects if that happens). These are of course the same rules that apply to other kinds of DMA access.

I doubt very much this helps in copying data structures for your average user-mode application. On the other hand, I don't believe Intel would put these sorts of features into the processor unless they thought they were beneficial in some way. The way I understand it is that it's helpful for copying the network receive buffer into the user-mode application that is receiving the data, with less CPU involvement. It doesn't necessarily speed up the actual memory transfer much (if at all), but it frees the CPU up to do other things.
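The offload flow described above went through the Linux dmaengine framework, which is what exposes async_memcpy-style copies to kernel clients. A rough sketch of how a kernel client hands a copy to an engine such as I/OAT (this is kernel-module code, not buildable as a stand-alone program; channel selection and error handling are trimmed for brevity, and a real client would complete asynchronously via a callback rather than busy-wait):

```c
/* Sketch of a dmaengine memcpy offload from kernel code. */
#include <linux/dmaengine.h>

static void offloaded_copy(struct device *dev, void *dst, void *src, size_t len)
{
    dma_cap_mask_t mask;
    struct dma_chan *chan;
    struct dma_async_tx_descriptor *tx;
    dma_addr_t dma_src, dma_dst;
    dma_cookie_t cookie;

    /* Ask for any channel capable of plain memcpy (e.g. an I/OAT channel). */
    dma_cap_zero(mask);
    dma_cap_set(DMA_MEMCPY, mask);
    chan = dma_request_channel(mask, NULL, NULL);

    /* Map the (pinned) buffers for the device. */
    dma_src = dma_map_single(dev, src, len, DMA_TO_DEVICE);
    dma_dst = dma_map_single(dev, dst, len, DMA_FROM_DEVICE);

    /* Queue the copy on the engine; the CPU is free while it runs. */
    tx = dmaengine_prep_dma_memcpy(chan, dma_dst, dma_src, len,
                                   DMA_PREP_INTERRUPT);
    cookie = dmaengine_submit(tx);
    dma_async_issue_pending(chan);

    /* Busy-wait for completion (illustration only). */
    dma_sync_wait(chan, cookie);
    dma_release_channel(chan);
}
```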

I'm pretty sure I saw something not so long ago about this technology [or something very similar] also going into the latest models of processors, so I expect there is some advantage to it.

Mats Petersson
  • Thanks. As I understand it, the best practice with this technology is: the device buffer is mapped to a kernel buffer, and the copy from kernel buffer to user buffer is done via DMA (I/OAT) instead of by a CPU core. This avoids interrupting (IRQ) the CPU cores and offloads them. And is "pinned" meant to prevent page faults, and "locked" to prevent `free()`? But how is this memory-mapped region created — by the OS, or by the I/OAT driver using `mmap()` — and how can I create such a memory-mapped region myself? – Alex Sep 01 '13 at 19:59
  • The terms "locked" and "pinned" are used more or less interchangeably, but "locked" is more of a Windows term, "pinned" more of a Linux term (though it's far from consistent). It basically means "even if the user-mode app frees this memory, don't ACTUALLY free it". It also means "don't remove this page from physical memory" (in other words, "no page fault on this memory"). [Note that you don't get a page fault when "the physical memory changed from being used by application X to being file-system storage for directory Y" — so if you happen to have told the DMA to write to that page, goodbye files.] – Mats Petersson Sep 01 '13 at 20:03
  • 1
    I don't know what technique is used for this memory allocation, but `mmap` is quite popular for allocating user-mode memory that can be used by the kernel, because it's automatically page-aligned and has all the relevant kernel data structures available (or easy to find). – Mats Petersson Sep 01 '13 at 20:05
  • And can I use `mmap()` for allocate memory mapped region to use it with Intel I/OAT? – Alex Sep 01 '13 at 20:18
  • Without looking at all the source code involved, I'm not going to say 100% yes, but I don't see any reason why it wouldn't work. – Mats Petersson Sep 01 '13 at 20:20
  • 1
    +1. Note that the usual way to "pin" a user-space buffer is via `get_user_pages` (which works on any memory allocated by the user space). The other way, as you mention, is to implement `mmap` for your device driver to let it hand "pinned" memory to user space processes. @MatsPetersson: I am a little skeptical of these gadgets myself, but maybe because my problem is usually total memory bandwidth... – Nemo Sep 01 '13 at 22:50
  • @Nemo: Yes, of course, it won't give you more memory bandwidth, and if the CPU is slurping in data at full speed, there will still be the same contention on the memory bus, whether `memcpy` or `async_memcpy` is used for the copy of the data. It only helps if the reason the CPU is "busy" is that the data takes a fair bit of CPU time to copy (in particular, large packets on high-Gb/s devices). – Mats Petersson Sep 01 '13 at 23:39