
I am struggling to understand the relationship between libibverbs and librxe and the low-level kernel driver for the HCA.

Specifically, I have the following doubts:

  • When a packet arrives on the HCA, the low-level kernel driver passes the packet to the userspace application. There is a memory copy involved here. In this picture, where do libibverbs and librxe sit?
  • Similarly, a send command issued by the user should be able to talk directly to the hardware via the low-level driver. Why are the userspace libraries needed in this case?
byslexia

3 Answers


The InfiniBand verbs implementation consists of roughly four components:

  • a vendor-specific kernel module (e.g. ib_mthca for Mellanox devices)
  • a kernel module that allows verbs access from userspace (ib_uverbs)
  • a user-space vendor driver library (e.g. libmthca)
  • a glue component between the previous two (libibverbs)
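
To make the split between the glue library and the vendor library concrete, here is a toy model in C. All names (`toy_provider_ops`, `toy_generic_post_send`, the two providers) are hypothetical and this is not the real libibverbs code, but the shape is similar: in libibverbs, `ibv_post_send()` is an inline function that simply calls `qp->context->ops.post_send`, a function pointer that the vendor library filled in when the device was opened.

```c
#include <assert.h>
#include <stddef.h>

struct toy_context;

/* Hypothetical per-provider operation table (one per vendor library). */
struct toy_provider_ops {
    const char *name;
    int (*post_send)(struct toy_context *ctx, const void *buf, size_t len);
};

struct toy_context {
    const struct toy_provider_ops *ops; /* chosen once, when the device is opened */
    size_t bytes_posted;
    unsigned doorbell_writes; /* "hardware" provider: writes a device register */
    unsigned kernel_calls;    /* "software" provider: asks a kernel module to do the work */
};

/* Hypothetical hardware provider: would build a descriptor and ring a doorbell. */
static int hw_post_send(struct toy_context *ctx, const void *buf, size_t len)
{
    (void)buf; /* a real provider would put buf's address into the descriptor */
    ctx->bytes_posted += len;
    ctx->doorbell_writes++;
    return 0;
}

/* Hypothetical software provider: would notify a kernel module instead. */
static int sw_post_send(struct toy_context *ctx, const void *buf, size_t len)
{
    (void)buf;
    ctx->bytes_posted += len;
    ctx->kernel_calls++;
    return 0;
}

static const struct toy_provider_ops hw_provider = { "toy_hw", hw_post_send };
static const struct toy_provider_ops sw_provider = { "toy_sw", sw_post_send };

/* Generic layer: binds a context to a provider's table. */
static void toy_open_device(struct toy_context *ctx, const struct toy_provider_ops *ops)
{
    assert(ops != NULL && ops->post_send != NULL);
    ctx->ops = ops;
    ctx->bytes_posted = 0;
    ctx->doorbell_writes = 0;
    ctx->kernel_calls = 0;
}

/* Generic layer: only forwards the call; it knows nothing about the device. */
static int toy_generic_post_send(struct toy_context *ctx, const void *buf, size_t len)
{
    return ctx->ops->post_send(ctx, buf, len);
}
```

The generic layer never needs to know how a particular device is driven; switching providers only changes which table the context points to.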

In general, InfiniBand supports two semantics: packet-based operation (send/receive) and remote DMA. Whatever the mode of operation, both implement zero-copy by reading from and writing to the application buffer(s) directly. This is done (as already explained by haggai_e) by fixing the buffer in physical memory (also called registering), which prevents the virtual memory manager from swapping it out to disk or moving it around in physical RAM. A very nice feature of InfiniBand is that each HCA has its own virtual-to-physical address translation engine, which allows userspace pointers to be passed directly to the hardware.
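
The translation engine can be pictured roughly like this. It is a minimal sketch, assuming a single registered region that starts on a page boundary and made-up physical page addresses; the struct and function names are hypothetical, and real HCAs keep this information in vendor-specific tables that the kernel driver fills in at registration time.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_SIZE 4096u
#define TOY_MAX_PAGES 16

/* Hypothetical translation record for one registered region: a virtual
 * range plus the physical page backing each virtual page. */
struct toy_mr {
    uint64_t va_start;                  /* assumed page-aligned */
    uint64_t length;                    /* bytes */
    uint64_t phys_pages[TOY_MAX_PAGES]; /* recorded when the pages were pinned */
    size_t npages;
};

/* Roughly what the HCA does on each access: virtual -> physical address.
 * Returns 0 and stores the result in *pa, or -1 outside the region. */
static int toy_translate(const struct toy_mr *mr, uint64_t va, uint64_t *pa)
{
    assert(mr->va_start % TOY_PAGE_SIZE == 0);
    if (va < mr->va_start || va - mr->va_start >= mr->length)
        return -1; /* a real HCA would report a protection error */
    uint64_t off = va - mr->va_start;
    size_t page = (size_t)(off / TOY_PAGE_SIZE);
    if (page >= mr->npages)
        return -1;
    /* Physical pages need not be contiguous, hence the per-page table. */
    *pa = mr->phys_pages[page] + off % TOY_PAGE_SIZE;
    return 0;
}
```

Because the device does this lookup itself, the userspace library can hand it the application's virtual address unchanged.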

The reason to have a user-level driver is that verbs directly exposes the HCA's hardware registers to userspace, and since each HCA has a different set of registers, an intermediate vendor-specific userspace layer is needed. Of course, it could be implemented entirely in the kernel, with a single vendor-independent userspace library on top, but InfiniBand tries very hard to provide the lowest possible latency, and going through the kernel on every operation would be very expensive. Because RDMA devices can translate virtual addresses on their own, the userspace library does not have to go through the kernel to obtain the physical address of a buffer when it creates entries in the work queues (part of the mechanism verbs uses to send and receive data).
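
A rough sketch of what "creating entries in the work queues" looks like from userspace, assuming a hypothetical descriptor layout (`toy_wqe`) and a doorbell that, on real hardware, would be a device register mapped into the process with mmap(). The point is that posting is just memory writes: no system call and no physical address lookup, because the descriptor carries the user virtual address and the key of the registered region.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_SQ_DEPTH 8u /* power of two, so an index mask can be used */

/* Hypothetical work-queue entry; every vendor defines its own layout. */
struct toy_wqe {
    uint64_t addr;   /* user virtual address: the HCA translates it itself */
    uint32_t length; /* bytes to send */
    uint32_t lkey;   /* key of the registered region containing addr */
};

struct toy_sq {
    struct toy_wqe ring[TOY_SQ_DEPTH]; /* real HW: memory the HCA reads via DMA */
    uint32_t head;                     /* producer index, advanced by userspace */
    uint32_t tail;                     /* consumer index, advanced on completion */
    volatile uint32_t *doorbell;       /* real HW: mmap()ed device register */
};

/* Posts one send entirely from userspace: write the descriptor, then ring
 * the doorbell. Returns 0 on success or -1 if the queue is full. */
static int toy_sq_post(struct toy_sq *sq, uint64_t addr, uint32_t len, uint32_t lkey)
{
    assert(sq != NULL && sq->doorbell != NULL);
    if (sq->head - sq->tail == TOY_SQ_DEPTH)
        return -1;
    struct toy_wqe *wqe = &sq->ring[sq->head & (TOY_SQ_DEPTH - 1)];
    wqe->addr = addr;
    wqe->length = len;
    wqe->lkey = lkey;
    sq->head++;
    /* Real drivers use a write barrier so the device never sees the
     * doorbell before the descriptor; a C11 fence stands in for it here. */
    atomic_thread_fence(memory_order_release);
    *sq->doorbell = sq->head;
    return 0;
}
```

In this model `tail` would be advanced when completions are polled; until then, a full ring makes `toy_sq_post` fail rather than overwrite entries the device may still be reading.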

Note that there are basically two vendor libraries: one in the kernel and one in userspace. The former provides verbs functionality to other kernel modules, such as file systems (e.g. Lustre) or network protocol drivers (e.g. IP-over-InfiniBand), while the latter provides that functionality in userspace. Some operations, e.g. registering memory or opening/closing device contexts, cannot be done entirely in userspace; libibverbs transparently passes those to the kernel module.
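
As a rough rule of thumb for a hardware provider, setup verbs go through ib_uverbs while data-path verbs stay in userspace. The table below is a simplification, not an authoritative list, and `toy_via_kernel` is a hypothetical helper; a software provider such as librxe may still enter the kernel on the data path, since there is no hardware doorbell to ring.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Simplified classification for a typical hardware provider.
 * via_kernel = 1 means the call goes through ib_uverbs. */
struct verb_path {
    const char *verb;
    int via_kernel;
};

static const struct verb_path toy_paths[] = {
    { "ibv_open_device", 1 }, /* creates the device context */
    { "ibv_alloc_pd",    1 },
    { "ibv_reg_mr",      1 }, /* pins pages, sets up the HCA's translation */
    { "ibv_create_qp",   1 },
    { "ibv_post_send",   0 }, /* descriptor + doorbell written from userspace */
    { "ibv_post_recv",   0 },
    { "ibv_poll_cq",     0 }, /* completions read from memory shared with the HCA */
};

/* Returns 1 or 0 for a verb in the table, -1 for an unknown name. */
static int toy_via_kernel(const char *verb)
{
    assert(verb != NULL);
    for (size_t i = 0; i < sizeof toy_paths / sizeof toy_paths[0]; i++)
        if (strcmp(toy_paths[i].verb, verb) == 0)
            return toy_paths[i].via_kernel;
    return -1;
}
```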

Although RDMA over Converged Ethernet (RoCE) is technically not InfiniBand at the hardware level (librxe is the userspace provider for Soft-RoCE, a software implementation of RoCE), the OpenFabrics stack is designed to support RDMA-capable hardware other than InfiniBand HCAs, including RoCE and iWARP adapters.

See this summary from Intel on the topic of accessing InfiniBand on Linux for more details.

Hristo Iliev
  • Just a note about RoCE: while it isn't InfiniBand, it is pretty much InfiniBand transport sent over Ethernet, and it is in fact defined as part of the InfiniBand Architecture Specifications. – haggai_e Jun 30 '14 at 08:57
  • @haggai_e, thanks for the note - I've edited the text accordingly. – Hristo Iliev Jun 30 '14 at 10:19
  • Thank you very much for the detailed answer. I have accepted it! – byslexia Jul 01 '14 at 05:33

I'm not familiar with the librxe driver specifically, but in general libibverbs handles requests from an application or middleware library using it and forwards its calls to a provider library such as librxe. Provider libraries also use internal APIs in libibverbs to pass commands to the RDMA kernel modules (through the ib_uverbs module).

The RDMA stack is defined this way in order to allow direct hardware access from user-space.

EDIT: Following your comment, I'll try to explain how the copy from userspace to kernel and vice versa is bypassed.

An application using libibverbs registers a memory region using the ibv_reg_mr function. This function invokes kernel commands to pin down the physical memory pages backing the virtual memory region passed to ibv_reg_mr. Afterwards, the HCA (set up by the kernel driver) can access these pages directly via DMA, without copying the data.
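
A toy model of why registration avoids the copy: registration records where the buffer lives and hands out a key, and the adapter later reads the application's memory directly at transfer time. The names are hypothetical; the real call is `ibv_reg_mr(pd, addr, length, access)`, which returns a `struct ibv_mr *` carrying `lkey` and `rkey`, and the pinning happens in the kernel, not in code like this. The memcpy in `toy_dma_read` stands in for the adapter's DMA engine, not for a CPU copy through a kernel buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical memory-region record: it remembers *where* the buffer is;
 * it does not take a copy of its contents. */
struct toy_mr {
    const unsigned char *addr;
    size_t length;
    uint32_t lkey;
};

static uint32_t toy_next_key = 1;

/* Stand-in for ibv_reg_mr(): in the real stack the kernel pins the pages
 * here; this sketch only records the range and hands out a key. */
static void toy_reg_mr(struct toy_mr *mr, const void *addr, size_t length)
{
    assert(addr != NULL && length > 0);
    mr->addr = addr;
    mr->length = length;
    mr->lkey = toy_next_key++;
}

/* Stand-in for the adapter's DMA engine: reads straight from the
 * application's buffer into the outgoing "wire" buffer. Returns the number
 * of bytes transferred, or 0 on a key or bounds violation. */
static size_t toy_dma_read(const struct toy_mr *mr, uint32_t lkey, size_t offset,
                           void *wire, size_t n)
{
    if (lkey != mr->lkey || offset > mr->length || n > mr->length - offset)
        return 0;
    memcpy(wire, mr->addr + offset, n);
    return n;
}
```

Because nothing is copied at registration time, changing the buffer afterwards changes what the "adapter" sends. This is also why an application must not modify a send buffer until the corresponding work request has completed.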

haggai_e
  • Thank you very much for the answer. It certainly helped to clarify the role of `libibverbs` for me. I am still confused about how this model will bypass the copy from kernel to userspace (in case that point was not clear in my question). I will look into this more and add to the answer, if I find it. – byslexia Jun 26 '14 at 01:24
  • Sure. I've edited my answer and tried to add an explanation about zero-copy. – haggai_e Jun 26 '14 at 06:57

User app -> libibverbs -> librxe (Soft-RoCE) -> ib_uverbs.ko -> ib_core.ko -> rdma_rxe.ko -> adapter

This is the path of the control channel, which tells the adapter where in user space to DMA from. Once that is done, the adapter transfers the data to the remote end with zero-copy DMA.

Alok Prasad