How is splice() actually doing zero-copy in Linux?

Question

I'm new to the concept of zero-copy, but from what I understand, it is a way to avoid copying data from kernel buffers to user buffers and instead pass it directly between two kernel buffers. That way, the CPU does not have to make two copies of the data (kernel buffer to user buffer, then back to a kernel buffer). The CPU only copies the data between the two kernel buffers, which reduces the number of CPU copies to one. In some cases, with Linux 2.4 and above, the data doesn't even have to be duplicated in the kernel buffers: only the location and length of the data to be transferred are passed to the socket buffer, and DMA does the copying. Hence the name zero-copy.

Two ways to do zero-copy in Linux are via sendfile() or via splice() syscalls.

While sendfile() has the inherent limitation of copying data only from the page cache of a file to a socket buffer, splice() has no such limitation. The catch is that with splice() at least one of the two file descriptors must be a pipe. So the kernel has to first copy the data from the source file descriptor to the pipe and then copy it from the pipe to the destination kernel buffer. That makes two copies by the CPU.

So my questions are:

How, then, does splice() solve our original problem of reducing the number of copies done by the CPU?
Is zero-copy only possible between a socket and a file (and vice versa), and not file to file or socket to socket?

Rachid K. · Answer 1 · 2022-09-09T06:48:08.670

If you look at the NOTES section of the splice() manual page, it says that "actual copies are generally avoided":

Though we talk of copying, actual copies are generally avoided. The kernel does this by implementing a pipe buffer as a set of reference-counted pointers to pages of kernel memory. The kernel creates "copies" of pages in a buffer by creating new pointers (for the output buffer) referring to the pages, and increasing the reference counts for the pages: only pointers are copied, not the pages of the buffer.

Concerning sendfile(), the manual states that the requirement that out_fd be a socket no longer applies since Linux 2.6.33:

In Linux kernels before 2.6.33, out_fd must refer to a socket. Since Linux 2.6.33 it can be any file. If it is a regular file, then sendfile() changes the file offset appropriately.
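
As a rough illustration of the file-to-file case, the sketch below uses Python's os.sendfile() wrapper with a regular file as out_fd (this assumes Linux 2.6.33 or later). The helper name sendfile_copy is an assumption made for the example. It also reads back the output descriptor's offset to show that sendfile() advanced it.

```python
import os
import tempfile

def sendfile_copy(src_path, dst_path):
    """Copy a file with sendfile(); return (bytes_copied, dst_file_offset).

    Hypothetical helper for illustration; out_fd here is a regular file,
    not a socket.
    """
    src = os.open(src_path, os.O_RDONLY)
    dst = os.open(dst_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        remaining = os.fstat(src).st_size
        offset = 0
        while remaining > 0:
            # os.sendfile(out_fd, in_fd, offset, count) returns bytes sent.
            sent = os.sendfile(dst, src, offset, remaining)
            if sent == 0:
                break
            offset += sent
            remaining -= sent
        # The kernel moved dst's file offset as it wrote.
        return offset, os.lseek(dst, 0, os.SEEK_CUR)
    finally:
        os.close(src)
        os.close(dst)

# Small demo on temporary files.
with tempfile.TemporaryDirectory() as tmp:
    src_file = os.path.join(tmp, "in.bin")
    dst_file = os.path.join(tmp, "out.bin")
    data = os.urandom(150_000)
    with open(src_file, "wb") as f:
        f.write(data)
    sent_total, dst_offset = sendfile_copy(src_file, dst_file)
    with open(dst_file, "rb") as f:
        copied_data = f.read()

print(sent_total, dst_offset, copied_data == data)
```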

edited Sep 09 '22 at 06:48

answered Sep 09 '22 at 06:40

Rachid K.


So from what I understand, the kernel divides the data into a number of pages such that (number of pages) * (page size) >= total size of the data. Each page in the kernel has a pointer, and what the CPU does is copy the pointers from one place to another using the pipe buffer. Going to the next pointer is like referring to the next page. The DMA chip actually does the copying from kernel page to kernel page. Is this the correct way to think about it? – Aritra Sur Roy Sep 09 '22 at 19:03
@AritraSurRoy: There is no copying of data when pages are appended to or removed from a pipe buffer, so DMA is not involved there. DMA comes into play when the input FD refers to a disk file whose blocks are not already present in the page cache and so must be faulted in from disk or when the output FD refers to a disk file whose dirtied blocks need to be evicted from the page cache to free up RAM. In other words it's the interface between the page cache and the storage controller where DMA is occurring. The pipe splicing operations are unconcerned with that. – Matt Whitlock Jun 26 '23 at 22:49
