
Is there a portable way to discard a number of incoming bytes from a socket without copying them to userspace? On a regular file, I could use lseek(), but on a socket, it's not possible. I have two scenarios where I might need it:

  1. A stream of records is arriving on a file descriptor (which can be a TCP socket, a SOCK_STREAM type UNIX domain socket, or potentially a pipe). Each record is preceded by a fixed-size header specifying its type and length, followed by data of variable length. I want to read the header first, and if it is not of a type I am interested in, just discard the following data segment without transferring it into a dummy buffer in user space.

  2. A stream of records of varying and unpredictable length is arriving on a file descriptor. Due to the asynchronous nature of the arrival, a record may still be incomplete when the fd becomes readable, or it may be complete but a piece of the next record may already be there when I try to read a fixed number of bytes into a buffer. I want to stop reading the fd at the exact boundary between records so I don't need to manage partially loaded records I accidentally read from the fd. So, I use recv() with the MSG_PEEK flag to read into a buffer, parse the record to determine its completeness and length, and then read again properly (thus actually removing the data from the socket) to the exact length; a sketch of this approach is below. This copies the data twice, and I want to avoid that by simply discarding the exact amount of data buffered in the socket.
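For illustration, this is roughly the shape of my current peek-then-consume approach for case 2 (the record header layout here is simplified and hypothetical, and error handling is minimal):

```c
/* Rough sketch of the current approach: peek, parse, then consume exactly
 * one record. The header layout is hypothetical; the two recv() calls are
 * the double copy I would like to avoid. */
#include <stdint.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>

struct rec_header {        /* illustrative fixed-size record header */
    uint16_t type;
    uint32_t length;       /* number of payload bytes after the header */
};

static ssize_t read_one_record(int fd, char *buf, size_t bufsize)
{
    /* First copy: look at the buffered data without consuming it. */
    ssize_t n = recv(fd, buf, bufsize, MSG_PEEK);
    if (n < 0)
        return -1;                         /* error, or EAGAIN if non-blocking */
    if ((size_t)n < sizeof(struct rec_header))
        return 0;                          /* header not complete yet */

    struct rec_header hdr;
    memcpy(&hdr, buf, sizeof hdr);
    size_t total = sizeof hdr + hdr.length;
    if ((size_t)n < total || total > bufsize)
        return 0;                          /* record not complete yet
                                              (caller must size buf for the
                                               largest possible record) */

    /* Second copy: consume exactly one record from the socket. */
    return recv(fd, buf, total, MSG_WAITALL);
}
```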

On Linux, I gather it is possible to achieve this by using splice() to redirect the data to /dev/null without copying it to userspace. However, splice() is Linux-only, and the similar sendfile(), which is supported on more platforms, can't use a socket as its input. My questions are:

  1. Is there a portable way to achieve this? Something that would also work on other UNIXes (primarily Solaris) that do not have splice()?

  2. Is splice()-ing into /dev/null an efficient way to do this on Linux, or would it be a waste of effort?

Ideally, I would love to have a ssize_t discard(int fd, size_t count) that simply removes count readable bytes from a file descriptor fd inside the kernel (i.e. without copying anything to userspace), blocks on a blocking fd until the requested number of bytes has been discarded, and returns the number of successfully discarded bytes, or fails with EAGAIN on a non-blocking fd, just like read() would. And advances the seek position on a regular file, of course :)
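To make it concrete, the portable fallback I would otherwise have to write is just a read() loop through a scratch buffer, i.e. exactly the extra copy I am trying to avoid (sketch only, with the usual caveats about error handling):

```c
/* Hypothetical discard(): drop count readable bytes from fd. Portable,
 * but it still copies everything into a throwaway userspace buffer. */
#include <errno.h>
#include <unistd.h>

static ssize_t discard(int fd, size_t count)
{
    char scratch[65536];                   /* throwaway buffer, size arbitrary */
    size_t dropped = 0;

    while (dropped < count) {
        size_t chunk = count - dropped;
        if (chunk > sizeof scratch)
            chunk = sizeof scratch;

        ssize_t n = read(fd, scratch, chunk);
        if (n > 0) {
            dropped += (size_t)n;
        } else if (n == 0) {
            break;                         /* EOF before count bytes were dropped */
        } else if (errno == EINTR) {
            continue;                      /* interrupted, just retry */
        } else {
            /* EAGAIN/EWOULDBLOCK on a non-blocking fd, or a real error. */
            return dropped ? (ssize_t)dropped : -1;
        }
    }
    return (ssize_t)dropped;
}
```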

Angro
  • Err, just read them? – user207421 Jul 06 '18 at 23:20
  • There isn't a portable alternative to reading that many bytes. I doubt you'll spot the overhead of reading them into a user-space buffer, nuisance though you regard it as being. Certainly, you should measure that there is a problem before expending effort on it — and when you expend effort, expect to have to use platform-specific solutions with `read()` as the fallback. If you know how much to skip, make sure you use multi-byte `read()` calls, and that the `read()` calls return as many bytes as you expect (loop until you reach the required number if they return fewer). – Jonathan Leffler Jul 06 '18 at 23:34
  • @JonathanLeffler: Your second sentence in the above comment is exactly why I wrote my answer. In addition, I tried to show the better approaches, at the general idea level. – Nominal Animal Jul 06 '18 at 23:47
  • Splicing to /dev/null is an interesting theoretical question. I too wonder how (in)efficient it is. – user185953 Mar 03 '21 at 09:46

1 Answer


The short answer is No, there is no portable way to do that.

The sendfile() approach is Linux-specific, because on most other OSes that implement it, the source must be a file or a shared memory object. (I haven't even checked whether, or in which Linux kernel versions, sendfile() from a socket descriptor to /dev/null is supported. I would be very suspicious of code that does that, to be honest.)

Looking at e.g. Linux kernel sources, and considering how little a ssize_t discard(fd, len) differs from a standard ssize_t read(fd, buf, len), it is obviously possible to add such support. One could even add it via an ioctl (say, SIOCISKIP) for easy support detection.

However, the problem is that you have designed an inefficient approach, and rather than fix the approach at the algorithmic level, you are looking for crutches that would make your approach perform better.

You see, it is very hard to show a case where the "extra copy" (from kernel buffers to userspace buffers) is an actual performance bottleneck. The number of syscalls (context switches between userspace and kernel space) sometimes is. If you sent a patch upstream implementing e.g. ioctl(socketfd, SIOCISKIP, bytes) for TCP and/or Unix domain stream sockets, the kernel developers would point out that the performance increase it hopes to achieve is better obtained by not trying to obtain the data you don't need in the first place. (In other words, the way you are trying to do things is inherently inefficient, and rather than create crutches to make that approach work better, you should just choose a better-performing approach.)

In your first case, a process receiving structured data framed by a type and length identifier, and wishing to skip unneeded frames, is better served by fixing the transfer protocol. For example, the receiving side could inform the sending side which frames it is interested in (i.e., a basic filtering approach; a sketch of what such a control message could look like follows this paragraph). If you are stuck with a stupid protocol that you cannot replace for external reasons, you're on your own. (The FLOSS developer community is not, and should not be, responsible for maintaining stupid decisions just because someone wails about it. Anyone is free to do so, but they'd need to do it in a manner that does not require extra work from others.)
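Purely as an illustration (nothing here is an existing wire format; the names are made up), such a filtering handshake can be as small as one control frame sent by the receiver at connection setup:

```c
/* Hypothetical control frame: the receiver sends this once, and the sender
 * then transmits only records whose type appears in wanted_types[]. */
#include <stdint.h>

struct subscribe_frame {
    uint16_t type;              /* a reserved record type, e.g. "subscribe" */
    uint32_t length;            /* size of wanted_types[] in bytes */
    uint16_t wanted_types[];    /* record types the receiver wants to get */
};
```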

In your second case, you already read your data. Don't do that. Instead, use a userspace buffer large enough to hold two full-size frames. Whenever you need more data, but the start of the current frame is already past the midway point of the buffer, first memmove() the frame so it starts at the beginning of the buffer.

When you have a partially read frame, and N unread bytes of it left that you are not interested in, read them into the unused portion of the buffer. There is always enough room, because you can overwrite the portion already used by the current frame, and its beginning is always within the first half of the buffer.

If the frames are small, say 65536 bytes maximum, you should make the maximum buffer size a tunable. On most desktop and server machines, with high-bandwidth stream sockets, something like 2 MiB (2,097,152 bytes) or more is much more reasonable. It is not too much memory wasted, and you rarely do any memory copies (and when you do, they tend to be short). (You can even optimize the memory moves so that only full, aligned cachelines are copied, since leaving almost one cacheline of garbage at the start of the buffer is insignificant.)
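Here is a minimal sketch of that buffering scheme (all names, sizes, and the framing assumed are illustrative, error handling is omitted, and in real code the buffer would be heap-allocated with its size a tunable):

```c
/* Single-buffer scheme: keep the current (possibly partial) frame in the
 * buffer, and memmove() it to the front only when it starts past the
 * midpoint, so moves are rare and short. */
#include <string.h>
#include <unistd.h>

#define BUF_SIZE (2 * 1024 * 1024)      /* room for two maximum-size frames */

struct inbuf {
    char   data[BUF_SIZE];
    size_t head;                        /* start of the current frame */
    size_t tail;                        /* end of valid data */
};

/* Read more data from fd into the unused tail of the buffer, compacting
 * first if the current frame already starts past the midway point. */
static ssize_t inbuf_fill(struct inbuf *b, int fd)
{
    if (b->head > BUF_SIZE / 2) {
        memmove(b->data, b->data + b->head, b->tail - b->head);
        b->tail -= b->head;
        b->head = 0;
    }
    ssize_t n = read(fd, b->data + b->tail, BUF_SIZE - b->tail);
    if (n > 0)
        b->tail += (size_t)n;
    return n;
}

/* Skipping an uninteresting frame that is already buffered is just pointer
 * arithmetic; no data is copied anywhere. */
static void inbuf_skip(struct inbuf *b, size_t framelen)
{
    b->head += framelen;                /* caller ensures framelen <= tail - head */
}
```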

I do HPC with large datasets (including text-form molecular data, where records are separated by newlines, and custom parsers are used for converting decimal integers or floating-point values for better performance), and this approach does work well in practice. Simply put, skipping data that is already in your buffer is not something you need to optimize; it is insignificant overhead compared to what you gain by simply avoiding the work you do not need to do.

There is also the question of what you wish to optimize by doing that: the CPU time/resources used, or the wall-clock time the overall task takes. They are completely different things.

For example, if you need to sort a large number of text lines from some file, you use the least CPU time if you simply read the entire dataset into memory, construct an array of pointers to each line, sort the pointers, and finally write out each line (using internal buffering and/or POSIX writev(), so that you do not need a write() syscall for each separate line).
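A sketch of that CPU-minimizing variant, assuming the whole input already sits in one modifiable memory block ending in a newline (error handling and allocation checks are omitted):

```c
/* Offline sort: split the in-memory data into lines, sort an array of
 * pointers, and write the result out in batches with writev(). */
#include <stdlib.h>
#include <string.h>
#include <sys/uio.h>

static int cmp_lines(const void *a, const void *b)
{
    return strcmp(*(char *const *)a, *(char *const *)b);
}

/* 'data' holds the entire input and may be modified in place. */
static void sort_and_write(char *data, size_t len, int outfd)
{
    size_t nlines = 0, cap = 1024;
    char **lines = malloc(cap * sizeof *lines);

    /* Turn each line into a NUL-terminated string and remember where it starts. */
    for (char *p = data; p < data + len; ) {
        char *nl = memchr(p, '\n', (size_t)(data + len - p));
        if (!nl)
            break;                      /* ignore a final line with no newline */
        *nl = '\0';
        if (nlines == cap)
            lines = realloc(lines, (cap *= 2) * sizeof *lines);
        lines[nlines++] = p;
        p = nl + 1;
    }

    qsort(lines, nlines, sizeof *lines, cmp_lines);

    /* Write out in batches of up to 32 lines (64 iovecs: line + "\n" each). */
    struct iovec iov[64];
    for (size_t i = 0; i < nlines; ) {
        int n = 0;
        while (i < nlines && n + 2 <= 64) {
            iov[n].iov_base = lines[i];
            iov[n].iov_len  = strlen(lines[i]);
            iov[n + 1].iov_base = "\n";
            iov[n + 1].iov_len  = 1;
            n += 2;
            i++;
        }
        writev(outfd, iov, n);          /* real code must handle partial writes */
    }
    free(lines);
}
```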

However, if you wish to minimize the wall-clock time used, you can use a binary heap or a balanced binary tree instead of an array of pointers, and heapify or insert in order each line as soon as it has been completely read, so that when the last line is finally read, you already have the lines in their correct order. This is because the storage I/O (for all but pathological input cases, something like single-character lines) takes longer than sorting the lines using any robust sorting algorithm! Sorting algorithms that work online (as the data comes in) are typically not as CPU-efficient as those that work offline (on complete datasets), so this ends up using somewhat more CPU time; but because the CPU work is done at a time that would otherwise be wasted waiting for the entire dataset to load into memory, the task completes in less wall-clock time!
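And a sketch of that wall-clock-minimizing variant, using a plain binary min-heap of line pointers filled while the input is still arriving (input assumed to come from stdin here; error handling is minimal):

```c
/* Online sort: push each line into a binary min-heap as soon as it has been
 * read, so the ordering work overlaps with the wait for more input. */
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>

struct heap {
    char **item;
    size_t used, size;
};

static void heap_push(struct heap *h, char *line)
{
    if (h->used == h->size)
        h->item = realloc(h->item,
                          (h->size = h->size ? 2 * h->size : 1024) * sizeof *h->item);
    size_t i = h->used++;
    while (i > 0 && strcmp(line, h->item[(i - 1) / 2]) < 0) {
        h->item[i] = h->item[(i - 1) / 2];     /* sift the parent down */
        i = (i - 1) / 2;
    }
    h->item[i] = line;
}

static char *heap_pop(struct heap *h)
{
    if (h->used == 0)
        return NULL;
    char *top = h->item[0];
    char *last = h->item[--h->used];
    size_t i = 0;
    while (2 * i + 1 < h->used) {              /* sift the last item down */
        size_t child = 2 * i + 1;
        if (child + 1 < h->used && strcmp(h->item[child + 1], h->item[child]) < 0)
            child++;
        if (strcmp(last, h->item[child]) <= 0)
            break;
        h->item[i] = h->item[child];
        i = child;
    }
    h->item[i] = last;
    return top;
}

int main(void)
{
    struct heap h = {0};
    char *line = NULL;
    size_t cap = 0;
    ssize_t len;

    /* Push lines as they arrive; with a slow producer, most of the heap work
     * is already done by the time the last line has been received. */
    while ((len = getline(&line, &cap, stdin)) > 0) {
        size_t n = (line[len - 1] == '\n') ? (size_t)len - 1 : (size_t)len;
        heap_push(&h, strndup(line, n));
    }
    for (char *s; (s = heap_pop(&h)) != NULL; free(s))
        puts(s);
    free(line);
    free(h.item);
    return 0;
}
```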


If there is need and interest, I can provide a practical example to illustrate the techniques. However, there is absolutely no magic involved, and any C programmer should be able to implement these (both the buffering scheme and the sort scheme) on their own. (I do consider using resources like the Linux man pages online, and Wikipedia articles and pseudocode on, for example, binary heaps, to be doing it "on your own". As long as you do not just copy-paste existing code, I consider it doing it "on your own", even if somebody or some resource helps you find the good, robust ways to do it.)

Nominal Animal
  • Answer is good even for the "multicast streaming" scenario, where the better solutions translate to: if the discard-gaps are huge, it is more efficient to stop listening altogether for a while; if they are small, it is more efficient but tedious to read big chunks and filter them yourself; and if you don't want tedious, consider a higher-level language that will do it for you. Correct? – user185953 Mar 03 '21 at 09:58