
I have ported a program from select to epoll to increase the number of sockets we can handle. I have added the sockets to the epoll FD and can read and write happily.

However, I am concerned about potential starvation of sockets even though I am using level-triggered events. The scenario I am worried about is when there are more sockets ready than epoll_event structures. I know that the next time I call epoll_wait it will give me the rest of them, but I wonder what order I get them in with regard to who didn't make the cut last time versus this time.

An example: say I have 10 sockets connected and added to the epoll FD, but only enough memory for 5 epoll_event structures. Assume that in the time between each epoll_wait call, all 10 sockets receive data. The first epoll_wait will return 5 epoll_event structures for processing; let's say they are for sockets 1-5. I process those 5 sockets, and while I am doing so, more data comes in and all 10 sockets again have data to be read. I call epoll_wait again and get 5 more epoll_event structures.

My question is: which 5 sockets will I get on the second call to epoll_wait? Will it be sockets 1-5, because they were added to the epoll FD first? Or will I get sockets 6-10, because those events were raised before more data came in on sockets 1-5?

Essentially, does epoll_wait behave like a FIFO queue, or does it simply scan an internal list of sockets (thereby favoring the sockets at the front of the list)?
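
Here is a minimal sketch of the kind of loop I mean (MAX_EVENTS and handle_io are placeholders for the real program, which handles many more sockets):

```c
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/epoll.h>

#define MAX_EVENTS 5   /* deliberately smaller than the number of ready sockets */

/* Placeholder for the real per-socket read/write handling. */
void handle_io(int fd)
{
    char buf[4096];
    ssize_t r = read(fd, buf, sizeof buf);   /* the real code parses and replies */
    if (r <= 0)
        close(fd);                           /* error or peer closed the socket */
}

void event_loop(int epfd)
{
    struct epoll_event events[MAX_EVENTS];

    for (;;) {
        /* Level-triggered: sockets that are still ready but did not fit into
         * this batch will be reported again by a later call. The question is
         * in what order. */
        int n = epoll_wait(epfd, events, MAX_EVENTS, -1);
        if (n == -1) {
            perror("epoll_wait");
            exit(EXIT_FAILURE);
        }
        for (int i = 0; i < n; i++)
            handle_io(events[i].data.fd);
    }
}
```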

EDIT: This is Linux kernel v4.9.62

jxh
Mr. Rogers
  • Is this Linux? https://stackoverflow.com/a/19114553/315052 – jxh May 17 '18 at 22:04
  • @jxh The epoll functions are Linux specific. – Some programmer dude May 17 '18 at 22:09
  • @Someprogrammerdude: Other Unices will implement wrapper APIs for compatibility. E.g.: [FreeBSD's Linux binary compatibility feature](https://www.freebsd.org/doc/handbook/linuxemu.html); And even Windows has [wepoll](https://github.com/piscisaureus/wepoll) – jxh May 17 '18 at 22:11
  • The documentation is unclear on this point. One hopes that the kernel *queues* epoll events, so that the postulated second `epoll_wait()` retrieves events on file descriptors 6-10, but it looks like I'd have to study the kernel sources to be sure (and since it's undocumented, it might change). – John Bollinger May 17 '18 at 22:14
  • @jxh I read your answer to the linked question. If the events are indeed kept in a linked list, then epoll works like a FIFO and my starvation concern is not a problem after all. However, since this relies on undocumented behavior, I'll leave the question open for other input in case the behavior has changed in the last 5 years. – Mr. Rogers May 17 '18 at 22:16
  • @Mr.Rogers: Your question and that question have different starting points, but the answer is the same. Unclear if dup-hammer should be applied. In terms of future directions, it is unlikely default behavior will change, but I believe there is room to allow the priority of events to be modified via configuration or other weight assignment method. – jxh May 17 '18 at 22:17

2 Answers


The observation by @jxh about the behavior is correct, and the behavior is long established (and was originally intended, if I correctly recall my email conversations with the implementer, Davide Libenzi, many years ago). It's unfortunate that it has not been documented so far, but I've fixed that for the upcoming manual pages release, where epoll_wait(2) will carry the text:

If more than maxevents file descriptors are ready when epoll_wait() is called, then successive epoll_wait() calls will round robin through the set of ready file descriptors. This behavior helps avoid starvation scenarios, where a process fails to notice that additional file descriptors are ready because it focuses on a set of file descriptors that are already known to be ready.
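
For anyone who wants to see this for themselves, here is a small self-contained test (my own illustration, using pipes rather than sockets; on a kernel with the behavior described above, the second call should report the descriptors that did not fit into the first):

```c
#include <stdio.h>
#include <unistd.h>
#include <sys/epoll.h>

#define NFDS      10
#define MAXEVENTS 5

int main(void)
{
    int epfd = epoll_create1(0);
    int pipes[NFDS][2];

    if (epfd == -1) {
        perror("epoll_create1");
        return 1;
    }

    /* Register the read end of 10 pipes and make every one of them ready. */
    for (int i = 0; i < NFDS; i++) {
        if (pipe(pipes[i]) == -1) {
            perror("pipe");
            return 1;
        }
        struct epoll_event ev = { .events = EPOLLIN, .data.fd = pipes[i][0] };
        if (epoll_ctl(epfd, EPOLL_CTL_ADD, pipes[i][0], &ev) == -1) {
            perror("epoll_ctl");
            return 1;
        }
        write(pipes[i][1], "x", 1);   /* every registered fd now has data pending */
    }

    /* Two successive waits with maxevents smaller than the ready set. The data
     * is never drained, so all 10 fds stay ready (level-triggered); with the
     * round-robin behavior the second call reports the five fds that did not
     * fit into the first call. */
    for (int round = 1; round <= 2; round++) {
        struct epoll_event events[MAXEVENTS];
        int n = epoll_wait(epfd, events, MAXEVENTS, 0);
        printf("epoll_wait %d:", round);
        for (int i = 0; i < n; i++)
            printf(" fd %d", events[i].data.fd);
        printf("\n");
    }
    return 0;
}
```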

mtk
  • I see your commits to the man pages clarifying this behavior; I will mark your response as the answer since it is the most authoritative. I commend your efforts in maintaining the man pages, not a task I would wish on my enemy. – Mr. Rogers Jul 02 '18 at 20:21

Perusing the source file for epoll (fs/eventpoll.c in the kernel tree), one sees that the ready events are maintained in a linked list. Events are removed from the head of the list and added to the end of the list.

Based on that, the answer is that the descriptor order is based on the order in which they became ready.
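
To illustrate what that head/tail discipline implies for the scenario in the question, here is a toy user-space model (my own illustration of the ordering, not the kernel's actual code):

```c
#include <stdio.h>

#define NFDS      10
#define MAXEVENTS 5

int main(void)
{
    int ready[NFDS];          /* FIFO of ready descriptors, head at index 0 */
    int nready = 0;

    for (int fd = 1; fd <= NFDS; fd++)    /* all ten sockets become ready */
        ready[nready++] = fd;

    for (int round = 1; round <= 2; round++) {
        int batch[MAXEVENTS], nbatch = 0;

        /* A "wait" takes up to maxevents descriptors from the head. */
        while (nbatch < MAXEVENTS && nready > 0) {
            batch[nbatch++] = ready[0];
            for (int i = 1; i < nready; i++)
                ready[i - 1] = ready[i];
            nready--;
        }

        /* Level-triggered and still readable, so they go back on the tail. */
        for (int i = 0; i < nbatch; i++)
            ready[nready++] = batch[i];

        printf("wait %d returns:", round);
        for (int i = 0; i < nbatch; i++)
            printf(" %d", batch[i]);
        printf("\n");
    }
    return 0;
}
```

Because descriptors that are still ready go back on the tail, the descriptors that missed the first harvest (6-10) come out ahead of 1-5 on the second call.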

jxh
  • I duplicated my answer from a related question. This answer is community wiki'd. – jxh May 17 '18 at 22:34
  • The fly in the ointment seems to be the *"I only have enough memory for 5 epoll_event structures"* part of the question -- which is where I am a bit lost, since that suggests an out-of-memory condition. I presume the program runs in user space, so if the epoll code queues events in a linked list in kernel space, then the list can keep accumulating events as they occur. However, if the list were held in user space and an out-of-memory condition were hit, I don't see how some error condition wouldn't be raised, resulting in the events not being saved. Am I thinking about this wrong? – David C. Rankin May 17 '18 at 22:48
  • @DavidC.Rankin: I read that as a hypothetical limitation to illustrate the issue he is trying to ask. Assuming that the kernel implementation of the `epoll` interface is only limited by the FD limit of the system, but the user space can be limited arbitrarily (say via `ulimit` or `sysconfig`), then he just wants to know what happens when he re-issues `epoll_wait` after clearing away the events from the previous call to `epoll_wait`. – jxh May 17 '18 at 22:54
  • @DavidC.Rankin: Err... The FD limit on the process would get enforced in the kernel. I somehow forgot that new sockets actually pile onto the accepting socket until after a proper FD is created on a call to `accept`. – jxh May 17 '18 at 23:10
  • Thank you. I was picking through the epoll code -- but it will take me quite a while to digest exactly how that would work. It just struck me as a very intriguing question. – David C. Rankin May 17 '18 at 23:17
  • @DavidC.Rankin: The other reason to limit the size of the epoll_event array is to put a hard bound on each pass through the event loop, so that other work can still get done (assuming a single-threaded implementation). – jxh May 17 '18 at 23:18
  • Yes, one of the aspects of the epoll implementation that I had not considered (and that did somewhat make my eyes glaze over) was the multiple levels of locking needed to handle the different states the code has to deal with in working between kernel space and user space while staying thread-safe. Thanks again for your help peeling back the layers of this onion. – David C. Rankin May 18 '18 at 02:16