Why do system calls return EFAULT instead of sending a segfault?

Question

To be clear, this is a design question rather than an implementation question.

I want to know the rationale behind why POSIX behaves this way. POSIX system calls, when given an invalid memory location, return EFAULT rather than crashing the userspace program (by sending a SIGSEGV), which makes their behavior inconsistent with userspace functions.

Why? Doesn't this just hide memory bugs? Is it a historical mistake or is there a good reason for it?

To me a SIGSEGV/SIGBUS makes more sense too. Right now I'm playing with 2 custom syscalls which should have (slower) userspace emulations too. I don't see why the actual syscall and the emulation should behave differently if they're passed invalid buffers. Even POSIX seems to be of the opinion that users shouldn't have to care whether a system function is a real syscall or a userspace function. My related question: https://stackoverflow.com/questions/44239545/generating-segfault-from-a-custom-syscall/44251112 – Petr Skocik, May 30 '17 at 08:45
This is one reason that `strace ./a.out` is very handy for debugging toy programs that don't check errors on their syscalls. — Peter Cordes, Oct 06 '20 at 01:29

David Given · Answer 1 · 2017-05-05T20:25:37.883


Because system calls are executed by the kernel, not by the user program: when the system call occurs, the user process halts and waits for the kernel to finish.

The kernel itself, of course, isn't allowed to seg fault, so it has to manually check all the address areas the user process gives it. If one of these checks fails, the system call fails with EFAULT. So in this situation a segmentation fault hasn't actually happened: it's been avoided by the kernel explicitly checking to make sure all the addresses are valid. Hence it makes sense that no signal is sent.
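To make this concrete, here is a minimal sketch in C, assuming a Linux-like system. The helper name `errno_from_write_with_bad_buffer` is made up for illustration. It hands `write(2)` a buffer inside a page mapped with `PROT_NONE`, and writes to a pipe so that the kernel really has to read the buffer. POSIX does not guarantee that a bad address is detected, so treat the EFAULT result as the expected Linux behaviour rather than a promise.

```c
#define _DEFAULT_SOURCE
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: passes an inaccessible buffer to write(2) and
 * returns the errno the kernel reported (0 if the call unexpectedly
 * succeeded, -1 if the setup itself failed). */
int errno_from_write_with_bad_buffer(void)
{
    int fds[2];
    if (pipe(fds) != 0)
        return -1;

    /* Reserve one page that may be neither read nor written. */
    long page = sysconf(_SC_PAGESIZE);
    void *bad = mmap(NULL, (size_t)page, PROT_NONE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (bad == MAP_FAILED) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    /* The kernel has to read from `bad` to fill the pipe, and cannot. */
    errno = 0;
    ssize_t n = write(fds[1], bad, 1);
    int err = (n == -1) ? errno : 0;   /* capture before cleanup touches errno */

    munmap(bad, (size_t)page);
    close(fds[0]);
    close(fds[1]);
    return err;
}
```

Calling the helper from `main` and comparing the result with `EFAULT` shows the process surviving the bad pointer: no signal is delivered, and the only evidence is the error code.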

In addition, if a signal were sent, there'd be no way the kernel could attach a meaningful program counter to the signal, because the user process isn't actually executing while the system call is running. This means there'd be no way for the user process to produce decent diagnostics, restart the failed instruction, etc.

To summarise: mostly historical, but there is actual logic to the reasoning. Like EINTR, this doesn't make it any less irritating to deal with.

edited May 05 '17 at 20:25

answered Mar 08 '12 at 00:27

David Given



What do you mean by 'attach a meaningful program counter' to the signal? You mean it wouldn't know where to resume after the user's signal handler was executed? Wouldn't that be just after the system call? – Joseph Garvin Mar 09 '12 at 16:23
Yes, ignore the bit about resumption --- thinking about it some more, that's not actually relevant. (Because if the kernel's faking up a seg fault, it can easily fake the register state.) I think the main issue here is: the kernel is not sending you a seg fault because _no seg fault actually occurred_. – David Given Mar 09 '12 at 17:37

I suspect it's because they didn't want to do the work to make it work, and this way was easier. – user253751 Nov 19 '18 at 23:21

wildplasser · Answer 2 · 2018-11-21T15:00:32.630

Well, what would you want to happen? A system call is a request to the system. If you ask "When does the ferry to Munchen leave?", would you like the program to crash, or to get a return value of -1 with errno = ENOHARBOR? If you ask the system to put your car into your handbag, would you like to have your handbag destroyed, or a return of -1 with errno set to EBAGTOOSMALL?

There is a technical detail: arguments have to be copied between user space and kernel space when entering or leaving a system call. Mostly for security reasons, the system is very reluctant to write into user space. (Linux has a `copy_to_user()` function for this, and `copy_from_user()` for the other direction, which checks that the user-space address is valid before doing the actual copying.)

> Why? Doesn't this just hide memory bugs?

On the contrary. It allows your program to handle the error (impossible in this case) and terminate gracefully. But the program must check the return value from system calls and inspect errno. In the case of SIGSEGV, there is very little for your program to do, so mapping EFAULT to SIGSEGV would be a bad idea.
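A minimal sketch of what "check the return value and inspect errno" can look like in C. The wrapper name `read_checked` and its reporting policy are hypothetical choices for illustration, not an established API: it retries on EINTR, prints a specific diagnostic for EFAULT, and leaves errno intact for the caller.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical wrapper around read(2): returns the byte count on
 * success, or -1 after printing a diagnostic. errno is preserved. */
ssize_t read_checked(int fd, void *buf, size_t len)
{
    for (;;) {
        ssize_t n = read(fd, buf, len);
        if (n >= 0)
            return n;

        if (errno == EINTR)
            continue;               /* interrupted by a signal: retry */

        int saved = errno;          /* fprintf may overwrite errno */
        if (saved == EFAULT)
            fprintf(stderr, "read: buffer %p is not a valid address "
                            "(bug in the caller)\n", buf);
        else
            fprintf(stderr, "read: %s\n", strerror(saved));
        errno = saved;
        return -1;
    }
}
```

With a wrapper like this, a bad buffer shows up as a -1 return and a message on stderr, and the caller decides whether to clean up and exit or carry on.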

System calls were designed to always return (or block indefinitely...), whether they succeed or fail.

And a technical aspect could be that segmentation faults, bus errors, floating-point exceptions and the like are (often) generated by hardware exceptions.
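For contrast with the syscall case, here is a hedged C sketch of the user-space side, assuming Linux (other systems may deliver SIGBUS instead). The names `on_segv` and `user_read_raises_sigsegv` are invented for this example. The process itself reads from a `PROT_NONE` page, the hardware fault is turned into SIGSEGV, and the handler receives the faulting address in `si_addr`. Jumping out of the handler with `siglongjmp` is done only to keep the demonstration self-contained.

```c
#define _DEFAULT_SOURCE
#include <setjmp.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static sigjmp_buf recover_point;
static void *volatile fault_address;

/* Records the address that faulted and jumps back to the caller. */
static void on_segv(int signo, siginfo_t *info, void *context)
{
    (void)signo;
    (void)context;
    fault_address = info->si_addr;
    siglongjmp(recover_point, 1);
}

/* Returns 1 if reading an inaccessible page raised SIGSEGV for that
 * page's address, 0 if it did not, -1 if the setup failed. */
int user_read_raises_sigsegv(void)
{
    long page = sysconf(_SC_PAGESIZE);
    void *bad = mmap(NULL, (size_t)page, PROT_NONE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (bad == MAP_FAILED)
        return -1;

    struct sigaction sa, old;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_segv;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGSEGV, &sa, &old) != 0) {
        munmap(bad, (size_t)page);
        return -1;
    }

    int result;
    if (sigsetjmp(recover_point, 1) == 0) {
        volatile char c = *(volatile char *)bad;  /* faults here */
        (void)c;
        result = 0;                               /* no signal arrived */
    } else {
        result = (fault_address == bad) ? 1 : 0;  /* came back via handler */
    }

    sigaction(SIGSEGV, &old, NULL);               /* restore old handler */
    munmap(bad, (size_t)page);
    return result;
}
```

The same page passed to a system call would normally produce an error return instead, since the access is then made by the kernel on the process's behalf.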

I already said in the question what would make more sense: sending a segfault signal to the app. – Joseph Garvin, Mar 09 '12 at 16:23
Ah, maybe I misunderstood your question. There is a *technical* problem: `copy_from_user()` and `copy_to_user()` (in the Linux case) are executed in kernel mode, so the checking has to be performed "manually" by the kernel. There is no SIGSEGV possible; the kernel *could* perform this copy for you. So formally, it is not a segmentation violation (it *would* be, if performed from userspace). Also, from the view of the user process, a syscall is just a function with a return value. Signals are supposed to represent some asynchronous event. – wildplasser, Mar 09 '12 at 16:41
This is a bad answer. This could apply to any function, syscall or not. — user253751, Nov 19 '18 at 23:20
