What is the difference between nonpreemptive and preemptive kernels, when switching to user mode?

Question

I'm reading "Understanding the Linux Kernel, 3rd Edition", and in Chapter 5, Section "Kernel Preemption", it says:

All process switches are performed by the switch_to macro. In both preemptive and nonpreemptive kernels, a process switch occurs when a process has finished some thread of kernel activity and the scheduler is invoked. However, in nonpreemptive kernels, the current process cannot be replaced unless it is about to switch to User Mode.

I still don't see the difference here between nonpreemptive and preemptive kernels, because either way you need to wait for the current process to switch to user mode.

Say a process p is running in kernel mode and its time quantum expires. Then scheduler_tick() is called, and it sets the NEED_RESCHED flag of p. But schedule() is invoked only when p switches to user mode (right?).

So what if p never switches to user mode?

And if it does switch to user mode, but a "long" time passes between the moment scheduler_tick() sets NEED_RESCHED and the moment p actually switches to user mode, has it then used more than its quantum?

What do you mean by "never switches to user mode"? It is unlikely that a process stays running in kernel mode for a very long time: either it is blocked or waiting, or it runs for a short time (or the kernel is badly designed). — Jean-Baptiste Yunès, Oct 23 '16 at 16:04
"But schedule() is invoked only when p switch to user mode (right?)." , no - wrong. That's the point, a process executing in the kernel can be preempted not only when it returns back to userspace. — nos, Oct 23 '16 at 16:26
@nos "the current process cannot be replaced unless it is about to switch to User Mode" — Mano Mini, Oct 23 '16 at 16:40
@manomino Which is only true for a non-preemptive kernel. It is not true for a preemptive kernel. — nos, Oct 23 '16 at 16:41
@ManoMini here is one place http://lxr.free-electrons.com/source/fs/eventfd.c#L247 , here's another http://lxr.free-electrons.com/source/lib/klist.c#L256 , and if you look around, the developers have found many more places where it's suitable to call schedule(), See also e.g. https://kernelnewbies.org/FAQ/Preemption and http://matroid.org/resources/KernelPreemption/PreemptiveKernel_v1.1_no_background.pdf — nos, Oct 23 '16 at 17:27

score 2 · Accepted Answer · answered Oct 23 '16 at 19:34

In a non-preemptive kernel, schedule() is called when returning to userspace (and also wherever a system call blocks, and in the idle task).

In a preemptive kernel, schedule() is also called when returning from any interrupt, and in a few other places, e.g. on the mutex_unlock() slow path, under certain conditions while receiving network packets, ...

As an example, imagine a process A that issues a syscall, which is interrupted by a device-generated interrupt, which is in turn interrupted by a timer interrupt:

 process A userspace → process A kernelspace → device ISR → timer ISR
                  syscall               device IRQ    timer IRQ

When the timer ISR ends, it returns to the device ISR, which then returns to kernelspace, which then returns to userspace. A preemptive kernel checks whether it needs to reschedule processes at every return. A non-preemptive kernel only does that check when returning to userspace.
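
As a rough illustration of the example above, here is a Python toy model (not kernel code) that unwinds the nested contexts and records at which returns a reschedule check would happen. The names `NESTED_CONTEXTS` and `resched_check_points` are made up for this sketch, and it deliberately ignores any other conditions a real kernel applies, such as preemption being temporarily disabled.

```python
# Toy model only: each entry is a context that was interrupted by the next one.
NESTED_CONTEXTS = [
    "userspace",
    "kernelspace (syscall)",
    "device ISR",
    "timer ISR",
]


def resched_check_points(contexts, preemptive):
    """List the returns at which the need-resched flag would be checked."""
    checks = []
    # Unwind from the innermost context back out to userspace.
    for i in range(len(contexts) - 1, 0, -1):
        returning_from = contexts[i]
        returning_to = contexts[i - 1]
        # Preemptive: check at every return.
        # Non-preemptive: check only when the return lands in userspace.
        if preemptive or returning_to == "userspace":
            checks.append(f"{returning_from} -> {returning_to}")
    return checks


if __name__ == "__main__":
    for preemptive in (False, True):
        kind = "preemptive" if preemptive else "non-preemptive"
        print(kind, resched_check_points(NESTED_CONTEXTS, preemptive))
```

With `preemptive=False` only the final return to userspace is listed; with `preemptive=True` all three returns are.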

score -1 · Answer 2 · answered Oct 23 '16 at 16:15

There are two ways to switch processes:

1. The process yields the CPU; or
2. The operating system says to the process "you're done for now."

The first occurs when the process executes some action that does not allow it to continue. For example, it calls a sleep-type function or performs I/O (e.g. to a disk, or to a terminal where it has to wait for a user response).

The second occurs when the operating system's internal timer goes off and as part of handling the timer interrupt, the O/S determines that another process should run.

A kernel that only handles the first type of context switch is nonpreemptive. A kernel that handles both types of context switches is preemptive.
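
To make the distinction drawn in this answer concrete, here is a minimal single-CPU simulation in Python. It is only a sketch: the function `run`, the `quantum` parameter and the "burst" representation are invented for illustration, and "nonpreemptive" is used in this answer's sense (switching only when a task yields), not in the book's sense.

```python
from collections import deque


def run(tasks, quantum=None):
    """Simulate one CPU and return the (name, ticks) slices in order.

    tasks:   dict mapping a task name to a list of bursts, where each burst
             is the number of ticks the task runs before yielding by itself.
    quantum: None means tasks are switched only when they yield; a number
             means a timer also forces a switch after that many ticks.
    """
    ready = deque((name, list(b)) for name, b in tasks.items() if b)
    timeline = []
    while ready:
        name, bursts = ready.popleft()
        burst = bursts[0]
        ran = burst if quantum is None else min(burst, quantum)
        timeline.append((name, ran))
        if ran == burst:
            bursts.pop(0)            # burst finished: the task yields
        else:
            bursts[0] = burst - ran  # timer fired: the rest runs later
        if bursts:
            ready.append((name, bursts))
    return timeline


if __name__ == "__main__":
    tasks = {"A": [5], "B": [1]}
    print(run(tasks))             # yield-only switching
    print(run(tasks, quantum=2))  # timer-driven switching as well
```

Without a quantum, task B has to wait until A has finished its whole 5-tick burst; with `quantum=2`, B gets the CPU after A's first two ticks.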

Note that yielding requires the execution of a system service. That requires triggering an exception to invoke the kernel mode system service handler.

Preemption requires an interrupt. On most non-Intel systems, exceptions and interrupts are handled in the same way (Intel provides multiple ways to do exceptions). On most systems the procedure for returning from an interrupt and from an exception is the same.

The context switch in both cases occurs BEFORE the process returns to user mode. When a process resumes execution, the first thing it does is return from kernel mode to user mode.

However, in nonpreemptive kernels, the current process cannot be replaced unless it is about to switch to User Mode.

This is a qualitative statement. The normal yield sequence is:

1. Trigger exception
2. Enter kernel mode
3. Dispatch to system service handler
4. Do stuff
5. Tell the O/S to yield. Context switch out
6. Some event occurs telling the O/S the process can run again.
7. O/S resumes the process in kernel mode.
8. Process exits kernel mode
9. Process resumes on its merry way in user mode.

The book's statement is that nothing or very little occurs between #7 and #8. That is normally true but it is entirely possible the system service could put more work there. It just doesn't happen normally.

Linux has always handled switching processes in both the ways you describe. But in the somewhat distant past (before v2.5.4), a process in Linux that was executing kernel code could not be preempted (replaced by the scheduler) except at one specific code path in the kernel. People called this a non-preemptive kernel, to distinguish it from patches (and the current way Linux works) that made the kernel able to preempt a process at many other points while it was executing kernel code. — nos, Oct 23 '16 at 16:34
