Why disable interrupts in XV6 scheduler?

Question

For the sched() function (proc.c) in XV6

Why must we disable interrupts when doing a context switch? Is it because, if interrupts were enabled, the sched function could be invoked repeatedly?
Why must ncli (the depth of pushcli nesting) be equal to 1?

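    void
    sched(void)
    {
      int intena;

      // Interrupts must already be off on this CPU.
      if(readeflags()&FL_IF)
        panic("sched interruptible");
      // The caller must have changed the process state before switching away.
      if(cp->state == RUNNING)
        panic("sched running");
      // The process table lock must be held across the switch.
      if(!holding(&ptable.lock))
        panic("sched ptable.lock");
      // Exactly one pushcli level may be outstanding.
      if(c->ncli != 1)
        panic("sched locks");

      // Save intena across the switch and restore it when this task resumes.
      intena = c->intena;
      swtch(&cp->context, &c->context);
      c->intena = intena;
    }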

score 3 · Accepted Answer · answered Jun 12 '20 at 00:06

why must we disable interrupts when doing a context switch? Is it because if interrupts are enabled, the sched function can be repeatedly invoked?

Each task has state, which includes the state of the CPU and the state of various variables the OS uses to keep track of things (e.g. which task is currently running). The swtch() function switches from one task's state to another, but it doesn't do this atomically. If an IRQ occurred while swtch() was in the middle of switching from one task to another, the IRQ handler would see inconsistent state (e.g. the "which task is currently running" variable not matching the current virtual address space). That can lead to subtle bugs that are extremely difficult to reproduce (because the timing has to be exactly right for the problem to happen) and extremely hard to find and fix.
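To make the "inconsistent state" problem concrete, here is a minimal single-threaded sketch. It is not xv6 code: `sim_state`, `sim_irq_handler`, `sim_maybe_irq` and `sim_switch_to` are hypothetical names, and the "interrupt" is simulated by calling the handler at the worst possible moment, between the two updates that make up the switch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the bookkeeping a switch has to update. */
struct sim_state {
    int  current_task;          /* "which task is currently running" */
    int  address_space;         /* which task's address space is loaded */
    bool irq_enabled;           /* simulated interrupt-enable flag */
    int  inconsistencies_seen;  /* how often a handler saw a mismatch */
};

/* Simulated IRQ handler: records whether the state it sees is consistent. */
void sim_irq_handler(struct sim_state *s) {
    if (s->current_task != s->address_space)
        s->inconsistencies_seen++;
}

/* Deliver a simulated interrupt only if interrupts are enabled. */
void sim_maybe_irq(struct sim_state *s) {
    if (s->irq_enabled)
        sim_irq_handler(s);
}

/* A two-step, non-atomic switch with an interrupt arriving in the middle. */
void sim_switch_to(struct sim_state *s, int next, bool disable_irqs) {
    bool saved = s->irq_enabled;
    if (disable_irqs)
        s->irq_enabled = false;

    s->current_task = next;   /* step 1 */
    sim_maybe_irq(s);         /* interrupt arrives between the two steps */
    s->address_space = next;  /* step 2 */

    s->irq_enabled = saved;   /* restore the previous interrupt state */
}
```

With interrupts left enabled the handler observes the half-switched state once; with them disabled it never runs during the window, so it cannot see the mismatch.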

Note that operating systems that support multiple CPUs can't rely on "IRQs disabled" alone to prevent reentrancy problems (e.g. disabling IRQs on one CPU won't prevent another CPU from calling sched() while it's already running on the first). For this, XV6 (which does support multiple CPUs) uses a lock (the ptable.lock).
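A rough sketch of why both mechanisms are needed, under loud assumptions: `sim_core`, `sim_lock`, `sim_try_enter` and `sim_leave` are hypothetical names, and this uses a try-lock on a C11 `atomic_flag` so the example can run on a single thread (a real spinlock would spin until the lock becomes free instead of returning false).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical model: each core has its own interrupt flag,
 * but the lock is shared by every core. */
struct sim_lock { atomic_flag held; };
struct sim_core { bool irq_enabled; };

/* Try to enter the critical section from `core`.
 * Returns false if another core already holds the lock. */
bool sim_try_enter(struct sim_core *core, struct sim_lock *lk) {
    bool saved = core->irq_enabled;
    core->irq_enabled = false;                  /* affects this core only */
    if (atomic_flag_test_and_set(&lk->held)) {  /* lock already taken */
        core->irq_enabled = saved;              /* back out */
        return false;
    }
    return true;
}

/* Leave the critical section: release the lock, then re-enable interrupts. */
void sim_leave(struct sim_core *core, struct sim_lock *lk) {
    atomic_flag_clear(&lk->held);
    core->irq_enabled = true;
}
```

Disabling interrupts on core A leaves core B's flag untouched, so only the shared lock keeps B out while A is inside.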

Why must ncli (the depth of pushcli nesting) be equal to 1?

From the CPU's perspective:

one task causes ncli to be set to 1
a task switch happens
another task causes ncli to be decremented to zero

From a task's perspective:

the task causes ncli to be set to 1
many task switches happen (while other tasks are given CPU time) until the task is given CPU time again
the task causes ncli to be decremented to zero

Both of these perspectives need to be compatible. For example, if one task caused ncli to be set to 2 and then (after task switches) decremented ncli twice, it would be fine "from that task's perspective", but it would break "from the CPU's perspective": the next task to run on that CPU would decrement ncli only once, leaving IRQs disabled when they shouldn't be.

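In other words, ncli must have the same value at every task switch. In XV6 that one level comes from acquiring ptable.lock (acquire() calls pushcli()); holding any other lock at that point would push ncli above 1, hence the "sched locks" panic message. The value 1 is also "good enough" for the majority of callers, and requiring a higher value would add unnecessary overhead.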
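The mismatch between the two perspectives can be sketched with a small simulation. This is a simplified model, not the kernel's code: `sim_cpu`, `sim_pushcli`, `sim_popcli` and `sim_switch_leaves_interrupts_enabled` are hypothetical names, the interrupt flag is a plain bool, and the save/restore of intena around swtch() is left out. The push/pop logic follows the usual pattern of remembering the interrupt state at the outermost push and restoring it at the matching pop.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for per-CPU state; not the real struct cpu. */
struct sim_cpu {
    int  ncli;     /* depth of pushcli nesting */
    bool intena;   /* were interrupts on before the outermost pushcli? */
    bool if_flag;  /* simulated interrupt-enable flag */
};

void sim_pushcli(struct sim_cpu *c) {
    bool was_enabled = c->if_flag;
    c->if_flag = false;            /* like cli */
    if (c->ncli == 0)
        c->intena = was_enabled;   /* remember state at the outermost level */
    c->ncli++;
}

void sim_popcli(struct sim_cpu *c) {
    assert(!c->if_flag);           /* must not be interruptible here */
    assert(c->ncli > 0);           /* every pop needs a matching push */
    c->ncli--;
    if (c->ncli == 0 && c->intena)
        c->if_flag = true;         /* like sti */
}

/* The outgoing task enters the switch with `depth` pushcli levels.
 * The incoming task on the same CPU then pops exactly one level.
 * Returns true if interrupts end up enabled again. */
bool sim_switch_leaves_interrupts_enabled(int depth) {
    struct sim_cpu cpu = { .ncli = 0, .intena = false, .if_flag = true };
    for (int i = 0; i < depth; i++)
        sim_pushcli(&cpu);         /* outgoing task */
    /* The task switch would happen here; ncli is per-CPU, so it stays. */
    sim_popcli(&cpu);              /* incoming task */
    return cpu.if_flag;
}
```

Entering the switch at depth 1 leaves the CPU with interrupts enabled after the incoming task's single pop; entering at depth 2 leaves them disabled, which is the breakage described above.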
