
I'm trying to find the cause of a nasty kernel panic triggered by Chromium Legacy, a project to backport modern versions of Chromium to old versions of macOS (10.7 – 10.10). The kernel panic occurs when the kqueue_scan_continue function is called with the wait_result parameter set to THREAD_RESTART.

In XNU 2422 (OS X 10.9.5), kqueue_scan_continue looks like this:

static void
kqueue_scan_continue(void *data, wait_result_t wait_result)
{
    thread_t self = current_thread();
    uthread_t ut = (uthread_t)get_bsdthread_info(self);
    struct _kqueue_scan * cont_args = &ut->uu_kevent.ss_kqueue_scan;
    struct kqueue *kq = (struct kqueue *)data;
    int error;
    int count;

    /* convert the (previous) wait_result to a proper error */
    switch (wait_result) {
    case THREAD_AWAKENED:
        kqlock(kq);
        error = kqueue_process(kq, cont_args->call, cont_args, &count,
            current_proc());
        if (error == 0 && count == 0) {
            wait_queue_assert_wait((wait_queue_t)kq->kq_wqs,
                KQ_EVENT, THREAD_ABORTSAFE, cont_args->deadline);
            kq->kq_state |= KQ_SLEEP;
            kqunlock(kq);
            thread_block_parameter(kqueue_scan_continue, kq);
            /* NOTREACHED */
        }
        kqunlock(kq);
        break;
    case THREAD_TIMED_OUT:
        error = EWOULDBLOCK;
        break;
    case THREAD_INTERRUPTED:
        error = EINTR;
        break;
    default:
        panic("%s: - invalid wait_result (%d)", __func__,
            wait_result);
        error = 0;
    }

    /* call the continuation with the results */
    assert(cont_args->cont != NULL);
    (cont_args->cont)(kq, cont_args->data, error);
}

It's easy to see why this leads to a kernel panic. The switch statement expects wait_result to be either THREAD_AWAKENED, THREAD_TIMED_OUT, or THREAD_INTERRUPTED. If it's something else, such as THREAD_RESTART, the default case is selected, and the kernel panics.

In macOS Sierra, Apple added an additional case to this switch statement to handle THREAD_RESTART:

    case THREAD_RESTART:
        error = EBADF;
        break;

When I add this code to older kernels and recompile XNU, they no longer panic while running Chromium Legacy.
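To make the difference concrete, here is a minimal user-space sketch of the switch's error mapping before and after the Sierra change. It is not kernel code. The `model_` names, the `MODEL_WOULD_PANIC` sentinel, and the `handles_restart` flag are all made up for illustration, the enum values are assumed to mirror XNU's `wait_result_t` constants, and the `THREAD_AWAKENED` path is reduced to "no error" instead of calling `kqueue_process`.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Illustrative copies of the wait-result codes. The numeric values are an
 * assumption; only their distinctness matters for this sketch. */
enum model_wait_result {
    MODEL_THREAD_AWAKENED    = 0,
    MODEL_THREAD_TIMED_OUT   = 1,
    MODEL_THREAD_INTERRUPTED = 2,
    MODEL_THREAD_RESTART     = 3
};

/* Hypothetical sentinel standing in for panic(); errno values are positive,
 * so -1 cannot collide with a real error code. */
#define MODEL_WOULD_PANIC (-1)

/* Models only the wait_result -> error conversion in kqueue_scan_continue.
 * handles_restart == false models the pre-Sierra switch,
 * handles_restart == true models the switch with the added case. */
int
model_scan_continue_error(int wait_result, bool handles_restart)
{
    switch (wait_result) {
    case MODEL_THREAD_AWAKENED:
        return 0;               /* simplified: the real code processes events */
    case MODEL_THREAD_TIMED_OUT:
        return EWOULDBLOCK;
    case MODEL_THREAD_INTERRUPTED:
        return EINTR;
    case MODEL_THREAD_RESTART:
        if (handles_restart)
            return EBADF;       /* the case added in Sierra */
        return MODEL_WOULD_PANIC; /* pre-Sierra: falls to default */
    default:
        return MODEL_WOULD_PANIC;
    }
}

/* Small usage example: the same input panics in one model and not the other. */
void
model_demo(void)
{
    assert(model_scan_continue_error(MODEL_THREAD_RESTART, false) == MODEL_WOULD_PANIC);
    assert(model_scan_continue_error(MODEL_THREAD_RESTART, true) == EBADF);
}
```

With the extra case, a restarted wait surfaces to the caller as an ordinary `EBADF` error rather than taking down the machine, which matches what I see after patching the older kernels.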

My question is, why did it take Apple until macOS Sierra to handle THREAD_RESTART in this function? THREAD_RESTART is a valid value for wait_result_t, and is returned by various internal kernel functions.

The most obvious explanation is "Apple made a mistake", and that may be all it is! However, it feels like too obvious a mistake to have gone unnoticed for years in highly sensitive kernel code.

Does this look like a simple human error, or is there a reason Apple may have thought that handling THREAD_RESTART was unnecessary? For example, is calling kqueue_scan_continue with THREAD_RESTART supposed to be impossible?


Just for reference, here's the Chromium Legacy GitHub issue where some smart people helped me figure out a lot of the information in this question.

Wowfunhappy
  • I don't really understand what sort of answer you're expecting here? The function in question is internal to the kernel, 3rd party developers will never have direct access to it, so the likelihood of getting anyone at Apple to comment on it publicly is minimal. It sounds like you're hitting this panic in a specific syscall? You don't really say specifically, but I assume this is `kevent()`? A syscall from user space should never cause a kernel panic, so if it does, that's a kernel bug. So does your question boil down to "Why did Apple have a bug in their code?" – pmdj Mar 13 '22 at 13:45
  • Yeah, it basically does boil down to that! I'm hoping that someone with familiarity with XNU can speak to why Apple wouldn't have handled THREAD_RESTART in this function. – Wowfunhappy Mar 13 '22 at 18:05
  • As to the syscall—it's presumably `kevent`, but we don't even know for sure, which is why I'm not actually asking how to fix the problem. It's true that I'm somewhat fishing for information here. – Wowfunhappy Mar 13 '22 at 18:28
  • Apple doesn’t develop XNU in the open, and employees are NDA’d up to the eyeballs, so the chance of getting an authoritative answer is pretty much zero. An oversight, most likely. Presumably when it was originally written, the continuation could never resume with that code. Then something far away in the codebase changed the invariant, this location wasn’t updated, and the situation in which it happened was rare enough not to be caught in Apple’s own testing. – pmdj Mar 13 '22 at 18:32
  • Don’t forget the kqueue/kevent system was ported from FreeBSD, so it wasn’t even written by anyone at Apple to begin with. – pmdj Mar 13 '22 at 18:35
  • `THREAD_RESTART` is a fairly unusual resumption code, so with sufficient sleuthing it should be possible to work out to which situation it corresponds and what triggers it. Presumably some kind of signal sent to the thread or process. – pmdj Mar 13 '22 at 18:38
