
I am working through the MIT 6.828 (2018) course on Operating Systems Engineering, and I like it a lot. It is fun and challenging, and I have learned a lot of basic OS knowledge from it. Now I am struggling with the fine-grained locking challenge in lab 4: https://pdos.csail.mit.edu/6.828/2018/labs/lab4/

But when I run make run-primes-nox CPUS=4, forking the child fails. I suspect that the kernel stack data gets corrupted or overwritten during scheduling.

Sometimes the parent does not recover from the fork system call.

In the scheduler, before making a round, I acquire a lock (lock_scheduler();) to prevent other CPUs from accessing the process list:

    int i = 1, curpos = -1, k = 0;
    if (curenv)
        curpos = ENVX(curenv->env_id);
    lock_scheduler();
    for (; i < NENV; i++)
    {
        k = (i + curpos) % NENV;        // in a circular way
        if (envs[k].env_status == ENV_RUNNABLE)
        {
            env_run(&envs[k]);
        }
    }
    if (curenv != NULL && curenv->env_status == ENV_RUNNING)
    {
        env_run(curenv);
    }

    // sched_halt never returns
    sched_halt();
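For reference, here is a minimal standalone sketch (outside JOS) of the index arithmetic in the loop above. `scan_order` and its parameters are hypothetical names introduced only for illustration; the function records which slots the loop would examine, in order, and does nothing else.

```c
#include <assert.h>

// Hypothetical helper, not part of JOS: records the slots that
// "for (i = 1; i < nenv; i++) k = (i + curpos) % nenv" examines.
// curpos is -1 when there is no current environment.
// Returns the number of slots written to out.
static int scan_order(int curpos, int nenv, int *out)
{
    int n = 0;

    assert(curpos >= -1 && curpos < nenv);
    for (int i = 1; i < nenv; i++)
        out[n++] = (i + curpos) % nenv;   // same circular step as the post
    return n;
}
```

With 4 slots and curpos = 1 the order is 2, 3, 0, so every slot except the current one is examined. With curpos = -1 the order is 0, 1, 2, so the last slot is not examined in that pass.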

In sched_halt, or just before env_run, we release whichever locks this CPU holds:

if (kernel_lock.locked && kernel_lock.cpu == thiscpu)
    unlock_kernel();
if (scheduler_lock.locked && scheduler_lock.cpu == thiscpu)
    unlock_scheduler();
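The "release only if this CPU is the holder" pattern above can be sketched in isolation as follows. This is a hedged illustration using C11 atomics, not the JOS spinlock; `struct owned_lock`, the `owned_lock_*` functions, and the integer CPU ids are all assumptions made for the example.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

// Hypothetical lock that remembers its holder (not the JOS spinlock).
struct owned_lock {
    atomic_int locked;   // 0 = free, 1 = held
    atomic_int owner;    // CPU id of the holder, -1 when free
};

static void owned_lock_init(struct owned_lock *lk)
{
    atomic_init(&lk->locked, 0);
    atomic_init(&lk->owner, -1);
}

// True only if the lock is held AND held by this cpu.
static bool owned_lock_holding(struct owned_lock *lk, int cpu)
{
    return atomic_load(&lk->locked) && atomic_load(&lk->owner) == cpu;
}

static void owned_lock_acquire(struct owned_lock *lk, int cpu)
{
    assert(!owned_lock_holding(lk, cpu));       // no recursive acquire
    while (atomic_exchange(&lk->locked, 1) != 0)
        ;                                       // spin until it was free
    atomic_store(&lk->owner, cpu);              // record the holder
}

// Releases the lock only when this cpu holds it; reports whether it did.
static bool owned_lock_release_if_held(struct owned_lock *lk, int cpu)
{
    if (!owned_lock_holding(lk, cpu))
        return false;
    atomic_store(&lk->owner, -1);               // clear holder first
    atomic_store(&lk->locked, 0);               // then make it available
    return true;
}
```

The owner field is written only by the CPU that has just won the exchange, so a CPU that sees its own id there can rely on it; a CPU that sees another id or -1 simply leaves the lock alone.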

When the kernel is entered through an interrupt or a system call (explicitly with int $0x30), we take the original big kernel lock (BKL), and before leaving the trap, e.g. via env_run, we release it.

void
trap(struct Trapframe *tf)
{
    // The environment may have set DF and some versions
    // of GCC rely on DF being clear
    asm volatile("cld" ::: "cc");

    // Halt the CPU if some other CPU has called panic()
    extern char *panicstr;
    if (panicstr)
        asm volatile("hlt");

    // Re-acquire the big kernel lock if we were halted in
    // sched_yield()
    xchg(&thiscpu->cpu_status, CPU_STARTED);
    // Check that interrupts are disabled.  If this assertion
    // fails, DO NOT be tempted to fix it by inserting a "cli" in
    // the interrupt path.
    assert(!(read_eflags() & FL_IF));
    // only apply in trap
    lock_kernel();
......

Currently:

  • I hold the kernel_lock whenever the kernel is entered from user space
  • I use the page_lock to protect the page_free_list when allocating or freeing memory
  • I acquire the scheduler_lock on entry to sched_yield and release it just before running any user process (env_pop_tf)
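As an illustration of the second point, a free list guarded by its own lock might look like the sketch below. It is a simplified standalone model, not the JOS allocator: `struct page`, `pages_init`, `page_take`, `page_give`, and the `fl_lock`/`fl_unlock` helpers are hypothetical names, and the spinlock is a plain C11 `atomic_flag`.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define NPAGES 8                       // tiny pool, for illustration only

struct page {
    struct page *next;                 // link in the free list
    int refs;                          // reference count
};

static struct page pages[NPAGES];
static struct page *free_list;
static atomic_flag free_list_lock = ATOMIC_FLAG_INIT;

static void fl_lock(void)
{
    while (atomic_flag_test_and_set_explicit(&free_list_lock,
                                             memory_order_acquire))
        ;                              // spin until the flag was clear
}

static void fl_unlock(void)
{
    atomic_flag_clear_explicit(&free_list_lock, memory_order_release);
}

// Put every page on the free list (single-threaded setup).
static void pages_init(void)
{
    free_list = NULL;
    for (int i = NPAGES - 1; i >= 0; i--) {
        pages[i].refs = 0;
        pages[i].next = free_list;
        free_list = &pages[i];
    }
}

// Pop one page; the list is touched only while the lock is held.
static struct page *page_take(void)
{
    fl_lock();
    struct page *p = free_list;
    if (p) {
        free_list = p->next;
        p->next = NULL;
    }
    fl_unlock();
    return p;                          // NULL when the pool is empty
}

// Push a page back; it must be unreferenced and not already listed.
static void page_give(struct page *p)
{
    assert(p->refs == 0 && p->next == NULL);
    fl_lock();
    p->next = free_list;
    free_list = p;
    fl_unlock();
}
```

The critical sections are kept as short as possible: only the pointer updates on the list sit between `fl_lock` and `fl_unlock`, so the lock never overlaps with any other lock in this sketch.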

Sorry if this information is not sufficient; I have uploaded my workspace to GitHub here:

https://github.com/k0Iry/6.828_2018_mit_jos

It contains all my implementations from lab1 through lab4. Thanks for reviewing!

Steps to reproduce the issue:

  1. git clone https://github.com/k0Iry/6.828_2018_mit_jos.git && cd 6.828_2018_mit_jos
  2. wget https://raw.githubusercontent.com/k0Iry/xv6-jos-i386-lab/master/labs/0001-trying-with-fine-grained-locks.patch
  3. git apply 0001-trying-with-fine-grained-locks.patch
  4. make run-primes-nox CPUS=4

You will get the error while the processes are forking.

  • Your code needs to be included directly in your question, not just linked to, so link rot can't make your question unclear for future readers. Leave the link to your full project, but include directly all the code you *think* is critical for people to see to be able to answer to make a [mcve] of your problem. – Peter Cordes Jul 26 '20 at 16:51
  • @PeterCordes I updated the post and added steps to reproduce the issue, thanks for pointing it out and I cannot include the code path here because it is too large and complex to add – Xin Jul 26 '20 at 17:03
  • Then don't expect too many people to be interested in debugging your code for you :/ The point of Stack Overflow is to build a library of *useful* questions with answers that can hopefully help future readers. Debugging a large codebase is very unlikely to be something that will match a problem some future reader is having. This is part of the point of requiring questions to reduce the problem to a [mcve]. – Peter Cordes Jul 26 '20 at 17:04
  • @PeterCordes I just tried to add more code snippet in the post, but I think still the best way is to run it and see, I know what you meant with minimal reproducible example stuff, but in my case (kernel implementation), I just can't provide something minimal, but yeah I tried to paste some code here. hope it helps :( – Xin Jul 26 '20 at 17:27
  • Yup, in practice debugging probably requires someone run it, but that looks like a reasonable effort to make this on topic for Stack Overflow. At least enough that some future reader could hopefully make sense of an answer, whatever the answer turns out to be, and/or maybe find this question by searching on similar code snippets. – Peter Cordes Jul 26 '20 at 17:29
  • @PeterCordes yep, good point :) hope there will be some luck, the truth is that concurrency programming is kind hard to me, especially with very little experience in this scope, I will continue try with this challenge and if I manage to make any progress I will be here to update this post. Well I guess now I have to post it here for some luck... anyway thanks for your comments and next time I know better way to post a question here :P – Xin Jul 26 '20 at 17:33
  • Concurrent programming is probably one of the hardest things in software engineering these days, especially low-level high-performance stuff. (Especially lock-free atomics, but correct fine-grained locking can be plenty hard to get right.) – Peter Cordes Jul 26 '20 at 17:44
  • @PeterCordes yeah I can see that... I just hope if someone else happens to dig this course then I can have a chat or email communication at least :p but I guess I need to move on a bit in the future.. – Xin Jul 26 '20 at 18:05

0 Answers