
Priority inversion is a common and somewhat old problem. Those who have dealt with OS process scheduling, especially under real-time requirements, are familiar with it. There are a few well-known solutions to the problem, each with its pros and cons:

  • Disabling all interrupts to protect critical sections
  • A priority ceiling
  • Priority inheritance
  • Random boosting

It doesn't matter which method is chosen to cope with priority inversion; all of them are relatively easy to implement in the OS kernel, given that applications use a well-defined interface for synchronizing access to shared resources. For instance, if a process locks a mutex with pthread_mutex_lock, the OS is well aware of that fact, because deep down this function makes a system call (futex on Linux). When the kernel serves this request, it has a complete and clear picture of who is waiting on what, and can decide how best to handle priority inversion.
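
For illustration, here is roughly what such a kernel-visible lock looks like underneath. This is a toy sketch loosely following the well-known three-state futex mutex, not the actual glibc implementation:

```c
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <stdatomic.h>

/* The kernel sees every futex call, so it always knows who is blocked on what. */
static long futex(atomic_int *addr, int op, int val)
{
    return syscall(SYS_futex, addr, op, val, NULL, NULL, 0);
}

/* Classic three-state futex mutex: 0 = free, 1 = locked, 2 = locked with waiters. */
static void toy_mutex_lock(atomic_int *m)
{
    int c = 0;
    if (atomic_compare_exchange_strong(m, &c, 1))
        return;                     /* uncontended fast path: no syscall at all */
    if (c != 2)
        c = atomic_exchange(m, 2);  /* announce that there are waiters */
    while (c != 0) {
        futex(m, FUTEX_WAIT, 2);    /* block in the kernel until woken */
        c = atomic_exchange(m, 2);
    }
}

static void toy_mutex_unlock(atomic_int *m)
{
    if (atomic_exchange(m, 0) == 2) /* there were waiters: wake one of them up */
        futex(m, FUTEX_WAKE, 1);
}
```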

Now, imagine that the kernel doesn't know when a process is locking or unlocking a mutex. This can happen, for instance, if an atomic CPU instruction is used to implement the mutex (as in “lock-free” algorithms). Then it becomes possible for a low-priority process to grab the lock and be suspended in favor of a higher-priority task. When that higher-priority task is scheduled, it simply burns the CPU trying to acquire the “spin-lock”. A “deadlock” like that renders the whole system useless.
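
To make it concrete, this is the kind of lock I mean: a bare compare-and-swap spin-lock that never enters the kernel, so the scheduler has no idea a lock is being held or waited on (again, just a toy sketch):

```c
#include <stdatomic.h>

typedef struct { atomic_int locked; } spinlock_t;

static void spin_lock(spinlock_t *l)
{
    int expected = 0;
    /* If a low-priority thread is preempted right after winning this CAS,
     * every higher-priority thread ends up burning CPU in this loop. */
    while (!atomic_compare_exchange_weak(&l->locked, &expected, 1))
        expected = 0;
}

static void spin_unlock(spinlock_t *l)
{
    atomic_store(&l->locked, 0);
}
```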

Given the scenario above, and the fact that we cannot change the program so that it stops using atomic operations to synchronize access to shared resources, the problem boils down to detecting when code is trying to do so.

I have a few somewhat vague heuristic ideas; they are both hard to implement and prone to false positives. Here they are:

  1. Look at the program counter register once in a while and try to detect that the code is simply burning CPU in a tight loop. If the code is spotted at the same place N times, suspend the process and give other, lower-priority processes a chance to run and unlock the mutex (a rough sketch of this follows the list). This method is far from ideal and can produce far too many false positives.
  2. Put a hard limit on how long a process can run. This immediately sacrifices the hard real-time capabilities of the scheduler, but it could work. The problem, however, is that in the "deadlock" cases the high-priority process would waste its whole time window trying to acquire a busy resource.
  3. I don't know if this is even possible, but another idea is to intercept/interpose atomic CPU instructions so that the scheduler becomes aware of locking/unlocking attempts. In other words, essentially turn atomic CPU operations into some sort of system call, somewhat close in mechanics to how a virtual page mapping is created when the MMU signals a page fault.
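
To give an idea of what I mean by the first heuristic, here is a rough sketch that samples the instruction pointer of a monitored process with ptrace from the outside (purely illustrative: the threshold is arbitrary and error handling is omitted):

```c
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/user.h>
#include <sys/wait.h>
#include <unistd.h>

#define SAMPLES   50
#define THRESHOLD 40   /* arbitrary: flag the process if 40 of 50 samples match */

/* Crude "stuck in a tight loop" heuristic: repeatedly read the instruction
 * pointer and count how often it is found at the same address. */
int looks_like_spin(pid_t pid)
{
    unsigned long long last_rip = 0;
    int hits = 0;

    for (int i = 0; i < SAMPLES; i++) {
        struct user_regs_struct regs;

        ptrace(PTRACE_ATTACH, pid, NULL, NULL);
        waitpid(pid, NULL, 0);
        ptrace(PTRACE_GETREGS, pid, NULL, &regs);
        ptrace(PTRACE_DETACH, pid, NULL, NULL);

        if (regs.rip == last_rip)  /* could also use a small address window */
            hits++;
        last_rip = regs.rip;
        usleep(1000);              /* sample roughly every millisecond */
    }
    return hits >= THRESHOLD;
}
```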

What do you think of the above ideas? What other ways of detecting such a code could you possibly think of?

  • I question the presumptions of the question: if you have a lock-free algorithm, no lock is held for more than one atomic operation, and thus there is no spin-locking in lock-free algorithms. In addition, it is a bad idea to use spin-locking outside the kernel. Maybe you should describe your background problem to justify your approach? – Matthias Jun 03 '13 at 06:48
  • @Matthias: I said “as in lock-free” to refer to atomic instructions (like compare-and-swap). It is possible to implement mutual exclusion similar to that provided by `pthread_mutex_t` using those as well. This is not about whether someone should do this, or how good or bad the approach is. I am looking for a way to detect these cases, given that you cannot change or even look at the source code. You may ask why detect it? To avoid priority inversion when scheduling tasks. –  Jun 03 '13 at 12:09
  • No, I am not asking why to detect it; I am asking why anybody would use spin locks at user level. I guess there must be a deeper reason, because it sounds senseless. But knowing the reason behind it, maybe one can find better solutions. – Matthias Jun 03 '13 at 12:22
  • BTW, are you allowed to modify _binaries_? – Matthias Jun 03 '13 at 12:22
  • In addition: Do you try to detect *deadlocks* or *unbounded priority inversions*? These are different problems. – Matthias Jun 03 '13 at 12:24
  • @Matthias: I think the second is the more appropriate description. If, say, all of those tasks were running at the same time, there would have been no deadlocks. As for modifying the binary: I think that would be possible after the code is loaded into memory. –  Jun 03 '13 at 12:28
  • Are you writing the scheduler, or are you just a process? – rlb Jun 04 '13 at 10:00
  • @rlb: This is mostly user-space code that interposes a lot of function/system calls to the kernel and “coordinates” what happens. But having a piece of code running in the kernel is no problem if it helps. –  Jun 04 '13 at 11:46

2 Answers


While I still question your setting (see the comments), I see your third approach as the most promising, since it provides the most precise information. I can think of two mechanisms that follow the main idea:

  1. Assumption: you know the address of the locks. You might find them by inspecting your binary for the typical spin-lock pattern on your system, e.g., loop: LOCK CMPXCHG <adr>; JNZ loop on x86.
    Then, you mark <adr> as "missing" or "not accessible" and hook the MMU service routines (a rough sketch of this follows the list).
  2. Assumption: you may, in addition, change the text segment of your binary.
    Then, you can replace the critical spin-lock with calls to regular mutexes, or with some routine of your own that does the bookkeeping (besides the actual locking).
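
Purely as a sketch of the first mechanism, here is a user-space approximation that revokes access to the page holding a known lock word with mprotect and catches the resulting faults in a SIGSEGV handler. The names are made up, and the hard part (replaying the faulting instruction and re-protecting the page afterwards) is deliberately left out:

```c
#include <signal.h>
#include <sys/mman.h>
#include <stdint.h>
#include <unistd.h>

static void *lock_page;   /* page containing the known lock word */

static void on_fault(int sig, siginfo_t *si, void *ctx)
{
    (void)sig; (void)ctx;
    char *fault = (char *)si->si_addr;
    if (fault >= (char *)lock_page && fault < (char *)lock_page + getpagesize()) {
        /* A lock/unlock attempt has just been observed: this is where the
         * bookkeeping / priority adjustment would go. Re-open the page so the
         * faulting instruction can be replayed; re-protecting it afterwards
         * (e.g. by single-stepping) is omitted here. */
        mprotect(lock_page, getpagesize(), PROT_READ | PROT_WRITE);
    }
    /* Faults at other addresses are real crashes and are not handled here. */
}

void watch_lock_word(void *lock_addr)
{
    struct sigaction sa = {0};
    sa.sa_sigaction = on_fault;
    sa.sa_flags = SA_SIGINFO;
    sigaction(SIGSEGV, &sa, NULL);

    lock_page = (void *)((uintptr_t)lock_addr & ~((uintptr_t)getpagesize() - 1));
    mprotect(lock_page, getpagesize(), PROT_NONE);   /* mark "not accessible" */
}
```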

As a policy, you should prefer priority ceiling over priority inheritance, since it avoids deadlocks as a side effect. You can apply it, since you know the (potential) locks of a thread anyway.
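
For reference, once regular mutexes are in play as in the second mechanism, POSIX exposes the priority-ceiling protocol directly through mutex attributes. A minimal sketch:

```c
/* Sketch: requesting the priority-ceiling protocol for a pthread mutex.
 * PTHREAD_PRIO_PROTECT runs any thread holding the mutex at the ceiling
 * priority, so a preempted low-priority holder cannot block a high-priority
 * waiter indefinitely. */
#include <pthread.h>

pthread_mutex_t m;

void init_ceiling_mutex(int ceiling_priority)
{
    pthread_mutexattr_t attr;

    pthread_mutexattr_init(&attr);
    pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_PROTECT);
    pthread_mutexattr_setprioceiling(&attr, ceiling_priority);
    pthread_mutex_init(&m, &attr);
    pthread_mutexattr_destroy(&attr);
}
```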

For a more elaborate solution, more information on hardware, OS, and toolchain would be needed.

In addition, be aware that the basic approach of using atomic user-level spin-locks may not work for several of today's memory coherence models.

Matthias
  • x86_64 Linux (2.6 and up), running on Intel X5570 but I'd prefer something that works across similar CPUs as well. –  Jun 03 '13 at 13:15
  • Wouldn't marking the page with <adr> invalid mean that any access to unrelated neighbouring data would also cause your MMU hook to fire, which, even if properly handled, would be slow? Couldn't you patch the instruction stream so that the cmpxchg/jnz jumps to a routine that does the same sequence, but includes a timeslice yield on every attempt? (I would also love a pointer to more about your last sentence, for personal reading) – rlb Jun 04 '13 at 12:06
  • @rlb: I think replacing opcodes is a possible way to go. I am going to look into how to achieve that. –  Jun 05 '13 at 01:17
  • @rlb: Your first point is true if the MMU (as is the case for the target platform) provides only demand paging. However, such _false sharing_ only introduces a small overhead. For other architectures, one could use tagged memory to avoid that overhead. Maybe tweaking the memory segments (collecting all locks within one page) can reduce the overhead, too. Your second point does not apply: either you have a `yield` in the sequence, but then you get a different bit pattern (due to a different relative jump distance), or you don't, and then it is a case that should be caught. – Matthias Jun 05 '13 at 06:20

I think your option 1 may have more merit than you give it credit for. I am assuming that you have several processes that you may need to monitor and that you do not know the target addresses of the spinlocks.

Rather than random external sampling, you may find it easier to hook the scheduler entry point and collect your stats at that point; the advantage is that you are in the process address space and the caches are hot. I don't know much about the Linux scheduler, but I have done this sort of thing on OpenVMS in the past. There are often two entry points to a scheduler, voluntary (waiting for IO, etc.) and involuntary; spin-lock issues will almost always be involuntary, so this should reduce your work rate.

Obviously, at this point you have the interrupted PC, but it seems Intel chips also have some performance-monitoring counters you could use, BTS (branch trace store) and maybe PEBS, though these might not be 'free' in performance terms. Information such as a branch trace would very quickly show tight loops, which you could then use to check the actual instructions causing the loop (again, already in cache) and see whether they are conditional atomic instructions or 'normal' work such as summing an array.
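
As a rough sketch of one way to get at such samples on Linux (not BTS/PEBS specifically, just ordinary cycle-based sampling, and assuming a kernel new enough to provide perf_event_open(2)); reading the sampled PCs back requires mmap'ing the ring buffer and walking the records, which is omitted here:

```c
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>
#include <string.h>

/* Open a counter that samples the instruction pointer of the monitored task
 * every 100k CPU cycles. The returned fd's ring buffer then contains
 * PERF_SAMPLE_IP records showing where the task keeps getting interrupted. */
static int open_ip_sampler(pid_t pid)
{
    struct perf_event_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_HARDWARE;
    attr.config = PERF_COUNT_HW_CPU_CYCLES;
    attr.sample_period = 100000;          /* sample every 100k cycles */
    attr.sample_type = PERF_SAMPLE_IP;    /* record the interrupted PC */
    attr.exclude_kernel = 1;

    /* monitor 'pid' on any CPU, no group leader, no flags */
    return syscall(SYS_perf_event_open, &attr, pid, -1, -1, 0);
}
```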

If you didn't write the code, it is always possible that non-interlocked instructions have been used in some way too; hopefully not!

While I think some of the on-chip monitoring functions could really help here, you could also simply check whether the PC was roughly the same at the end of the last M scheduling periods and force the process to skip one period; pretty simple, but not targeted.

While you could do all of this as a second process looking in from outside, it may not be as responsive as a scheduler-based approach, although it is probably far safer and less likely to crash the system. You would still need to get the last PC from the scheduler at any rate, and the second process would need to be of higher or equal priority to the monitored process.

rlb