Solve Cortex M7 priority inversion deadlock by surrendering context from a higher priority IRQ to a lower priority IRQ

Question

Background

I have a custom bare metal mutex primitive written for the STM32F7 (Arm Cortex M7) processor per the Barrier and Litmus Test Cookbook from ARM, using the LDREX and STREX instructions. I use this to control critical sections in my code.

It seems to work well across multiple IRQs with differing priorities, and it avoids priority inversion deadlock by using a timeout loop. If acquisition spins about 100 times without getting the lock, I assume the lock is held by a lower priority context, break out with an error return flag, and return from the IRQ context. This means the critical section for that IRQ instance never runs, though.

Question

I'm wondering if I missed anything in the CMSIS HAL or STM32F7 reference/programming manual that would allow me (or help me) to easily pause execution in the blocked higher priority IRQ, switch context to the lower priority one, finish execution and free the lock, then return to the higher priority one?

Solutions I've considered/tried

Switching to an RTOS is the obvious answer, but it's not an option: this is an existing codebase that I don't entirely own.
I read the sections on "Exception entry and return", etc, and can maybe do it manually with the stack. Seems complex though and I'd have to keep track of which context actually holds the lock.
Ditch the timeout loop and use a separate timer, checking for priority inversion by examining the stack, and raising the lower priority IRQ to allow it to complete and stop blocking the higher priority IRQ. (Complex, and approaching the territory of just writing a scheduler.)
Using WFE and SEV instructions to give up context. I'm not 100% sure, but I don't think this will work the way I want it to, and it seems to be intended more for multiprocessor systems?
Accept that this is as good as it gets without significantly more effort.

Mutex code and Usage

Compiled with gcc-arm-none-eabi using -mcpu=cortex-m7 -mfpu=fpv5-d16 -mfloat-abi=hard -mthumb.

static inline int acquireLock(unsigned int *lock) 
{
  unsigned int tempStore = 0;
  unsigned int lockFlag = 1;
  unsigned int timeout = 0;
  unsigned int result = 0;

  __asm__ volatile(                        //
      "Loop1%=:                      \n\t" // label for main spinlock loop

      // Lock acquisition spin loop.
      "add %[tim], %[tim], #1        \n\t" // add 1 to timeout counter
      "ldrex %[ts], %[lock]          \n\t" // read lock's current state
      "cmp %[ts], #0                 \n\t" // check if 0 (lock is available)
      "it eq                         \n\t" // only try to store if lock is clear
      "strexeq %[ts], %[lf], %[lock] \n\t" // try to grab lock if it is available

      // Loop exit logic block.
      "cmp %[ts], #0                 \n\t" // check we got the lock?
      "beq Loop2%=                   \n\t" // if we got lock, quit loop
      "cmp %[tim], #100              \n\t" // else, check timeout counter
      "bgt Loop2%=                   \n\t" // quit loop if timeout > 100
      "b Loop1%=                     \n\t" // else go back to start of spin loop

      // Check and set return value (success) of lock acquisition
      "Loop2%=:                      \n\t" // label for loop exit
      "cmp %[ts], #0                 \n\t" // check if we got lock (vs timeout)
      "ite eq                        \n\t" // conditional store of return value
      "moveq %[res], #0              \n\t" // return 0 if we got lock
      "movne %[res], #1              \n\t" // else return 1 if we timed out
      "dmb                           \n\t" // mem barrier for later RMWs

      : [ lock ] "+m"(*lock), [ ts ] "+l"(tempStore), [ tim ] "+l"(timeout),
        [ res ] "=l"(result)
      : [ lf ] "l"(lockFlag)
      : "memory");

  return result;
}

and an example of usage, in an IRQ context:

if (acquireLock(&lock) == 0) 
{
  something_critical++;
  releaseLock(&lock);
} 
else 
{
  return;
}
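
For readers who want to experiment with the bounded-spin behaviour off-target, here is a minimal sketch of the same idea using C11 atomics instead of hand-written LDREX/STREX. It is not the code I'm actually running: the names acquireLockC11, releaseLockC11 and LOCK_SPIN_LIMIT are made up for illustration, and I'm assuming (without having checked the generated code) that the compiler lowers the compare-exchange to an exclusive load/store loop on Cortex-M7.

```c
#include <stdatomic.h>

#define LOCK_SPIN_LIMIT 100u /* hypothetical name; mirrors the asm timeout */

/* Try to take the lock, giving up after a bounded number of attempts.
 * Returns 0 if the lock was acquired, 1 if the spin limit was reached. */
static inline int acquireLockC11(atomic_uint *lock)
{
  for (unsigned int spins = 0; spins <= LOCK_SPIN_LIMIT; ++spins) {
    unsigned int expected = 0; /* lock is free when it reads as 0 */
    if (atomic_compare_exchange_weak_explicit(lock, &expected, 1u,
                                              memory_order_acquire,
                                              memory_order_relaxed)) {
      return 0; /* swapped 0 -> 1, so we own the lock */
    }
  }
  return 1; /* still held by someone else: report a timeout */
}

/* Release the lock; the release ordering publishes the critical section. */
static inline void releaseLockC11(atomic_uint *lock)
{
  atomic_store_explicit(lock, 0u, memory_order_release);
}
```

The return convention matches acquireLock above (0 on success, 1 on timeout), so the usage pattern stays the same.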

Deadlock Example

If it helps, here's a backtrace of the deadlock when I disable the timeout counter in the spinlock loop. You can see TIM6 preempted execution of TIM7 while it was in the process of releasing the lock (but hadn't completed yet).

As clearly stated in the post, I don't have control over the architecture at that level, and am looking for solutions given these constraints. Priority inversion deadlock is not an impossible (or even uncommon) problem. If you don't have constructive comments/insight, I would appreciate leaving it open to others that may. — Tegan, Feb 23 '22 at 14:43
The only idea that popped into my head, and I have not thought this through completely, is to use the PendSV exception to continue the high priority IRQ handler when it fails to get the lock. Make the PendSV exception priority lower than the priority of any IRQ exception. Then, when the lock is not acquired, setting up and pending the PendSV exception will allow the high priority exception to end, which lets the lower priority exception complete, and then the PendSV handler runs to execute the "work" of the high priority exception. I don't think it would require any tracking of context. — andy mango, Feb 23 '22 at 19:24
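
A rough sketch of the PendSV deferral described in the comment above, written so the control flow can be stepped through on a host. Everything here is an assumption for illustration: tryLock and unlock are simplified, non-atomic stand-ins for acquireLock and releaseLock, and pendDeferredWork only sets a flag. On the target it would presumably write SCB_ICSR_PENDSVSET_Msk to SCB->ICSR via CMSIS, with PendSV configured to the lowest priority. The sketch also assumes the lock is only ever taken from exception handlers, so that it is free by the time PendSV runs.

```c
#include <stdbool.h>

static volatile unsigned int lock;               /* 0 = free, 1 = held */
static volatile unsigned int something_critical; /* the protected data */
static volatile bool work_deferred;  /* high priority IRQ gave up its work */
static volatile bool pendsv_pending; /* host stand-in for the pending bit  */

/* Hypothetical, non-atomic stand-ins for acquireLock()/releaseLock(). */
static int tryLock(volatile unsigned int *l)
{
  if (*l != 0) {
    return 1; /* already held */
  }
  *l = 1;
  return 0;
}

static void unlock(volatile unsigned int *l) { *l = 0; }

/* Assumption: on the target this would be
 *   SCB->ICSR = SCB_ICSR_PENDSVSET_Msk;
 * Here it only records that PendSV was requested. */
static void pendDeferredWork(void) { pendsv_pending = true; }

/* Body of the high priority IRQ handler. */
static void highPriorityIrqBody(void)
{
  if (tryLock(&lock) == 0) {
    something_critical++;
    unlock(&lock);
  } else {
    /* Lock is held by a preempted lower priority handler: hand the work
     * to PendSV and return so that handler can finish and release it. */
    work_deferred = true;
    pendDeferredWork();
  }
}

/* Body of the PendSV handler (lowest exception priority). */
static void pendSvBody(void)
{
  pendsv_pending = false;
  if (!work_deferred) {
    return;
  }
  /* Clear the flag before doing the work, so a failure in the high
   * priority IRQ while we hold the lock re-arms the deferral. */
  work_deferred = false;
  if (tryLock(&lock) == 0) {
    something_critical++;
    unlock(&lock);
  } else {
    work_deferred = true; /* should not happen under the stated assumption */
  }
}
```

Note that a single flag coalesces several missed events into one deferred run; if every missed increment matters, a counter (updated atomically) would be needed instead.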