
I'm working with a simple system that does NOT have mutexes, but rather a limited array of hardware binary semaphores. Typically, all multithreading has been done with heavy semaphore techniques that make code both poor in performance and difficult to write correctly without deadlocks.

A naive implementation is to use one semaphore globally in order to ensure atomic access to a critical section. However, this means that unrelated objects (even of different types) will block if any critical section is being accessed.

My current solution to this problem is to use a single global semaphore to ensure atomic access to a guard byte, which in turn ensures atomic access to the particular critical section. Here is what I have so far:

while (true) {
    while (mutexLock == Mutex::Locked) {
    } //wait on mutex
    Semaphore semaLock(SemaphoreIndex::Mutex); //RAII semaphore object
    if (mutexLock == Mutex::Unlocked) {
        mutexLock = Mutex::Locked;
        break;
    }
} //Semaphore is released by destructor here
// ... atomically safe code
mutexLock = Mutex::Unlocked;

I have a few questions: Is this the best way to approach this problem? Is this code thread-safe? Is this the same as a "double checked lock"? If so, does it suffer from the same problems and therefore need memory barriers?

EDIT: A few notes on the system this is being implemented on...

It is a RISC 16-bit processor with 32kB RAM. While it has heavy multithreading capabilities, its memory model is very primitive. Loads and stores are atomic, there is no caching, no branch prediction or branch target prediction, one core with many threads. Memory barriers are mostly for the compiler to know it should reload memory into general purpose registers, not for any hardware reason (no cache)

Sam Cristall
  • This is, obviously, non-portable code. So we'd have to know your platform's rules. – David Schwartz Oct 04 '13 at 19:48
  • Which in particular? The semaphore is an atomic lock at the index given when constructed, and the destruction releases across all threads. – Sam Cristall Oct 04 '13 at 19:51
  • Why must you restrict yourself to a global semaphore? – goji Oct 04 '13 at 19:52
  • Knowing its memory visibility rules would be a start. Knowing if it provides memory barriers and, if so, what kinds and when they're needed would be good. Is it multi-processor or multi-core? Does it reorder memory operations? Does it have cores that share execution resources? You need detailed platform knowledge to construct good synchronization primitives. – David Schwartz Oct 04 '13 at 19:53
  • For example, if this is a modern x86 platform, this code is a *disaster*. Your loop on the `mutexLock` will cause the CPU to mispredict the branch when `mutexLock` is released. So at the worst possible time, when you absolutely need to go as quickly as possible, you completely blow up the pipelines. And what about hyper-threading? That loop will consume CPU resources like mad. What if the lock is held by the other virtual core? How is it supposed to make forward progress efficiently with that loop consuming limited execution resources? You need detailed platform knowledge to do this right. – David Schwartz Oct 04 '13 at 19:57
  • This is on a simple 16-bit RISC MCU with 32kB RAM. There's no cache, no branch prediction, no branch target prediction, no hyperthreading, no operation re-ordering. The processor has many threads sharing pipeline hardware, but they are independent. – Sam Cristall Oct 04 '13 at 20:03
  • @Troy That's the only hardware support for critical sections (no mutexes or locks). I don't believe there are any software constructs that can emulate what the binary semaphores do (since any software version would be non-atomic) – Sam Cristall Oct 04 '13 at 20:30
  • With 32KB RAM you are running on a minimal system, certainly not a big OS, so you are probably pretty close to the hardware. You need a test-and-set operation, and you probably want to implement your own semaphores/mutexes. Use the limited hardware binary semaphores to protect your own group of mutexes and semaphores, and protect access to them with the hardware semaphore. Then you can protect your code sections with these derivative resources. – ChuckCottrill Oct 05 '13 at 06:11
  • @ChuckCottrill That's kind of what I'm shooting for with the code I posted. My processor does not have an atomic test and set, so I'm checking the mutex guard, then creating a pseudo atomic test and set using the hardware semaphore. Does that kind of make sense? – Sam Cristall Oct 05 '13 at 18:57
  • The `while (mutexLock == Mutex::Locked)` seems problematic. If the threads are prioritized and a higher priority thread reaches that statement while a lower priority thread is in the critical section, a deadlock will occur. If the threads are not prioritized, then the thread should yield inside that loop. – D Krueger Oct 07 '13 at 16:28

2 Answers


Even if the processor has heavy multithreading capabilities (I'm not sure what you mean by that, by the way), a single-core processor still means that only one thread can execute at any one time.

A task-switching system must be employed (which would normally run in a privileged mode). Using this system, you must be able to define critical sections that atomically execute a (software-implemented) mutex lock/unlock.

When you say "one core, many threads" does that mean that you have some kind of kernel running on your processor? The task switching system will be implemented by that kernel. It might pay to look through the documentation of your kernel or ask your vendor for any tips.

Good luck.

  • We actually *are* the vendor haha, and I'm implementing a library to help with multi-threading on it. There is no kernel, as it is a simple MCU with the novel addition that up to N instructions are executed in parallel per instruction tick. This is implemented in hardware, not software, with no true task switching. Unfortunately, the hardware only comes with N*2 semaphore locks for handling multi-threading, which is not typically enough, and thus spawned this question. – Sam Cristall Nov 27 '13 at 15:41

Right now we still don't have that much information about your system (for example: what kind of registers are available to each instruction in parallel? Do you use a banked architecture? How many simultaneous instructions can you actually execute?), but hopefully what I suggest will help you.

If I understand your situation, you have a piece of hardware that does not have true cores, but simply MIMD ability via a vectorized operation (based on your reply). It is a "RISC 16-bit processor with 32kB RAM" where:

Loads and stores are atomic, there is no caching, no branch prediction or branch target prediction, one core with many threads

The key here is that loads and stores are atomic. Note that you won't be able to do loads and stores larger than 16 bits atomically, since they will be compiled to two separate atomic instructions (and the pair is therefore not atomic itself).

Here is the functionality of a mutex:

  • attempt to lock
  • unlock

To lock, you might run into issues if several resources attempt to lock at once. For example, say in your hardware N = 4 (the number of processes run in parallel). If instruction 1 (I1) and I2 try to lock, they will both be successful in locking: since your loads and stores are atomic, both processes see "unlocked" at the same time, and both acquire the lock.

This means you can't do the following:

 if mutex_1 unlocked:
      lock mutex_1

which might look like this in an arbitrary assembly language:

 load arithmetic mutex_addr
 or arithmetic immediate(1)  // unlocked = 0000 or 1 = 0001, locked = 0001 or 1 = 0001 
 store mutex_addr arithmetic // not putting in conditional label to provide better synchronization. 
 jumpifzero MUTEXLABEL arithmetic 

To get around this you will need to have each "thread" either know whether someone else is currently trying to get the lock it wants, or avoid simultaneous lock access entirely.

I only see one way this can be done on your system (via flag/mutex ID checking). Have a mutex ID associated with each thread for each mutex it is currently checking, and check all other threads to see if you can actually acquire the lock. Your binary semaphores don't really help here, because you would need to associate an individual mutex with a semaphore to use one (and you would still have to load the mutex from RAM).

A simple implementation has every thread check on unlock and lock. Each mutex has an ID and a state; to avoid race conditions per instruction, the mutex currently being handled is identified well before it is actually acquired. Having "identify which lock you want" and "actually try to get the lock" come as two separate steps stops accidental acquisition on simultaneous access. With this method you can have 2^16 - 1 mutexes (because 0 is used to say no lock is being attempted), and your "threads" can exist on any instruction pipe.

// init to zero
volatile uint16_t CURRENT_LOCK_ATTEMPT[NUM_THREADS] = {0};
// make thread id associated with priority
bool tryAcqureLock(uint16_t mutex_id, bool& mutex_lock_state){
    if(mutex_lock_state == false){
        // Do not actually attempt to take the lock until everything is checked.
        // No race condition can happen now: you haven't actually set the lock,
        // and if two threads attempt to acquire the same lock at the same time,
        // each will be able to see that someone else is as well.
        CURRENT_LOCK_ATTEMPT[MY_THREAD_ID] = mutex_id;
        // Checking all lower threads; some predetermined priority
        // is needed to arbitrate locking.
        for( int i = 0; i < MY_THREAD_ID; i++ ){
            if(CURRENT_LOCK_ATTEMPT[i] == mutex_id){
                // back off: clear our announced attempt
                CURRENT_LOCK_ATTEMPT[MY_THREAD_ID] = 0;
                return false;
            }
        }
        // make sure to lock the mutex before clearing which mutex you are handling
        mutex_lock_state = true;
        CURRENT_LOCK_ATTEMPT[MY_THREAD_ID] = 0;
        return true;
    }
    return false;
}

// It's your fault if you didn't make sure you owned the lock in the first place.
// If you did own it, there's no race condition, because of atomic load/store.
// If you happen to clear the state while another instruction is attempting to
// acquire the lock, it simply won't get the lock and no race condition occurs.
void unlock(bool& mutex_lock_state){
    mutex_lock_state = false;
}

If you want more equal access to resources, you could change the indexing: instead of going from i = 0 to i < MY_THREAD_ID, randomly pick a starting point and circle back around to MY_THREAD_ID using modulo arithmetic. I.e.:

// requires <random> for std::minstd_rand0
bool tryAcqureLock(uint16_t mutex_id, bool& mutex_lock_state, uint16_t per_mutex_random_seed){
    if(mutex_lock_state == false){

        CURRENT_LOCK_ATTEMPT[MY_THREAD_ID] = mutex_id;
        // need a per-thread linear congruential generator for speed and consistency
        std::minstd_rand0 random(per_mutex_random_seed);
        for(int i = random() % TOTAL_NUM_THREADS; i != MY_THREAD_ID; i = (i + 1) % TOTAL_NUM_THREADS)
        {
            // same checks as before
        }
        // if we actually acquired the lock
        GLOBAL_SEED = global_random(); // use another generator to set the next seed to be used
        mutex_lock_state = true;
        CURRENT_LOCK_ATTEMPT[MY_THREAD_ID] = 0;
        return true;
    }
    return false;
}

In general, your lack of a test-and-set instruction really throws a wrench into everything, meaning you are forced to use other algorithms for mutexes. For more information on algorithms for architectures without test-and-set, check out this SO post and the Wikipedia algorithms that rely only on atomic loads and stores (e.g. Dekker's, Peterson's, or Eisenberg & McGuire).

All of these algorithms basically decompose into checking a set of flags to see if you can access the resource safely, by going through everyone else's flags.

Krupip
  • Question is quite old but I appreciate the response! Is there not a race in the code though, or am I not seeing something? Let's say `MY_THREAD_ID` is extremely large for a thread, and it enters the loop passing `CURRENT_LOCK_ATTEMPT[0]`, seeing it is not taken, and continues. While it is looping, another thread with `MY_THREAD_ID == 0` enters and immediately resolves and takes the mutex. Then `MY_THREAD_ID == Large` finishes its loop and also takes the mutex, having not seen the change to `CURRENT_LOCK_ATTEMPT[0]`. – Sam Cristall Jul 07 '17 at 16:32
  • @SamCristall Crap, you're right... I've solved similar problems before, but it's been a long time since I've used this; I wrote code similar to this, based on the Eisenberg & McGuire algorithm, a long time ago. Even adding back a statement that checks "if it's been set since we tried" still gets you a race condition. I'll re-write when I get the chance to sit down with this again. I'll ping you here when I do. – Krupip Jul 07 '17 at 17:00