We still don't have much information about your system (for example: which registers are available to each instruction running in parallel? Do you use a banked architecture? How many instructions can you actually execute simultaneously?), but hopefully what I suggest will help.
If I understand your situation, you have a piece of hardware that does not have true cores, only MIMD ability via a vectorized operation (based on your reply). It is a "RISC 16-bit processor with 32kB RAM" where:
Loads and stores are atomic, there is no caching, no branch prediction or branch target prediction, one core with many threads
The key here is that loads and stores are atomic. Note that you won't be able to do loads and stores wider than 16 bits atomically, since they will be compiled to two separate atomic instructions (so the combined operation is not itself atomic).
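To illustrate the point about wide accesses, here is a minimal C++ sketch that replays one unlucky interleaving by hand. The `Wide32` struct, the helper names, and the low-half-first store order are assumptions made for illustration only; your compiler may emit the two halves in the other order.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of a 32-bit variable on a 16-bit machine:
// it occupies two separately addressable 16-bit words.
struct Wide32 {
    uint16_t lo;
    uint16_t hi;
};

// Each of these two functions stands for ONE atomic 16-bit store.
void storeLow(Wide32& w, uint32_t value)  { w.lo = static_cast<uint16_t>(value & 0xFFFFu); }
void storeHigh(Wide32& w, uint32_t value) { w.hi = static_cast<uint16_t>(value >> 16); }

// A 32-bit load is likewise two 16-bit loads glued together.
uint32_t load32(const Wide32& w) {
    return (static_cast<uint32_t>(w.hi) << 16) | w.lo;
}

// Replays one interleaving: a reader runs between the two halves of a
// writer's 32-bit store. Returns the value the reader observed.
uint32_t observeTornWrite(uint32_t oldValue, uint32_t newValue) {
    Wide32 w{static_cast<uint16_t>(oldValue & 0xFFFFu),
             static_cast<uint16_t>(oldValue >> 16)};
    storeLow(w, newValue);             // writer: first 16-bit store (assumed order)
    uint32_t seen = load32(w);         // reader: sneaks in between the halves
    storeHigh(w, newValue);            // writer: second 16-bit store
    assert(load32(w) == newValue);     // once both halves land, the value is whole
    return seen;
}
```

With `oldValue = 0x0000FFFF` and `newValue = 0x00010000`, the reader observes `0`, which is neither the old nor the new value. This is why the lock state and the per-thread entries should each fit in a single 16-bit word.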
Here is the functionality of a mutex:
To lock, you might run into issues if each resource attempts to lock. For example, say in your hardware N = 4 (the number of processes to be run in parallel). If instruction 1 (I1) and I2 try to lock at the same time, they will both succeed. Each load and each store is atomic on its own, but the load-check-store sequence is not, so both processes see "unlocked" at the same time, and both acquire the lock.
This means you can't do the following:
if mutex_1 unlocked:
    lock mutex_1
which might look like this in an arbitrary assembly language:
load arithmetic mutex_addr
or arithmetic immediate(1) // unlocked = 0000 or 1 = 0001, locked = 0001 or 1 = 0001
store mutex_addr arithmetic // not putting in conditional label to provide better synchronization.
jumpifzero MUTEXLABEL arithmetic
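To make the failure concrete, the following C++ sketch replays the naive load-then-store lock for two threads under two schedules. It is a hand-stepped simulation, not real concurrency, and the `NaiveLocker` struct and function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical per-thread state for replaying the naive lock by hand.
struct NaiveLocker {
    uint16_t reg = 0;       // the thread's private "arithmetic" register
    bool acquired = false;  // does this thread believe it owns the lock?
};

// Step 1: one atomic load of the mutex word into the private register.
void stepLoad(NaiveLocker& t, const uint16_t& mutex) { t.reg = mutex; }

// Step 2: if the loaded value said "unlocked" (0), one atomic store of 1.
void stepStoreIfUnlocked(NaiveLocker& t, uint16_t& mutex) {
    if (t.reg == 0) {
        mutex = 1;
        t.acquired = true;
    }
}

// Replays two threads; `interleaved` selects the schedule.
// Returns how many threads believe they own the lock afterwards.
int countOwners(bool interleaved) {
    uint16_t mutex = 0;  // 0 = unlocked, 1 = locked
    NaiveLocker a, b;
    if (interleaved) {
        stepLoad(a, mutex);             // a sees 0
        stepLoad(b, mutex);             // b also sees 0
        stepStoreIfUnlocked(a, mutex);  // a "acquires"
        stepStoreIfUnlocked(b, mutex);  // b "acquires" too
    } else {
        stepLoad(a, mutex);
        stepStoreIfUnlocked(a, mutex);
        stepLoad(b, mutex);             // b sees 1 and backs off
        stepStoreIfUnlocked(b, mutex);
    }
    assert(mutex == 1);  // either way the mutex word ends up locked
    return (a.acquired ? 1 : 0) + (b.acquired ? 1 : 0);
}
```

`countOwners(false)` yields one owner, while `countOwners(true)` yields two: every individual memory access was atomic, yet mutual exclusion was still lost.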
To get around this, each "thread" needs to either know whether someone else is currently trying to get the same lock, or avoid simultaneous lock access entirely.
I only see one way this can be done on your system: flag/mutex ID checking. Associate with each thread the ID of the mutex it is currently trying to take, and check every other thread's entry to see whether you can actually acquire the lock. Your binary semaphores don't really help here, because you would need to associate an individual mutex with each semaphore to use them (and you would still have to load the mutex from RAM).
Here is a simple implementation of this check-every-thread lock and unlock. Each mutex has an ID and a state. To avoid race conditions between instructions, the mutex being handled is announced well before it is actually acquired. Splitting "identify which lock you want to use" and "actually try to get the lock" into two steps is meant to stop accidental acquisition on simultaneous access. With this method you can have 2^16-1 mutexes (0 is reserved to mean that no lock attempt is in progress), and your "threads" can exist on any instruction pipe.
#include <cstdint> // uint16_t

// init to zero (0 means "no lock attempt in progress")
volatile uint16_t CURRENT_LOCK_ATTEMPT[NUM_THREADS]{0};

// the thread id doubles as the priority: lower ids win
bool tryAcqureLock(uint16_t mutex_id, bool& mutex_lock_state){
    if(mutex_lock_state == false){
        // Do not actually take the lock until everything has been checked.
        // The lock has not been set at this point, so if two threads attempt
        // to acquire the same lock at the same time, the intent is that each
        // of them can see that someone else is trying as well.
        CURRENT_LOCK_ATTEMPT[MY_THREAD_ID] = mutex_id;
        // check all lower threads; some predetermined priority is
        // needed to decide who gets the lock
        for( int i = 0; i < MY_THREAD_ID; i++ ){
            if(CURRENT_LOCK_ATTEMPT[i] == mutex_id){
                // back off: clear our own attempt entry
                CURRENT_LOCK_ATTEMPT[MY_THREAD_ID] = 0;
                return false;
            }
        }
        // make sure to lock the mutex before clearing which mutex you are currently handling
        mutex_lock_state = true;
        CURRENT_LOCK_ATTEMPT[MY_THREAD_ID] = 0;
        return true;
    }
    return false;
}

// It's your fault if you didn't make sure you owned the lock in the first place.
// If you did own it, the store itself cannot race, because stores are atomic.
// If you happen to set the state while another instruction is attempting to
// acquire the lock, it simply won't get the lock this time around.
void unlock(bool& mutex_lock_state){
    mutex_lock_state = false;
}
If you want more equal access to resources, you could change the indexing: instead of scanning from i = 0 to i < MY_THREAD_ID, randomly pick a "starting point" and circle around to MY_THREAD_ID using modulo arithmetic, i.e.:
#include <random> // std::minstd_rand0

bool tryAcqureLock(uint16_t mutex_id, bool& mutex_lock_state, uint16_t per_mutex_random_seed){
    if(mutex_lock_state == false){
        CURRENT_LOCK_ATTEMPT[MY_THREAD_ID] = mutex_id;
        // need a per-thread linear congruential generator for speed and consistency
        std::minstd_rand0 random(per_mutex_random_seed);
        for(int i = random() % TOTAL_NUM_THREADS; i != MY_THREAD_ID; i = (i + 1) % TOTAL_NUM_THREADS)
        {
            // same as before
        }
        // if we actually acquired the lock
        GLOBAL_SEED = global_random(); // use another generator to set the next seed to be used
        mutex_lock_state = true;
        CURRENT_LOCK_ATTEMPT[MY_THREAD_ID] = 0;
        return true;
    }
    return false;
}
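To see which entries the rotated loop actually visits, here is a small stand-alone C++ sketch of just the index arithmetic. `scanOrder` is a hypothetical helper, and `start` plays the role of `random() % TOTAL_NUM_THREADS`.

```cpp
#include <cassert>
#include <vector>

// Returns the thread indices visited by the rotated scan, in order.
// Both `start` and `myThreadId` must lie in [0, totalThreads),
// otherwise the loop below would never terminate.
std::vector<int> scanOrder(int start, int myThreadId, int totalThreads) {
    assert(totalThreads > 0);
    assert(start >= 0 && start < totalThreads);
    assert(myThreadId >= 0 && myThreadId < totalThreads);

    std::vector<int> visited;
    // Walk forward from `start`, wrapping around, until we reach ourselves.
    for (int i = start; i != myThreadId; i = (i + 1) % totalThreads) {
        visited.push_back(i);
    }
    return visited;
}
```

For example, with four threads, thread 1 starting at index 2 visits 2, 3, 0 in that order. If the random start happens to equal the thread's own ID, it visits nobody, so which threads get checked varies from attempt to attempt.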
In general, your lack of a test-and-set instruction really throws a wrench into everything, meaning you are forced to use other algorithms for mutexes. For more information on algorithms you can use on architectures without test-and-set, check out this SO post and the Wikipedia articles on mutual-exclusion algorithms that rely only on atomic loads and stores.
All of these algorithms basically boil down to checking a set of flags, going through everyone else's flags to see whether you can access the resource safely.
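As one concrete instance of such a flag-scanning scheme, here is a minimal C++ sketch of Lamport's bakery algorithm, a well-known mutual-exclusion algorithm that needs only atomic loads and stores. The names, the thread count of 4, and the use of `std::atomic` (with its default sequentially consistent ordering) as a stand-in for your uncached atomic 16-bit accesses are all assumptions of this sketch, not something taken from your hardware.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Assumed thread count for this sketch.
constexpr int kNumThreads = 4;

// Static storage, so both arrays start out zeroed (false / 0).
std::atomic<bool>     choosing[kNumThreads]; // true while a thread picks a ticket
std::atomic<uint16_t> ticket[kNumThreads];   // 0 means "not interested"

void bakeryLock(int me) {
    assert(me >= 0 && me < kNumThreads);

    // Doorway: take a ticket one larger than any ticket currently held.
    // Note: a uint16_t ticket can wrap if the lock is never idle for long.
    choosing[me].store(true);
    uint16_t highest = 0;
    for (int i = 0; i < kNumThreads; ++i) {
        uint16_t t = ticket[i].load();
        if (t > highest) highest = t;
    }
    ticket[me].store(static_cast<uint16_t>(highest + 1));
    choosing[me].store(false);

    // Wait for every thread whose (ticket, id) pair is smaller than ours.
    for (int i = 0; i < kNumThreads; ++i) {
        if (i == me) continue;
        while (choosing[i].load()) { /* thread i is still picking a ticket */ }
        for (;;) {
            uint16_t other = ticket[i].load();
            uint16_t mine  = ticket[me].load();
            if (other == 0) break;                                 // i is not competing
            if (other > mine || (other == mine && i > me)) break;  // we go first
        }
    }
}

void bakeryUnlock(int me) {
    assert(me >= 0 && me < kNumThreads);
    ticket[me].store(0); // a single atomic store releases the lock
}
```

Each thread calls `bakeryLock(id)` before its critical section and `bakeryUnlock(id)` after it. The cost is the same as in the code above: every acquisition scans every other thread's entries, and waiting threads spin.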