The key to doing almost impossible stuff like this is to use InterlockedCompareExchange. (This is the name Win32 uses but any multithreaded-capable platform will have an InterlockedCompareExchange equivalent).
The idea is, you make a copy of the structure (which must be small enough to perform an atomic read (64 or if you can handle some unportability, 128 bits on x86).
You make another copy with your proposed update, do your logic and update the copy, then you update the "real" structure using InterlockedCompareExchange. What InterlockedCompareExchange does is, atomically make sure the value is still the value you started with before your state update, and if it is still that value, atomically updates the value with the new state. Generally this is wrapped in an infinite loop that keeps trying until someone else hasn't changed the value in the middle. Here is roughly the pattern:
union State
{
struct
{
short a;
short b;
};
uint32_t volatile rawState;
} state;
void DoSomething()
{
// Keep looping until nobody else changed it behind our back
for (;;)
{
state origState;
state newState;
// It's important that you only read the state once per try
origState.rawState = state.rawState;
// This must copy origState, NOT read the state again
newState.rawState = origState.rawState;
// Now you can do something impossible to do atomically...
// This example takes a lot of cycles, there is huge
// opportunity for another thread to come in and change
// it during this update
if (newState.b == 3 || newState.b % 6 != 0)
{
newState.a++;
}
// Now we atomically update the state,
// this ONLY changes state.rawState if it's still == origState.rawState
// In either case, InterlockedCompareExchange returns what value it has now
if (InterlockedCompareExchange(&state.rawState, newState.rawState,
origState.rawState) == origState.rawState)
return;
}
}
(Please forgive if the above code doesn't actually compile - I wrote it off the top of my head)
Great. Now you can make lockless algorithms easy. WRONG! The trouble is that you are severely limited on the amount of data that you can update atomically.
Some lockless algorithms use a technique where they "help" concurrent threads. For example, say you have a linked list that you want to be able to update from multiple threads, other threads can "help" by performing updates to the "first" and "last" pointers if they are racing through and see that they are at the node pointed to by "last" but the "next" pointer in the node is not null. In this example, upon noticing that the "last" pointer is wrong, they update the last pointer, only if it still points to the current node, using an interlocked compare exchange.
Don't fall into a trap where you "spin" or loop (like a spinlock). While there is value in spinning briefly because you expect the "other" thread to finish something - they may not. The "other" thread may have been context switched and may not be running anymore. You are just eating CPU time, burning electricity (killing a laptop battery perhaps) by spinning until a condition is true. The moment you begin to spin, you might as well chuck your lockless code and write it with locks. Locks are better than unbounded spinning.
Just to go from hard to ridiculous, consider the mess you can get yourself into with other architectures - things are generally pretty forgiving on x86/x64, but when you get into other "weakly ordered" architectures, you get into territory where things happen that make no sense - memory updates won't happen in program order, so all your mental reasoning about what the other thread is doing goes out the window. (Even x86/x64 have a memory type called "write combining" which is often used when updating video memory but can be used for any memory buffer hardware, where you need fences) Those architectures require you to use "memory fence" operations to guarantee that all reads/writes/both before the fence will be globally visible (by other cores). A write fence guarantees that any writes before the fence will be globally visible before any writes after the fence. A read fence will guarantee that no reads after the fence will be speculatively executed before the fence. A read/write fence (aka full fence or memory fence) will make both guarantees. Fences are very expensive. (Some use the term "barrier" instead of "fence")
My suggestion is to implement it first with locks/condition variables. If you have any trouble with getting that working perfectly, it's hopeless to attempt doing a lockless implementation. And always measure, measure, measure. You'll probably find the performance of the implementation using locks is perfectly fine - without the incertainty of some flaky lockless implementation with a natsy hang bug that will only show up when you're doing a demo to an important customer. Perhaps you can fix the problem by redefining the original problem into something more easily solved, perhaps by restructuring the work so bigger items (or batches of items) are going into the collection, which reduces the pressure on the whole thing.
Writing lockless concurrent algorithms is very difficult (as you've seen written 1000x elsewhere I'm sure). It is often not worth the effort either.