I wrote a simple lock-free node stack (Delp[hi XE4, Win7-64, 32-bit app) where I can have multiple 'stacks' and pop/push nodes between them from various threads simultaneously. It works 99.999% of the time but eventually glitch under a stress test using all CPU cores.
Stripped-down, it comes down to this (not real/compiled code): Nodes are :
type POBNode = ^TOBNode;
[volatile]TOBNode = record
[volatile]Next : POBNode;
Data : Int64;
end;
Simplified stack :
type TOBStack = class
private
[volatile]Head:POBNode;
function Pop:POBNode;
procedure Push(NewNode:POBNode);
end;
procedure TOBStack.Push(NewNode:POBNode);
var zTmp : POBNode;
begin;
repeat
zTmp:=InterlockedCompareExchangePointer(Pointer(Head),nil,nil);(memory fenced-read*)
NewNode.Next:=zTmp;
if InterlockedCompareExchangePointer(Head,NewNode,zTmp)=zTmp
then break (*success*)
else continue;
until false;
end;
function TOBStack.Pop:POBNode;
begin;
repeat
Result:=InterlockedCompareExchangePointer(Pointer(Head),nil,nil);(memory fenced-read*)
if Result=nil
the exit;
NewHead:=Result.Next;
if InterlockedCompareExchangePointer(Pointer(Head),NewHead,Result)=Result
then break (*Success*)
else continue;(*Fail, try again*)
until False;
end;
I have tried many variations on this but cannot get it to be stable. If I create a few threads that each have a stack and they all push/pop to/from a global stack, it eventually glitch, but not quickly. Only when I stress it for minutes on end from multiple threads, in tight loops.
I cannot have hidden bugs in my code, so I need more advice than to ask a specific question to get this running 100% error-free, 24/7. Does the code above look fine for a lock-free thread-safe stack? What else can I look at? This is impossible to debug normally as the errors occur at various places, telling me there is a pointer or RAM corruption happening somewhere. I also get duplicate entries, meaning that a node that was popped of one stack, then later returned to that same stack, is still on top of the old stack... Impossible to happen according to my algorithm? This lead me to believe it's possible to violate Delphi/Windows InterlockedCompareExchange Methods or there is some hidden knowledge I have yet to be revealed. :) (I also tried TInterlocked)
I have made a complete test case that can be copied from ftp.true.co.za. In there I run 8 threads doing 400 000 push/pops each and it usually crashes (safely due to checks/raised exceptions) after a few cycles of these tests, sometimes many many test cycles complete before one suddenly crash.
Any advice would be appreciated.
Regards Anton E