1

I wrote a simple lock-free node stack (Delp[hi XE4, Win7-64, 32-bit app) where I can have multiple 'stacks' and pop/push nodes between them from various threads simultaneously. It works 99.999% of the time but eventually glitch under a stress test using all CPU cores.

Stripped-down, it comes down to this (not real/compiled code): Nodes are :

type      POBNode          = ^TOBNode;
[volatile]TOBNode          = record
                    [volatile]Next   : POBNode;
                              Data   : Int64;
                             end;

Simplified stack :

type TOBStack  = class
                  private
                   [volatile]Head:POBNode; 
                  function Pop:POBNode;
                  procedure Push(NewNode:POBNode);
                 end;

procedure TOBStack.Push(NewNode:POBNode);
var zTmp : POBNode;
begin;
    repeat
     zTmp:=InterlockedCompareExchangePointer(Pointer(Head),nil,nil);(memory fenced-read*)
     NewNode.Next:=zTmp;
     if InterlockedCompareExchangePointer(Head,NewNode,zTmp)=zTmp
        then break (*success*)
        else continue;
    until false;
end;

function TOBStack.Pop:POBNode;
begin;
 repeat
  Result:=InterlockedCompareExchangePointer(Pointer(Head),nil,nil);(memory fenced-read*)
  if Result=nil
     the exit;
  NewHead:=Result.Next;
  if InterlockedCompareExchangePointer(Pointer(Head),NewHead,Result)=Result
     then break (*Success*)
     else continue;(*Fail, try again*)
 until False;
end; 

I have tried many variations on this but cannot get it to be stable. If I create a few threads that each have a stack and they all push/pop to/from a global stack, it eventually glitch, but not quickly. Only when I stress it for minutes on end from multiple threads, in tight loops.

I cannot have hidden bugs in my code, so I need more advice than to ask a specific question to get this running 100% error-free, 24/7. Does the code above look fine for a lock-free thread-safe stack? What else can I look at? This is impossible to debug normally as the errors occur at various places, telling me there is a pointer or RAM corruption happening somewhere. I also get duplicate entries, meaning that a node that was popped of one stack, then later returned to that same stack, is still on top of the old stack... Impossible to happen according to my algorithm? This lead me to believe it's possible to violate Delphi/Windows InterlockedCompareExchange Methods or there is some hidden knowledge I have yet to be revealed. :) (I also tried TInterlocked)

I have made a complete test case that can be copied from ftp.true.co.za. In there I run 8 threads doing 400 000 push/pops each and it usually crashes (safely due to checks/raised exceptions) after a few cycles of these tests, sometimes many many test cycles complete before one suddenly crash.

Any advice would be appreciated.

Regards Anton E

AntonE
  • 53
  • 3
  • InterlockedCompareExchange works perfectly fine if you pass it correctly aligned adress. For a working lock-free stack, queue and dynamic queue, use the OtlContainers unit from the www.omnithreadlibrary.com project. – gabr Jan 20 '14 at 11:11
  • The `InterlockedXXX` functions are known to work. The problem is that your code does not. – David Heffernan Jan 20 '14 at 11:18
  • Where is the documentation for this `[volatile]` attribute? – David Heffernan Jan 20 '14 at 11:21
  • 1
    @DavidHeffernan, see [`New Delphi Compiler Attributes`](http://docwiki.embarcadero.com/RADStudio/XE4/en/What's_New_in_Delphi_and_C%2B%2BBuilder_XE4#New_Delphi_Compiler_Attributes). – LU RD Jan 20 '14 at 11:37
  • @LURD Do you know more about this? As I understand it, variables under the x86 and x64 compilers have always been `[volatile]`. Does this only affect the ARM compilers? – David Heffernan Jan 20 '14 at 11:39
  • @DavidHeffernan, I don't know more than this. Perhaps scanning the source would reveal something. – LU RD Jan 20 '14 at 11:41
  • @LURD Well, it would need surely to be the source of the compiler, and we don't have that. As I understand it, on x86 and x64 compilers the use of `[volatile]` is not necessary, and indeed the memory fenced reads in the code above are not needed because of the strong x86 and x64 memory model. – David Heffernan Jan 20 '14 at 11:43
  • @DavidHeffernan, in `System.Pas` the [volatile] attribute is used in TMonitor,TObject and some more classes. Protecting FRefCount variables. It is probably something ARC brought in, but all platforms seems to be affected. – LU RD Jan 20 '14 at 11:55
  • 2
    [volatile] is a no-op with the old compiler, as far as I know (citation needed). Win32 and Win64 are therefore not affected by it. – gabr Jan 20 '14 at 12:07
  • 3
    Without spending much time, I can say that your code suffers at least from the ABA problem (http://en.wikipedia.org/wiki/ABA_problem). Please use existing and well-tested code (e.g. OmniThreadLibrary), it is free. – gabr Jan 20 '14 at 12:08
  • @LURD I'm sure that gabr is correct. The x86 and x64 compilers have never optimised non-local variables into registers. – David Heffernan Jan 20 '14 at 12:12
  • 1
    As @gabr says, using proven code is a good move here. Testing multi-threaded code is hard. Expect your tests to be inadequate. Did you know that Windows ships with the very code that you need: http://msdn.microsoft.com/en-us/library/windows/desktop/ms684121.aspx – David Heffernan Jan 20 '14 at 20:07
  • As I understand it the [volatile] has to do with visibility. e.g. in a tight loop a variable might be optimized by the compiler to stay in a register. The volatile attribute tells it to 'not trust' the variable and always read it from memory. It is just a compiler directive, it does noting for Windows or the memory model. – AntonE Jan 20 '14 at 21:55
  • Thanks for the comments on using OTL and others, but I'm working on a project that need native code for future platform porting plans. – AntonE Jan 20 '14 at 22:07
  • @David "The x86 and x64 compilers have never optimised non-local variables into registers." With the key on 'non-local', but do you mean inside a method(local stack variables), or object-scope? E.g. when multiple threads run an object method TMyObj.Hello and read TMyObj.MyVal in a tight loop, would that not have been (possibly) optimized into a register? I'll include it anyways, it does not make a difference and might be needed in the next compiler for concurrent code. I suppose it is (really) needed more in device drivers and such when using 'absolute' variables then? – AntonE Jan 20 '14 at 22:34
  • as @gabr says, [volatile] is a no op on windows – David Heffernan Jan 20 '14 at 22:54

1 Answers1

4

At first I was skeptical of this being an ABA problem as indicated by gabr. It seemed to me that: if one thread looks at the current Head, and another thread pushes then pops; you're happy to still operate on the same Head in the same way.

However, consider this from your Pop method:

NewHead:=Result.Next;
if InterlockedCompareExchangePointer(Pointer(Head),NewHead,Result)=Result
  • If the thread is swapped out after the first line.
  • A value for NewHead is stored in a local variable.
  • Then another thread successfully pops the node this thread was targetting.
  • It also manages to push the same node back, but with a different value for Next before the first thread resumes.
  • The second line will pass the comparand check allowing head to receive the NewHead value from the local variable.
  • However, the current value for NewHead is incorrect, thereby corrupting your stack.

There's a subtle variation on this problem not even covered by your test app. This problem isn't checked in your test app because you aren't destroying any nodes until the end of your test.

  • Look at current head
  • Another thread pops some nodes.
  • The nodes are destroyed, and new nodes created and pushed.
  • By the time your "looking thread" is active again, it could be looking at an entirely different node that is coincidentally at the same address.
  • If you're popping, you might assign a garbage pointer to Head.

Apart form the above...
Looking at your test app there's also some really dodgy code. E.g.

You generate a "random number": J:=GetTickCount and 7;(*Get a 'random' number 0..7*).

  • Do you realise how fast computers are?
  • Do you realise that GetTickCount will generate reams of duplicates in a tight loop?
  • I.e. the numbers you generate will be nothing like random.
  • And when comments don't agree with code, my spidey-sense kicks in.

You're allocating memory of a hard-coded size: GetMem(zTmp,12);(*Allocate new node*).

  • Why aren't you using SizeOf?
  • You're using a multi-platform compiler.... The size of that structure can vary.
  • There is absolutely zero reason to hard-code the size of the structure.

Right now, given these two examples, I wouldn't be entirely confident that there isn't also an error in your test code.

Community
  • 1
  • 1
Disillusioned
  • 14,635
  • 3
  • 43
  • 77
  • Thanks for the well explained response. In the scenario you decribed, it makes sense that I cause the problem by re-using nodes. In real-world situation this stack will be used mostly with 'PopAll') and nodes will be passed down 'down the line' before being returned to an 'available for reuse' (pre-allocated) ring-buffer. So re-using (enough) nodes should not be a problem. I think my test tight loops caused a problem that would not have been there. Thanks for pointing it out. – AntonE Jan 20 '14 at 21:43
  • 1
    @AntonE I wouldn't count on it. Remember you're not just at risk of reusing nodes, there's also the risk of nodes that happen to be created at the same address. There may also be other risk scenarios. – Disillusioned Jan 20 '14 at 22:12