Understanding "A software combining tree barrier with optimized wakeup" algorithm

Question

I am reading this pseudo code for a barrier synchronization algorithm from this paper, and I could not fully understand it.

The goal of the code is to create a barrier for multiple threads (threads can't pass the barrier unless all threads have completed) using something called "software combining tree" (not sure what it means)

Here is the pseudo code (though I encourage you to look at the article as well)

type node = record
      k : integer             // fan-in of this node
      count : integer         // initialized to k
      locksense : Boolean     // initially false
      parent : ^node          // pointer to parent node; nil if root

  shared nodes : array [0..P-1] of node
      // each element of nodes allocated in a different memory module or cache line
  processor private sense : Boolean  := true
  processor private mynode : ^node    // my group's leaf in the combining tree

  procedure combining_barrier 
      combining_barrier_aux (mynode)      // join the barrier
      sense := not sense                  // for next barrier

  procedure combining_barrier_aux (nodepointer : ^node)
      with nodepointer^ do
          if fetch_and_decrement (&count) = 1     // last one to reach this node
              if parent != nil
                  combining_barrier_aux (parent)
              count := k                          // prepare for next barrier
              locksense := not locksense          // release waiting processors
          repeat until locksense = sense

I understand that it implies building a binary tree but I didn't understand a few things.

Is P the number of threads?
What is k? What is "fan-in of this node"
The article mentions that threads are organized as groups on the leaves of the tree, what groups?
Is there exactly one node for each thread?
How do I get "my group's leaf in the combining tree"?

Haven't looked at the paper, but your example seems incomplete: The constant, P, is only used in the declaration of the nodes variable, and nodes does not seem to be used anywhere. — Solomon Slow, Feb 13 '14 at 18:35
@jameslarge that's what intrigues me, the paper seems to assume you know what P is and what is a software combining tree — Eran Medan, Feb 13 '14 at 18:44
It's possible we're reading this for the same class lol. Yeah, I think it's ... not nice if this purports to be pseudocode that certain details are not filled in (for example how the shared array nodes is being used). — Alex Leibowitz, Sep 26 '22 at 20:22

score 1 · Accepted Answer · edited Jun 20 '20 at 09:12

Is P the number of threads?

YES

What is k? What is "fan-in of this node".

This is called the ary-ness of the tree being constructed. This is a binary tree and hence k = 2. The idea is that for a smaller number of processors, the binary tree will suffice. But as the number of processors increases, then the levels in the tree will grow a lot. This is balanced by increasing the ary-ness by increasing the value of k. This will essentially enable more than two processors to be part of a leaf or a group. As the system scales to thousands of processors with interconnect, this may be important. The downside to this is the increased contention. As more processors become part of the same tree, they will be spinning on the same variable.

The article mentions that threads are organized as groups on the leaves of the tree, what groups?

The threads are organized into groups equal to number of leaves. The number of leaves is essentially the number of threads divided by the ary-ness or k. So for a set of 8 threads, with k=2, the number of leaves will be 4. This allows the threads in a group to spin on a single variable and also to distribute the spinning across multiple variables, rather than a single shared variable as in the basic centralized barrier algorithm.

Is there exactly one node for each thread?

The answer is NO. Of course there are at least as many nodes as the leaves. For a 8-thread problem, there will be at 4 nodes for the 4 leaves. Now, after this flat level, the "winners"(thread that comes last) will climb up to its parent in a recursive manner. The number of levels will be log P to base k. For each parent, there will be a node, eventually climbing up to the true root or the parent. for e.g. for the 8 thread problem, there will be 4 + 2 + 1 = 7 nodes.
How do I get "my group's leaf in the combining tree"?

This is a little tricky part. There is a formula based on modulus and some integer division. However, I have not seen a publicly available implementation. sorry, I can't reveal what I have seen only in a class, as that may not be appropriate. May be a google search or someone else can fill in this.

7 nodes for 8 threads? How does the 8th thread know when to continue? — ParkerD, Oct 07 '22 at 00:35
Also, `shared nodes : array [0..P-1] of node`, if P is the number of threads, why is it used in the paper to seemingly mean the number of nodes? — ParkerD, Oct 07 '22 at 16:27

score 0 · Answer 2 · answered Feb 13 '14 at 18:44

Just a guess, not an real answer

The nodes probably constitute a complete binary tree with P/2 leaves where P is number of threads. Two threads are assigned to each leaf.

The threads at each leaf meet up, and decide which will be "active" and which will be "passive". The passive thread waits while the active thread combines both of their inputs and moves up to the parent node to meet up with the active thread from the sibling leaf. Then those two elect an active member, and the process repeats, moving up the tree until one thread is elected "active" at the root.

The active thread at the root does the final computation, producing a result, and then Then the whole thing unwinds, and all of the passive threads are notified of the result, and released to do whatever it is they're going to do.

Understanding "A software combining tree barrier with optimized wakeup" algorithm

2 Answers2