The PTX documentation states that the bar.sync and bar.arrive barrier synchronization instructions can be used as follows:

bar.sync      a{, b};
bar.arrive    a, b;

Where

Source operand a specifies a logical barrier resource as an immediate constant or register with value 0 through 15. Operand b specifies the number of threads participating in the barrier.

It also shows an example where a producer-consumer model is established using these instructions:

// Producer code places produced value in shared memory.
st.shared [r0],r1; 
bar.arrive 0,64;
...

// Consumer code, reads value from shared memory 
bar.sync 0,64; 
ld.shared r1,[r0];
...

I do not quite get the purpose of operand b in bar.arrive. In bar.sync, this operand controls the number of threads involved in the barrier, and the executing threads wait until that thread count is reached. Its use in bar.arrive is not clear to me.

talonmies
Farzad

1 Answer

Two things happen when all threads have arrived at a barrier:

  1. All waiting threads are allowed to proceed beyond the barrier.
  2. The barrier is re-initialized so it is ready to be used again.

You are probably thinking only of 1., which can only happen at a bar.sync instruction. It is therefore obvious that a bar.sync needs to know the number of threads participating in the barrier. However, the barrier can also be released at a bar.arrive, when the last threads to reach the barrier execute bar.arrive rather than bar.sync, so bar.arrive also needs to know the number of participating threads.

Having said that, it is undocumented what happens if participating warps or even threads disagree on the number of threads involved. This could be seen as an opportunity for a daring, inquisitive mind to find possible new (and unsupported!) synchronization constructs through reverse-engineering.
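The release-and-reset behaviour described above can be illustrated with a small host-side toy model in C++. This is only a sketch of the reasoning, not a description of how the hardware is implemented: `ToyBarrier`, its fields, and `demo_phases` are hypothetical names made up for illustration, and the model assumes every caller passes the same thread count. The point it tries to show is that the barrier keeps no record of the expected count, so whichever call happens to arrive last, `arrive` or `sync`, must supply that count itself in order to detect completion and reset the barrier.

```cpp
#include <cassert>

// Hypothetical toy model of one logical barrier resource (NOT real hardware).
// The expected thread count is deliberately not stored in the barrier;
// every call has to pass it, mirroring operand b of bar.sync / bar.arrive.
struct ToyBarrier {
    int arrived = 0;  // threads counted so far in the current phase
    int waiting = 0;  // threads currently blocked in sync()
    int phase   = 0;  // number of times the barrier has completed

    // Models `bar.arrive a, b` executed by `threads` threads (e.g. one warp).
    // Never blocks. Returns true if this call completed the barrier.
    bool arrive(int threads, int expected) {
        arrived += threads;
        if (arrived < expected) return false;
        // Last arrival: release all waiters and re-initialize the barrier.
        arrived = 0;
        waiting = 0;
        ++phase;
        return true;
    }

    // Models `bar.sync a, b`: the threads count as arrived and also wait.
    // Returns true if this call completed the barrier.
    bool sync(int threads, int expected) {
        waiting += threads;
        return arrive(threads, expected);
    }
};

// Two rounds of the producer-consumer pattern on the same barrier,
// with one 32-thread producer warp and one 32-thread consumer warp.
int demo_phases() {
    ToyBarrier bar;

    // Round 1: the consumer reaches bar.sync first, the producer arrives last.
    bool c1 = bar.sync(32, 64);    // consumer now waits
    bool p1 = bar.arrive(32, 64);  // completion is detected inside arrive
    assert(!c1 && p1);

    // Round 2: the barrier was reset, so it can be reused immediately.
    bool p2 = bar.arrive(32, 64);  // producer arrives first this time
    bool c2 = bar.sync(32, 64);    // completion is detected inside sync
    assert(!p2 && c2);

    return bar.phase;
}
```

In round 1 the barrier completes inside `arrive`, which is only possible because `arrive` receives the expected count of 64. What the model does when callers disagree on the count is arbitrary and says nothing about the real, undocumented behaviour.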

tera
  • Can you please provide an example where releasing a barrier at a `bar.arrive` would be useful? – Farzad Jun 12 '17 at 17:47
  • Whenever you want to use the same barrier more than once in your code. – tera Jun 12 '17 at 18:11
  • My point is a design with `bar.arrive` executed by some threads or warps and with no `bar.sync` seems functionally redundant. Therefore, the barrier restart can be signaled only by `bar.sync`. Unless the existence of multiple `bar.sync` executed by multiple warps for a barrier is possible. For instance, a `bar.arrive 0,128` by one warp, and then `bar.sync 0,64` by another warp and `bar.sync 0,128` by two other warps. – Farzad Jun 12 '17 at 19:23
  • You are right that all threads executing `bar.arrive` with no `bar.sync` would be a no-op, so there needs to be at least one `bar.sync` somewhere, and the total number of threads could be taken from there. That would however require the barrier hardware to have a register to store the total number of threads in case the last warp arriving at the barrier executes a `bar.arrive`. I suspect that register just doesn't exist. – tera Jun 12 '17 at 22:30