
Consider this code:

#include <atomic>

// global
std::atomic<bool> run{true};

// thread 1
while (run) { /* do stuff */ }

// thread 2
/* do stuff until it's time to shut down */
run = false;

Do I need the overhead associated with the atomic variable here? My intuition is that reads and writes of a boolean variable are more or less atomic anyway (this is a common g++/Linux/Intel setup), and if some write/read timing weirdness makes my run loop on thread 1 stop one pass early or late, I'm not too worried about it for this application.

Or is there some other consideration I'm missing here? Looking at perf, my code appears to spend a fair amount of time in `std::atomic_bool::operator bool`, and I'd rather have that time go to the loop body instead.

John S
    "it appears my code is spending a fair amount of time in `std::atomic_bool::operator bool`" Are you compiling with optimizations on? That thing should be fully inlined. – T.C. Jun 21 '17 at 21:38

1 Answer


You need to use std::atomic for two reasons. First, it prevents undesired optimizations: without it, the compiler is free to read the value once and then either always loop or never loop. Second, it gets you correct behavior on systems without a strongly ordered memory model: x86 is strongly ordered, so once the write finishes, the next read will see it, but on other systems, if the threads don't flush CPU cache to main RAM for other reasons, the write might not be seen for a long time, if ever.
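To make the first point concrete, here is a minimal sketch of what the optimizer may do to a non-atomic flag (the worker function is hypothetical, added for illustration; a data race on a plain bool is undefined behavior, so this transformation is legal):

// global -- note: NOT atomic
bool run = true;

// thread 1
void worker() {
    while (run) { /* do stuff */ }
}

// With optimizations on, the compiler may hoist the load out of the
// loop, effectively compiling worker() as:
//
//     if (run) { for (;;) { /* do stuff */ } }
//
// so thread 2's later write to run is never observed.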

You can improve the performance, though. By default, std::atomic operations use sequential consistency, which is overkill for a single flag value. You can speed things up by calling load/store with an explicit (and less strict) memory ordering, so each access isn't required to use the most paranoid mode of maintaining consistency.

For example, you could do:

// global
std::atomic<bool> run{true};

// thread 1
while (run.load(std::memory_order_acquire)) { /* do stuff */ }

// thread 2
/* do stuff until it's time to shut down */
run.store(false, std::memory_order_release);

On an x86 machine, any ordering less strict than the default (and strictest) sequential consistency typically ends up doing nothing beyond ensuring the instructions are executed in a specific order; no bus locking or the like is required, thanks to the strongly ordered memory model. Thus, aside from guaranteeing the value is actually read from memory rather than cached in a register and reused, using atomics this way is essentially free on x86, and on non-x86 machines it makes your code correct (which it otherwise wouldn't be).
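For reference, here is a self-contained, compilable version of the pattern (the worker function, main, and the sleep are illustrative additions, not part of the original question):

#include <atomic>
#include <chrono>
#include <thread>

std::atomic<bool> run{true};

// thread 1: spin until the flag is cleared
void worker() {
    while (run.load(std::memory_order_acquire)) { /* do stuff */ }
}

int main() {
    std::thread t(worker);
    // thread 2: do stuff until it's time to shut down
    std::this_thread::sleep_for(std::chrono::milliseconds(100));
    run.store(false, std::memory_order_release);
    t.join();
}

Compile with something like g++ -O2 -pthread; with optimizations on, the acquire load should inline to a plain mov on x86.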

ShadowRanger
  • Thank you for the detailed answer! Much appreciated. – John S Jun 21 '17 at 20:37
  • @ShadowRanger About your statement (_x86 is strongly ordered, so once the write finishes, the next read will see it_)... That's not entirely accurate, because x86 does not prevent #StoreLoad reordering by default. If you need that guarantee, a full fence (or `xchg`) is required. – LWimsey Jun 21 '17 at 21:21
  • On typical x86 implementations, there's no performance difference between a SC load and a non-SC load. The only difference is in the store, which only happens once in this code. – T.C. Jun 21 '17 at 21:40
  • +1 for the correct answer, but probably worth mentioning that sequential consistency is much easier to reason about, so best advice is to go for that until the customers are bitching about performance and profiling confirms that it's the fault of memory fences (it won't be). – Richard Hodges Jun 21 '17 at 22:31
  • @RichardHodges: True, it's harder to get it wrong with sequential consistency. I suspect in this case even relaxed ordering would work, but my paranoia makes me use `release`/`acquire` semantics just in case there is really data that must be visible from the parent thread during the shutdown process. I agree it's highly unlikely to be a performance issue; unless the loop is nigh empty, the cost of a full memory fence is likely to be small in the context of any real work. – ShadowRanger Jun 22 '17 at 00:14
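To illustrate the relaxed ordering mentioned in the last comment, a hedged sketch (safe only if the flag is the sole communication between the threads, with nothing else that must become visible at shutdown):

// thread 1
while (run.load(std::memory_order_relaxed)) { /* do stuff */ }

// thread 2
run.store(false, std::memory_order_relaxed);

// Relaxed ordering still guarantees the flag itself is never torn and
// is eventually seen, but it orders nothing else around it; if shutdown
// must publish other data to thread 1, keep release/acquire as above.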