
This question does not assume any specific architecture. Assume that we have a multicore processor with cache coherence, out-of-order execution, and branch prediction logic. We also assume that stores to memory are strictly in program order.

We have two threads running in parallel, each on a separate core.

Below are the threads’ pseudo-code. data and flag are initially 0.

Thread #1 code:

data=10;
flag=1;

Thread #2 code:

while(!flag);
print data;

With proper synchronization, Thread #2 would eventually print 10. However, the branch predictor could predict that the loop is not entered, and thus perform a speculative read of data, which still contains 0 at that point (before Thread #1 sets it). Suppose the prediction turns out to be correct, i.e. flag is eventually set to 1. In this case the print-data instruction can be retired, yet it prints the stale value 0.

The question is whether a memory barrier would somehow prevent the speculative read of data and cause the CPU to execute the busy-wait properly. An alternative solution could be to let the branch predictor do its work, but snoop the writes done by the other core; if a write to data is detected, the ROB could be used to undo the premature read (and its dependent instructions) and re-execute with the proper data.

Arch-specific answers are also welcome.


1 Answer

No, branch prediction + speculative execution is fine in an ISA with memory barriers, as long as mis-speculation is killed properly.

thus perform a speculative read of data, which contains 0 at that time

When the CPU detects the misprediction, instructions from the mis-speculated path of execution are discarded, along with their effects on architectural registers.

When the correct path of execution does eventually exit the loop, then the memory barrier will run (again), then the load of data will run (again). The fact that they earlier ran in the shadow of a mis-predicted branch has no effect.

Your pseudo-code isn't very clear because it makes print data look like a single operation. In fact it will involve a load of data into a register and then a call to a print function.

When the data load runs on the correct path, it will have to redo the work of reading a value from cache, and cache is coherent across cores. It doesn't matter if the mis-speculated load brought the cache line into this core's L1d cache; a store by another core will have to invalidate it before that store can become globally visible.

The loop exits after seeing flag != 0; the barrier after that makes sure that later loads haven't already happened, giving acquire semantics to the load of flag (assuming the barrier blocks LoadLoad reordering).

The barrier executing on the correct path makes sure that this core waits for that invalidation instead of using an early load.

A store / release barrier in the writer makes sure that the new data value is globally visible before flag = 1 is visible to any other threads on any cores.

Peter Cordes
  • First, thank you for your amazing contributions on everything low-level on SOF. I learn a lot from your educational answers. There might be some ambiguity in my question. It’s assumed that by the time that the actual value of `flag` is read, it has already been set to 1 by the writer, so the original prediction is correct by not entering the loop, and no mis-prediction logic is carried out. In which case I would assume that `data` is 0, unless somehow reloaded. Hope I’m making myself clearer this time. – Daniel Nitzan May 11 '20 at 15:02
  • @DanielNitzan: If there's a barrier (explicit on a weakly-ordered ISA, or implicit on x86 where every load is an acquire load) between reading `flag` and reading `data`, then the CPU isn't allowed to use any value it read for `data` that isn't still valid by the time it actually reads the value of `flag` from cache. Modern x86 CPUs do speculatively do later loads, and handle this problem with a memory ordering machine clear rollback of out-of-order state if `data` isn't still valid by the time the architectural memory model allows them to read it. – Peter Cordes May 11 '20 at 15:38
  • Or on a weakly ordered ISA, a barrier instruction would probably just block exec of the later load until the `flag` load has actually taken a value. Speculative execution past a branch would just let the fence start waiting right away, like the case where there's no branch. This applies whether or not you're branching on it. I don't see how branch prediction could be relevant if you're only considering the correctly-predicted case, and with no data dependency between one load and the next (like for `mo_consume`). – Peter Cordes May 11 '20 at 15:40
  • So in practical terms, I don’t see a way for the CPU to know that `data` is invalid by the time it reads `flag`, other than snoop the other processor’s write to `data`. If that write is snooped prior to retirement of the read instructions, the rollback/re-read mechanism should kick in. – Daniel Nitzan May 11 '20 at 18:28
  • 1
    @DanielNitzan: A message / directory-based MESI implementation like real-world modern CPU use doesn't have to actually snoop. Another core wanting to write will have to actively do an RFO (read for ownership) to gain exclusive ownership (Exclusive / Modified state) of the cache line *before* it can commit a store from its store buffer to its L1d cache. That means this core's Shared copy will be invalidated. A speculative early load could monitor the line for that. Or if you have an actual barrier instruction before the read of `data`, the load won't happen early in the first place. – Peter Cordes May 11 '20 at 18:32
  • That makes sense. On the x86, given its strong memory consistency, we don’t use any barriers and I assume that the interconnect monitoring is the mechanism employed. – Daniel Nitzan May 11 '20 at 18:46
  • 1
    @DanielNitzan: Right, since x86 has strong load ordering that would hurt performance in normal single-threaded code working on private data, real implementations *speculatively* load out of order. On weakly-ordered ISAs, they can just load out of order all the time without having to track the status of that cache line until retirement. – Peter Cordes May 11 '20 at 18:53