Assuming the source and target architectures are different, how do emulators efficiently translate memory barriers? I know that modern emulators generally use a JIT to translate from the source ISA to the target ISA. But knowing which code is reachable by multiple program counters and which isn't seems pretty tricky, and knowing which instructions are safe to reorder (which may be required for the JIT to generate efficient code, given the ISA differences) and which are not seems extremely tricky.
You're not even guaranteed to find an explicit memory barrier in the instruction stream; for example, many people on x86 rely on aligned word writes being atomic. Are emulators conservatively assuming that no aligned word write can be reordered? That seems like a potentially huge overhead, which leads me to wonder whether there are any known analyses for tackling this sort of problem.
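
To make the worry concrete, here is a toy sketch in Python of the conservative strategy I have in mind. Everything in it is made up for illustration: `GuestOp`, `translate_conservatively`, and the string-based host ops (including the generic `fence`) are hypothetical and don't correspond to any real emulator or ISA encoding. It simply emits a full fence after every guest memory access and counts how many fences result.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GuestOp:
    """Hypothetical guest instruction: kind is "load", "store", or "alu"."""
    kind: str
    operand: str


def translate_conservatively(guest_ops):
    """Translate guest ops to toy host ops, fencing every memory access.

    Assumption: the host has a weaker memory model than the guest, and the
    translator knows nothing about which addresses are shared, so it puts a
    full fence after each load and store to forbid any reordering.
    """
    host_ops = []
    for op in guest_ops:
        host_ops.append(f"{op.kind} {op.operand}")
        if op.kind in ("load", "store"):
            host_ops.append("fence")  # generic full barrier, not a real mnemonic
    return host_ops


# A guest block with no explicit barrier: write data, then publish a flag.
guest_block = [
    GuestOp("store", "data"),
    GuestOp("store", "flag"),
    GuestOp("alu", "r1+r2"),
    GuestOp("load", "x"),
]

host_block = translate_conservatively(guest_block)
fence_count = host_block.count("fence")

for line in host_block:
    print(line)
print(f"{fence_count} fences for {len(guest_block)} guest instructions")
```

For this four-instruction block, three of the seven emitted host ops are fences, which is the kind of overhead I'm asking about.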