
I have read this webpage:

http://bartoszmilewski.com/2008/12/01/c-atomics-and-memory-ordering/

and then wrote the following test, compiled with g++ 4.8.1 on an Intel CPU ...

global vars : r1 = 0; r2 = 0; x = 0; y = 0;

Thread1 : 
         x  = 1 ;  //line 1 
         r1 = y ;  //line 2
Thread2 :
         y  = 1 ;  //line 3 
         r2 = x ;  //line 4

And sometimes I get r1 == 0 && r2 == 0 when running thread1 and thread2 concurrently. I know this happens because the load of y (line 2) and the load of x (line 4) can be executed before the store of x (line 1) and the store of y (line 3). Even on a CPU with a strong memory model like Intel's, a load can still be reordered before an earlier store, and that is why r1 == 0 && r2 == 0 still shows up in this test!
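A minimal, self-contained sketch of this first test (the std::thread/loop harness below is just one way to drive it, not necessarily the exact code I ran):

    // NB: sharing plain non-atomic ints between threads is formally a data
    // race (undefined behaviour); the test only aims to expose reordering
    // in practice.
    #include <iostream>
    #include <thread>

    int r1 = 0, r2 = 0;
    int x = 0, y = 0;

    void thread1() { x = 1; r1 = y; }   // line 1, line 2
    void thread2() { y = 1; r2 = x; }   // line 3, line 4

    int main() {
        int detected = 0;
        for (int i = 0; i < 100000; ++i) {
            r1 = r2 = x = y = 0;
            std::thread t1(thread1), t2(thread2);
            t1.join();
            t2.join();
            if (r1 == 0 && r2 == 0)
                ++detected;   // both loads ran before the other thread's store
        }
        std::cout << "r1==0 && r2==0 observed " << detected << " times\n";
    }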

Referring to the C++11 memory model, I changed the source to the following:

global vars :
             int r1=0,r2=0 ;
             atomic<int> x{0} ;
             atomic<int> y{0} ; 
Thread1 :
             x.store(1,memory_order_acq_rel) ;
             r1=y.load(memory_order_relaxed) ;
Thread2 :
             y.store(1,memory_order_acq_rel) ;
             r2=x.load(memory_order_relaxed) ;

This time, r1 == 0 && r2 == 0 never happens. The memory_order values I used are based on the website I mentioned at the start; see these statements:

memory_order_acquire: guarantees that subsequent loads are not moved before the current load or any preceding loads.

memory_order_release: preceding stores are not moved past the current store or any subsequent stores.

memory_order_acq_rel: combines the two previous guarantees

memory_order_relaxed: all reorderings are okay.
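For example, what a release/acquire pair does guarantee is publication, as in this standard message-passing sketch (my own illustration, not code from the article):

    #include <atomic>

    int payload = 0;
    std::atomic<bool> ready{false};

    void producer() {
        payload = 42;                                   // ordinary write
        ready.store(true, std::memory_order_release);   // publish it
    }

    void consumer() {
        while (!ready.load(std::memory_order_acquire))  // pairs with the release store
            ;                                           // spin until published
        int v = payload;   // guaranteed to read 42: the acquire load
        (void)v;           // synchronizes with the release store
    }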

So that version looks like it works... Still, I did another test and changed the code to:

global vars :
             int r1=0,r2=0 ;
             atomic<int> x{0} ;
             atomic<int> y{0} ; 
Thread1 :
             x.store(1,memory_order_relaxed) ;
             r1=y.load(memory_order_relaxed) ;
Thread2 :
             y.store(1,memory_order_relaxed) ;
             r2=x.load(memory_order_relaxed) ;

What confuses me is that this test also never produces r1 == 0 && r2 == 0! If this case works, why bother using memory_order_acq_rel? Or does this only work on Intel CPUs, and do other kinds of CPUs still need memory_order_acq_rel on the stores to x and y?

barfatchen

2 Answers


The result from your first experiment is interesting: "sometimes I get r1 == 0 && r2 == 0 when running thread1 and thread2 concurrently ... even on a CPU with a strong memory model like Intel's, a load can still be reordered before an earlier store" but not only for the reasons you think. Atomics don't only prevent the processor and cache subsystem from reordering memory accesses, but the compiler as well. GCC 4.8 at Coliru optimizes this code to assembly with the load instructions before the stores:

_Z7thread1v:
.LFB326:
    .cfi_startproc
    movl    y(%rip), %eax
    movl    $1, x(%rip)
    movl    %eax, r1(%rip)
    ret

Even if the processor guaranteed memory ordering here, you need some kind of fencing to keep the compiler from screwing things up.
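For illustration, a compiler-only barrier such as std::atomic_signal_fence is one (hypothetical) way to pin down just the compiler half of the problem: it stops GCC from hoisting the load of y above the store to x, but it emits no fence instruction, so the CPU is still free to reorder them at run time:

    #include <atomic>

    // Shared variables, assumed to be defined elsewhere in the test program.
    extern int r1, x, y;

    void thread1() {
        x = 1;
        // Compiler barrier only: no fence instruction is emitted, so the
        // processor may still reorder the store past the load below.
        std::atomic_signal_fence(std::memory_order_seq_cst);
        r1 = y;
    }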

Your second program is ill-formed due to the use of memory_order_acq_rel as the memory ordering for a store. acquire only makes sense for loads, and release only for stores, so memory_order_acq_rel is only valid as an ordering for atomic read-modify-write operations like exchange or fetch_add. Replacing m_o_a_r with memory_order_release achieves the semantics you want, and the assembly produced is again interesting:

_Z7thread1v:
.LFB332:
    .cfi_startproc
    movl    $1, x(%rip)
    movl    y(%rip), %eax
    movl    %eax, r1(%rip)
    ret

The instructions are exactly what we would expect to be generated, with no special fence instructions. The processor memory model is strong enough to provide the necessary ordering guarantees with plain-old mov instructions. In this instance, atomics are only necessary to tell the compiler to keep its fingers out of the code.
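Spelled out, that corrected second program looks like this (a sketch with only the store orderings changed from the question's version):

    #include <atomic>

    int r1 = 0, r2 = 0;
    std::atomic<int> x{0}, y{0};

    void thread1() {
        x.store(1, std::memory_order_release);    // release is valid for a store
        r1 = y.load(std::memory_order_relaxed);
    }

    void thread2() {
        y.store(1, std::memory_order_release);
        r2 = x.load(std::memory_order_relaxed);
    }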

Your third program is (technically) unpredictable despite generating the same assembly as the second:

_Z7thread1v:
.LFB332:
    .cfi_startproc
    movl    $1, x(%rip)
    movl    y(%rip), %eax
    movl    %eax, r1(%rip)
    ret

Although the results are the same this time, there's no guarantee that the compiler won't choose to reorder the instructions as it did for the first program. The result may change when you upgrade your compiler, or introduce other instructions, or for any other reason. If you start compiling on ARM, all bets are off ;) It's also interesting that despite relaxing the requirements in the source program, the generated assembler is the same. There's no way to relax the memory ordering outside the restrictions that the processor architecture puts in place.
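If you want the C++11 model itself, and not just the x86 hardware, to forbid r1 == 0 && r2 == 0, one possible sketch (not part of the programs above) is a memory_order_seq_cst fence between the store and the load in each thread; on x86 this compiles to a real fence instruction such as mfence:

    #include <atomic>

    int r1 = 0, r2 = 0;
    std::atomic<int> x{0}, y{0};

    void thread1() {
        x.store(1, std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_seq_cst);  // full fence
        r1 = y.load(std::memory_order_relaxed);
    }

    void thread2() {
        y.store(1, std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_seq_cst);  // full fence
        r2 = x.load(std::memory_order_relaxed);
    }
    // With a seq_cst fence between the store and the load in both threads,
    // the memory model forbids r1 == 0 && r2 == 0 on any architecture.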

Casey
  • Thank you, Casey, it is a great explanation!! I have been trying to figure out this issue for a while, and this website explains this test well: http://preshing.com/20120515/memory-reordering-caught-in-the-act . He runs this test and observes r1==0 && r2==0 roughly once every 6600 iterations; my test behaves much the same. He adds asm volatile("mfence" ::: "memory"); to prevent memory reordering, so that r1==0 && r2==0 won't happen. If I don't use atomics for x and y, then I have to use that assembly in my test source to avoid r1==0 && r2==0 – barfatchen Jul 31 '13 at 03:51
  • And as the webpage says: "In particular, each processor is allowed to delay the effect of a store past any load from a different location". According to this statement, that is why I said that even on an Intel CPU the load of y can still happen before the store of x, which causes r1==0 && r2==0. Every few thousand executions I get one occurrence of r1==0 && r2==0 while x and y are not atomic – barfatchen Jul 31 '13 at 03:52
  • http://bartoszmilewski.com/2008/11/05/who-ordered-memory-fences-on-an-x86/ The author states: "Loads may be reordered with older stores to different locations". I know nothing about assembly code, but according to the webpages I found, it looks like the CPU being allowed to move a load before a store plays a part in this test. Is your assembly code showing this? – barfatchen Jul 31 '13 at 04:14
  • @barfatchen Ouch, yes, I completely screwed up that x86 rule. Which is ironic, given that Intel's processor manual uses this exact program as the example for reads moving before writes to different locations. Will fix the answer momentarily. – Casey Jul 31 '13 at 16:50
  • There, corrected. It drives home the point that memory orderings are very much black magic: anything outside of the default sequentially consistent semantics should be considered a micro-optimization and a dangerous one at that. – Casey Jul 31 '13 at 16:56
  • "_cache subsystem from reordering memory accesses_" how can the cache do that? – curiousguy Dec 09 '19 at 00:17

There are a bunch of issues here:

(1) Releases and acquires must be in pairs. Otherwise, they don't establish synchronization and don't guarantee anything.

(2) Even if you make the stores release and the loads acquire in your example, the memory model still allows r1=r2=0. You need to make everything seq_cst to forbid that execution (as sketched below).

(3) We've built a tool at http://demsky.eecs.uci.edu/c11modelchecker.html for testing C11 atomic code. It will give you all executions allowed under reasonable interpretations of the C/C++11 memory model.

You may not see these interesting behaviors on current GCC versions yet, as at least the earlier versions ignored the memory ordering parameter and always used seq_cst. If GCC changes that, you could see r1=r2=0.
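For reference, a sketch of the all-seq_cst version mentioned in point (2) above (memory_order_seq_cst is also the default ordering, so the explicit arguments could be omitted):

    #include <atomic>

    int r1 = 0, r2 = 0;
    std::atomic<int> x{0}, y{0};

    void thread1() {
        x.store(1, std::memory_order_seq_cst);   // the default ordering, shown explicitly
        r1 = y.load(std::memory_order_seq_cst);
    }

    void thread2() {
        y.store(1, std::memory_order_seq_cst);
        r2 = x.load(std::memory_order_seq_cst);
    }
    // Under sequential consistency the four operations have a single total
    // order; whichever load comes last must see the other thread's earlier
    // store, so r1 == 0 && r2 == 0 is impossible.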

briand