1

I had implemented a LIFO for shared memory context using assembly for ARMv8 64bit.

The LIFO inserts a node in beginning and each node structure's first attribute must be next pointer.

Is this correct assembly for implementing atomic insert and delete for the LIFO?

It uses LL/SC with an extra load or store between the LDXR and STXR to read the head->next or store a pointer into a new node.

typedef union {
   void * head[1];
}lifo;
int atomic_lifo_init(lifo * h) {
if (h) {
   h->head[0]=NULL;
  }
}
inline void *
 atomic_lifo_delete (lifo *h)
 {
         void    *ret = NULL;
         /*sa_ignore UNUSED_VAR*/
         void *  tmp = NULL;
 
         asm volatile ("\n"
                 "2:  ldxr      %0,[%2] \n" //load the head from h, which points the 1st node of list to local ret pointer.
                 "    cbz     %0, 3f \n" //check if the lifo is empty.
                 "    ldr      %1,[%0] \n" //store in tmp the 2nd node by derefencing the ret (h->head points 1st node. value of each node is next node as 1st attribute of node structure is pointing next.)
                 "    stxr     %w1, %1,[%2] \n" //update h->head with tmp.
                 "    cbnz     %w1, 2b \n" //continue if failed
                 "3:              \n"
                 : "=&r" (ret), "=&r" (tmp)
                 : "r" (h)
                 : "memory"
                 );
         return(ret);
 }
 
 /*
  * atomic_lifo_insert()
  *      Put an element on a list, protected against SMP
  */
 void
 atomic_lifo_insert (lifo *h, void *__new)
 {
         /*sa_ignore UNUSED_VAR*/
         void * next = NULL;
         void * flag = NULL;
         asm volatile (" \n"
                 "1: ldxr      %1,[%2] \n" //load head[0] from h,which points 1st node to local next pointer
                 "   str      %1,[%3] \n" //store the local next pointer to value of __new, as 1st attribute of the any node is next (convention used here). so __new's next is pointing current 1st node.
                 "   stxr    %w0, %3,[%2] \n" //update the h->head with 
   __next.
                 "   cbnz    %w0, 1b \n" //if stxr is failure try again.
                 : "=&r" (flag), "=&r" (next)
                 : "r" (h), "r" (__new)
                 : "memory"
                 );
 }

I am really new to ARM assembly, so any help is really appreciated.

Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
souradeep
  • 31
  • 4
  • Can you comment your code with what it's doing? I see you're doing a load inside an LL/SC transaction. That's something you can't express with stdatomic. Does that actually work, though, without forcing the `stxr` to fail? Being dependency-ordered after the `ldxr` should be fine, another thing you can't do with pure C11. But the `str` inside the transaction in `insert` might reorder out; do you need `stlxr` there? Also I'd be more worried about that SC failing if you do other stores before the `stxr`. – Peter Cordes Jun 19 '20 at 15:16
  • I have updated the code with comment, @PeterCordes – souradeep Jun 20 '20 at 13:08
  • @PeterCordes in insert str is updating the local pointer __new (new node), but stxr is updating the global h. (where h->head is pointing 1st node of lifo.) In normal testing it is working but in shared memory context sometime it is failing. Like some process is finding that list has become empty though not all nodes are not deleted. The same processes were working with intel atomic lifo implementation though , where lock cmpxchange was used. – souradeep Jun 20 '20 at 13:14

1 Answers1

1

Your inline asm constraints look correct, this should compile the way you intended. You probably could use "+m" (*h) to let the compiler pick an addressing mode instead of hard-coding it with "r"(h) and [%2].

As far as ordering, the ldr is dependency-ordered after the ldxr (like C11 memory_order_consume), so that works.

But the str between the LL/SC in insert might not become visible until after stxr publishes the address to other threads. So I think you need stlxr (a release-store) in insert.


Doing an extra load or store between LDXR/STXR is not safe. Wikipedia's LL/SC article mentions that some implementations can spuriously fail if you do any load or store between the LL and SC. Wiki says PowerPC specifically does allow this. But AArch64 in general explicitly does not: per the ARM ref manual (see @James Greenhalgh's comment):

LoadExcl/StoreExcl loops are guaranteed to make forward progress only if [...] Between the Load-Exclusive and the Store-Exclusive, there are no explicit memory accesses.

There can be some AArch64 CPUs where you'd create an infinite loop of stxr failure, but there might be others where it does work. If it works in practice on a CPU you test, probably a good idea to see if there's any documentation to back that up.

This is most likely to be a problem if the new_ node happens to be in the same cache line (or LL/SC exclusivity block) as the head node. Make sure you test that case, or make it impossible somehow, if this works at all on the microarchitecture(s) you care about.


Other than that I think your overall algorithm looks correct, so if you've tested it and found it works then that's probably good.

However, I haven't really carefully thought through your algorithm; I also don't have any experience designing or using stuff around raw LL/SC. I know in principle how they work, but I'm not prepared to say this is definitely correct. I don't see any problems from the little bit of thinking about it I've done, but that doesn't mean there aren't any.

Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
  • thanks for the response. So I should change ldxr to ldaxr as well. Or only ldxr/ then str then stlxr. – souradeep Jun 22 '20 at 06:27
  • @souradeep: I don't think you need `acq_rel` ordering for the RMW in insert, just `release`. So `ldxr / str / stlxr`. Plus the `str` itself is dependency-ordered after the `ldxr` anyway; the store data comes from the load so there's no way for the store to become visible to other threads before the load reads from cache. – Peter Cordes Jun 22 '20 at 06:49
  • Thanks, will modify to stlxr from stxr. So this implementation is safe for shared memory context also I guess? Means if the lifo header is created on shared memory using mmap() and multiple process can insert delete nodes. – souradeep Jun 22 '20 at 10:21
  • @souradeep: yes, if it's correct at all (which I think but I'm not much of an LL/SC or ARM expert, and I haven't 100% thought through your whole algorithm), it will work between processes with shared mem the same as for threads in a process. (Except coordinating deallocation is probably harder if you can't use `malloc` / `free`) – Peter Cordes Jun 22 '20 at 15:02
  • 1
    The LDR in the middle of the LL/SC loop is not a good idea. From my copy of the Arm Architecture Reference Manual (ARM DDI 0487C.a B2.9.5): "LoadExcl/StoreExcl loops are guaranteed to make forward progress only if [...] Between the Load-Exclusive and the Store-Exclusive, there are no explicit memory accesses." – James Greenhalgh Jun 30 '20 at 09:00
  • 1
    @JamesGreenhalgh: Thanks for that quote from the docs, revised my answer. – Peter Cordes Jun 30 '20 at 09:11
  • @JamesGreenhalgh in that case , if I remove ldr, from between ldxr and stxr, then what should be the correct way to achieve the same result? Also same is for str in between ldxr and stxr then? – souradeep Jul 09 '20 at 08:29
  • @souradeep: James quoted the relevant sentence. Stores are also "memory accesses". Without this LL/SC mini-transaction, you'd probably have to fall back to the usual CAS retry loop, although it's been a while since looking at your question so I forget exactly how your design worked (if I ever fully understood). I'm sure there are examples of atomic LIFO linked lists in C++11; any of those would be fine. Without being able to do extra memory operations between LL and SC, the only thing it gains you vs. CAS is protection from the "ABA problem". (Maybe still useful). – Peter Cordes Jul 09 '20 at 10:08