2

I have a class of objects in a multithreaded application where each thread can mark an object for deletion, then a central garbage collector thread actually deletes the object. The threads communicate via member methods that access an internal bool:

class MyObjects {
...
    bool shouldBeDeleted() const
    {
        return m_Delete;
    }

    void markForDelete()
    {
        m_Delete = true;
    }
...
    std::atomic<bool> m_Delete;
};

The bool was made atomic by someone else in the past because Thread Sanitizer kept complaining. However, perf now shows a measurable overhead in the internal atomic load:

   │     ↓ cbz    x0, 3f4
   │     _ZNKSt13__atomic_baseIbE4loadESt12memory_order():
   │           {
   │             memory_order __b = __m & __memory_order_mask;
   │             __glibcxx_assert(__b != memory_order_release);
   │             __glibcxx_assert(__b != memory_order_acq_rel);
   │
   │             return __atomic_load_n(&_M_i, __m);
   │       add    x0, x0, #0x40
 86,96 │       ldarb  w0, [x0]

The target platform is GCC on AArch64 with Yocto Linux.

Now my questions are as follows:

  • Is atomic really needed in this case? The transition of the bool is one way (from false to true) with no way back while the object lives, so an inconsistency would merely mean that the object is deleted a little later, right?

  • Is there an alternative to std::atomic<bool> that will silence Thread Sanitizer but is computationally cheaper than std::atomic<bool>?

  • There is [std::atomic_flag](https://en.cppreference.com/w/cpp/atomic/atomic_flag) which may or may not be faster. Also setting the memory order when loading/storing may be helpful. – perivesta Oct 16 '20 at 14:07
  • Where is the performance hit you're worried about? Is it on setting the object obsolete (which in any case only happens once/object)? Or is it on checking obsoleteness? If the latter - consider linking obsolete objects onto a list rather than setting a flag in each one - then the list is the list of obsolete objects and the objects need not be interrogated on each GC. – davidbak Oct 16 '20 at 14:09
  • But atomic_flag is not assignable. How would I do that? – Desperado17 Oct 16 '20 at 14:10
  • Does the garbage collector access any data from/in the object, other than the `m_Delete` flag? – dyp Oct 16 '20 at 14:19
  • @Desperado17: `test_and_set`, ignoring the test part? You don't need a reset, as you already observed yourself. – MSalters Oct 16 '20 at 14:21
  • I wonder if `atomic_flag` isn't even _more expensive_ on AArch64: the GC needs the result, so it needs to call `test_and_set`. This seems to be two operations on AArch64. – dyp Oct 16 '20 at 14:29
  • Does the "GC" also discover unused objects on its own (not linked in to object graph) or is it only deleting marked objects? If the latter - it is not really GC is it? It's delayed reclamation of memory. Perhaps if you thought of it in that way (regardless of the name someone previously gave it) more solutions would appear for your performance problem. Please clarify! – davidbak Oct 16 '20 at 14:30
  • @dyp: There are no calls to object state by the GC between the shouldBeDeleted call and the delete(object). Before that I cannot guarantee. – Desperado17 Oct 16 '20 at 14:32
  • @davidbak The latter is the case. If you think the term garbage collection is inappropriate for this context, I apologize. – Desperado17 Oct 16 '20 at 14:34
  • @Desperado17 - no reason to apologize - I'm just suggesting that calling it "GC" brings to mind certain restrictions on your solution space that don't apply to your problem, which apparently is simply delayed deletion of objects on a separate thread in order to unburden the main processing thread (or threads). Thinking of it as delayed deletion instead will free you to consider solutions that _don't_ involve checking an atomic on each object on each "GC" iteration ... such as the linked-list idea I proposed above (but there are others). – davidbak Oct 16 '20 at 14:39
  • @MSalters I just realized I likely cannot use atomic_flag since test() is only supported on c++20 and we may only use 11. – Desperado17 Oct 16 '20 at 22:43
  • @Desperado17: You could use `test_and_set` followed by `clear`, but I admit that's ugly. – MSalters Oct 16 '20 at 22:56

2 Answers

3

An obvious modification could be to specify memory_order_relaxed to minimise memory barriers.

See https://en.cppreference.com/w/cpp/atomic/memory_order

and https://bartoszmilewski.com/2008/12/01/c-atomics-and-memory-ordering/

Also see Herb Sutter's classic "Atomic Weapons" : https://channel9.msdn.com/Shows/Going+Deep/Cpp-and-Beyond-2012-Herb-Sutter-atomic-Weapons-1-of-2

m_Delete.store(true, std::memory_order_relaxed);
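For the reader side, the matching relaxed load would look like the sketch below (assuming the std::atomic<bool> member m_Delete from the question). ThreadSanitizer is still satisfied because the access is atomic, and on AArch64 a relaxed load should compile to a plain ldrb rather than the ldarb shown in the perf trace:

bool shouldBeDeleted() const
{
    // Relaxed: atomicity only, no ordering guarantees and no barrier.
    return m_Delete.load(std::memory_order_relaxed);
}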

Caveat (see articles above) - if there are any co-dependencies on the object being flagged for deletion (e.g. another state variable, freeing resources etc.) then you may need to use memory_order_release so that setting the can-be-deleted flag happens last and is not reordered earlier by the compiler optimiser.

Assuming the "garbage collector" checks only the can-be-deleted flag, it would not need memory_order_acquire in the load; relaxed would be sufficient. Otherwise it would need acquire to guarantee that any co-dependent accesses are not reordered to occur before reading the flag.
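A minimal sketch of that release/acquire pairing, again assuming the m_Delete member from the question. Note that on AArch64 the acquire load is the ldarb already visible in the question's perf trace, so this keeps the original cost - but only where the ordering is actually required:

// Writer thread: any state the deletion thread may touch must be written
// before this release-store, which cannot be reordered after those writes.
void markForDelete()
{
    m_Delete.store(true, std::memory_order_release);
}

// Deletion thread: the acquire-load pairs with the release-store above, so
// once it observes true, the writer's earlier writes are visible here.
bool shouldBeDeleted() const
{
    return m_Delete.load(std::memory_order_acquire);
}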

Den-Jason
  • And how would that look? m_Delete.store( true ,std::memory_order_relaxed); ? – Desperado17 Oct 16 '20 at 14:13
  • @dyp Since this is for GC, I'm not sure the usual dangers apply though. It's OK if a GC run misses an object. It will get it the next time around. – Nikos C. Oct 16 '20 at 14:20
  • @NikosC. MSVC C++StdLib had a bug in `shared_ptr`'s memory order, where a missing acq/rel caused UB on destruction of the refcounted object. We could have the same here if the GC is accessing any other part of the object. – dyp Oct 16 '20 at 14:22
  • I now added load and store. The perf trace looks the same with the _ZNKSt13__atomic_baseIbE4loadESt12memory_order(): and the atomic load as seen in the original post. What else could I try? – Desperado17 Oct 16 '20 at 14:55
  • @Desperado17 have you upped the compiler optimization and declared methods `inline`? Setting the flag *should* reduce to a simple store operation without needing to make function calls. Perhaps also consider having the flag *first* in the object's member list (i.e. is there more than a cache line's worth of attributes?) We're really getting into shaving off nanoseconds here though. – Den-Jason Oct 17 '20 at 16:11
  • @Den-Jason Optimization is O2. Should I go beyond? – Desperado17 Oct 18 '20 at 18:42
  • @Desperado17 yes absolutely. Check out Godbolt's "what has my compiler done for me lately" presentation here: https://www.youtube.com/watch?v=bSkpMdDe4g4 – Den-Jason Oct 19 '20 at 07:41
2

The problem (as clarified in a comment by the OP) is not a true GC but is instead delayed deletion of objects on a separate thread, so as to unburden the main processing threads from the time it takes to do the delete. All objects to be deleted are marked as such at some point; at a later time the deletion thread comes along and deletes them.

Consider first: Is it really the case that delayed deletion is necessary in order to meet the program's performance goals - specifically, latency? It might just be extra overhead that actually impacts latency. (Or perhaps there are also different performance goals, e.g., throughput, to be considered.) Delayed deletion isn't an obvious performance win in all cases - you need to find out if it is appropriate in each case. (For example, it might not even be necessary for all deletions: perhaps some deletions can be done immediately in-line without impacting performance while others need to be deferred. This could be because, for example, different processing threads are doing different things with different latency/throughput requirements.)

Now to a solution: Since we're talking deferred deletion - there is no reason the deletion thread needs to scan all objects looking for the ones to delete (each time it does a full scan). Instead, pay a slightly larger cost at the time you mark the object for deletion and pay no cost to scan all objects. Do this by linking deleted objects onto a deletion work list. There is a synchronization cost there (which can be minimized in various ways besides obvious locks) but it is paid once per object not once per object per scan.

(Doesn't have to be a linked list either. If there is an upper bound to how many objects can be deleted in a period of time you can just use an appropriate array.)
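A minimal sketch of such a deletion work list (hypothetical names, not code from the question). It uses a vector plus the obvious mutex; the synchronization can be reduced further, but the lock is taken once per object, never once per object per scan:

#include <memory>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical deletion work list: processing threads push objects they are
// done with; the deletion thread periodically drains and destroys them.
template <typename T>
class DeletionQueue {
public:
    // Replaces markForDelete(): one short critical section per object.
    void push(std::unique_ptr<T> obj)
    {
        std::lock_guard<std::mutex> lock(m_Mutex);
        m_Pending.push_back(std::move(obj));
    }

    // Called by the deletion thread: swap the whole batch out under the
    // lock, then destroy the objects without holding it.
    void drain()
    {
        std::vector<std::unique_ptr<T>> batch;
        {
            std::lock_guard<std::mutex> lock(m_Mutex);
            batch.swap(m_Pending);
        }
        batch.clear();  // unique_ptr destroys each object here
    }

private:
    std::mutex m_Mutex;
    std::vector<std::unique_ptr<T>> m_Pending;
};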

There are other possibilities that are opened up by characterizing this problem more precisely as "deferred deletion" rather than "garbage collection": some constraints are lifted (perhaps others are added).

davidbak
  • There's no real need for an upper bound to the number of objects to be deleted. Put in an arbitrary but reasonably high limit (e.g. 1000) that's not likely to be exceeded. If that does happen, just stop scanning (blocking the scan thread) until the cleanup thread has freed up space again. Any decent OS will use time slices long enough to let the cleanup thread delete all those 1000 objects in one go. This is better for the CPU cache. – MSalters Oct 16 '20 at 23:02
  • Also, I think you can keep the synchronization down even further. Keep a small thread-local cache (e.g. 10 items) and batch the updates to the shared deletion list (see the sketch after this comment thread). – MSalters Oct 16 '20 at 23:06
  • @MSalters - yes, your suggestion works for many use cases. and latency limits or not, almost any use case at all has to ensure you free as much as you use in the steady state. there are many variations - and also, many other ways to tailor this approach to fit a "delayed deletion" need. – davidbak Oct 16 '20 at 23:07
  • @MSalters - exactly (w.r.t. to your thread-local cache comment) - and BTW - that's one reason why it isn't a slam dunk that delayed deletion is a win - there are so many competing considerations you've got to measure it to be sure. – davidbak Oct 16 '20 at 23:08
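A self-contained sketch of that batching idea (hypothetical names, a variant of the work-list sketch in the answer above): each processing thread fills a small local buffer and appends it to the shared deletion list in one go, so the lock is taken once per batch rather than once per object.

#include <cstddef>
#include <memory>
#include <mutex>
#include <utility>
#include <vector>

template <typename T>
class SharedDeleteList {
public:
    // Append a whole batch under a single lock acquisition.
    void appendBatch(std::vector<std::unique_ptr<T>>& batch)
    {
        std::lock_guard<std::mutex> lock(m_Mutex);
        for (auto& obj : batch)
            m_Pending.push_back(std::move(obj));
        batch.clear();
    }
    // ... drain() as in the DeletionQueue sketch above ...
private:
    std::mutex m_Mutex;
    std::vector<std::unique_ptr<T>> m_Pending;
};

template <typename T>
class LocalDeleteBatch {
public:
    LocalDeleteBatch(SharedDeleteList<T>& shared, std::size_t batchSize = 10)
        : m_Shared(shared), m_BatchSize(batchSize) {}

    void markForDelete(std::unique_ptr<T> obj)
    {
        m_Local.push_back(std::move(obj));  // no synchronization here
        if (m_Local.size() >= m_BatchSize)
            flush();
    }

    void flush() { m_Shared.appendBatch(m_Local); }  // one lock per batch

private:
    SharedDeleteList<T>& m_Shared;
    std::size_t m_BatchSize;
    std::vector<std::unique_ptr<T>> m_Local;  // one instance per thread
};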