
A few days ago I happened to watch this very interesting presentation by Stephan T. Lavavej, which mentions the "We Know Where You Live" optimization (sorry for using the acronym in the question title, SO warned me the question might have been closed otherwise), and this beautiful one by Herb Sutter on machine architecture.

Briefly, the "We Know Where You Live" optimization consists of placing the reference counters in the same memory block as the object that make_shared is creating, thus resulting in a single memory allocation rather than two and making shared_ptr more compact.

After summing up what I learnt from the two presentations above, however, I started to wonder whether the WKWYL optimization might degrade performance when a shared_ptr is accessed by multiple threads running on different cores.

If the reference counters are close to the actual object in memory, in fact, they are more likely to be fetched into the same cache line as the object itself. If I got the lesson correctly, this in turn makes it more likely that threads will slow down while competing for the same cache line, even when they do not need to.

Suppose one of the threads needs to update the reference counter several times (e.g. when copying the shared_ptr around), while the other ones just need to access the pointed object: isn't this going to slow down the execution of all threads by making them compete for the same cache line?

If the refcount lived somewhere else in memory, I would say contention would be less likely to arise.

Does this make a good argument against using make_shared() in similar cases (as long as it implements the WKWYL optimization, of course)? Or is there a fallacy in my reasoning?

sehe
Andy Prowl
  • Can you summarize what the "WKWYL" optimization is? It is not really a standard expression. In fact, it is so uncommon, this page is already on the front page of a search for it. :) – Andrew Tomazos Jan 15 '13 at 16:30
  • @AndrewTomazosFathomlingCorps: it means "We Know Where You Live". basically, `make_shared()` places the reference counters in the same memory block as the object it creates. see slide 6 of the presentation by Stephan T. Lavavej I linked – Andy Prowl Jan 15 '13 at 16:32
  • Is the alternative you are comparing to storing the reference count with the shared object - storing it in a second dynamically allocated block? If so you have avoided the false sharing issue, but have made a much worse double payment for a second dynamic allocation, which is orders of magnitude more expensive than a cache miss. – Andrew Tomazos Jan 15 '13 at 16:36
  • @AndrewTomazosFathomlingCorps: yes, which is basically what happens when you do just `shared_ptr<A> pA(new A(...))`. – Andy Prowl Jan 15 '13 at 16:39
  • @AndrewTomazosFathomlingCorps: what do you mean by "have made a much worse double payment for a second dynamic allocation"? that happens only once, while accessing the object and copying the `shared_ptr` around may happen a million times. – Andy Prowl Jan 15 '13 at 16:41
  • If you're copying around a shared_ptr a million times then it will be better to do manual memory management (have a single owner object holding a unique_ptr that covers the lifetime of the shared object). A shared_ptr has to do an atomic inc and dec when it updates its reference count, so the reference count can't use the caches anyway and is much slower to copy than a raw pointer (which is just an integer copy). – Andrew Tomazos Jan 15 '13 at 16:45
  • @AndrewTomazosFathomlingCorps: if that is meant to be an answer to my question, then feel free to post it as an answer. I'll wait till I get some more and accept it if I find it the most convincing – Andy Prowl Jan 15 '13 at 16:50
  • There's not necessarily a fallacy in your reasoning, but you have to consider the general use case. Constant access by multiple threads in your scenario is a far less common use. Put another way, across all domains the optimization pays off, but maybe not on yours. – GManNickG Jan 15 '13 at 16:51
  • @GManNickG: that's kind of the answer I expected. the guru advice seems to be "*always* use make_shared". I just wanted to know how often "always" is. I understand there are also advantages in exception safety, but it might not be true that there is no trade-off to make – Andy Prowl Jan 15 '13 at 16:55
  • When the guru says "*Always* do this young Padawan", he leaves it to you to discover the very rare exceptions. That's part of your training. – Bo Persson Jan 15 '13 at 18:26
  • "Always do this" could be taken to mean, "always do it; then profile; then do whatever unspeakable things are necessary to get your code to run fast enough, ignoring the usual advice if necessary". – Steve Jessop Jan 15 '13 at 18:28
  • I think, as Herb pointed out in an edit to an answer below, that "WKWYL" does not refer to allocating the control block and object at the same location, it refers to storing the object's address only once. All `make_shared` implementations allocate a single block, but Boost's and GCC's store the object's address once in the control block and once in the `shared_ptr`. MSVC's only stores it once, because it knows where it lives. So your question is about `make_shared` not about WKWYL. (It would be very easy to make that additional optimization in GCC's implementation, but I didn't think of it.) – Jonathan Wakely Jan 15 '13 at 23:32
  • @JonathanWakely: thank you for pointing that out, makes it clearer - i am still learning so it's possible that i made some confusion – Andy Prowl Jan 15 '13 at 23:40
  • @SteveJessop: right, done. – Andy Prowl Jan 16 '13 at 12:38

3 Answers


If that's your usage pattern then sure, make_shared will result in "false sharing", which is the name I know for different threads using the same cache line even though they aren't accessing the same bytes.

The same is true for any object of which nearby parts are used by different threads (one of which is writing). In this case the "object" is the combined block created by make_shared. You could as well ask whether any attempt to benefit from data locality can backfire in cases where proximal data is used in different threads more-or-less simultaneously. Yes, it can.

One can conclude that contention is less likely to arise if every writable part of every object is allocated in distant locations. So, usually the fix for false sharing is to spread things out (in this case, you could stop using make_shared or you could put padding into the object to separate its parts into different cache lines).

As against that, when the different parts are used in the same thread, if you've spread them through memory then there's a cost to that, because there's more to fetch into cache. Since spreading things out has its own costs, that might not actually help for quite so many apps as you'd first think. But no doubt it's possible to write code for which it does help.

Sometimes the benefit of make_shared has nothing to do with cache lines and locality: it simply makes one dynamic allocation instead of two. The value of that depends on how many objects you allocate and free: it might be negligible, or it might be the difference between your app fitting in RAM and swapping like crazy; in extreme cases it might even be what allows your app to complete all the allocations it needs.

FYI, there's another situation where you might not want to use make_shared: when the object isn't small and you have weak pointers that significantly outlive the shared_ptrs. The reason is that the control block isn't freed until the weak pointers are gone, so if you used make_shared then the whole memory occupied by the object isn't freed until the weak pointers are gone. The object is destructed as soon as the last shared pointer is, of course, so it's just the size of the class that matters, not associated resources.

Steve Jessop
  • I upvoted both questions, but this one seems to give me a broader conceptual viewpoint on the nature of the trade-off which is involved (and on the fact that indeed there is a compromise to be made) and offers IMO more intriguing insights. – Andy Prowl Jan 15 '13 at 19:37

Note that allocating the ref count isn't directly about the WKWYL optimization -- that's the primary intended effect of std::make_shared itself. You have full control: use make_shared<T>() to save an allocation and put the reference count with the object, or use shared_ptr<T>( new T() ) to keep it separate.

Yes, if you place the object and the reference count in the same cache line, it might lead to performance degradation due to false sharing, if the reference count is updated frequently while the object is only read.

However, the way I see it there are a few reasons why this isn't a deciding factor against doing this optimization:

  1. In general you don't want the reference count to change frequently, since that by itself is a performance problem (atomic operations, several threads accessing it, ...) which you want to avoid (and probably can for most cases)
  2. Doing this optimization doesn't necessarily incur the potential extra performance problems you described. For those to occur, the reference count and (parts of) the object need to be in the same cache line. That could easily be avoided by adding appropriate padding between the reference count (plus other control data) and the object. In that case the optimization would still do only one allocation instead of two and therefore still be beneficial. However, for the more likely case which doesn't trigger this behaviour, it would be slower than the non-padded version, since the non-padded version benefits from better locality (the object and reference count being in the same cache line). For this reason I think this variant is a possible optimization for highly threaded code, but not necessarily one to make in the standard version.
  3. If you know how shared_ptr is implemented on your platform, you can emulate the padding, either by inserting padding into the object or (possibly, depending on the order in memory) by giving it a deleter that includes enough padding.
Herb Sutter
Grizzly
  • Also important to keep in mind is that one can always write `make_shared_blocked` (or whatever name is appropriate) to simply do `return shared_ptr(make_unique(Args...));` to keep them separate yet be exception safe. – GManNickG Jan 15 '13 at 17:39

Suppose one of the threads needs to update the reference counter several times (e.g. when copying the shared_ptr around), while the other ones just need to access the pointed object: isn't this going to slow down the execution of all threads by making them compete for the same cache line?

Yes, but is that a realistic scenario?

In my code the threads that copy the shared_ptr do so because they want to share ownership of the object so they can use it. If the threads making all those reference-count updates don't care about the object, why are they bothering to share in ownership of it?

You can mitigate the problem by passing around const shared_ptr& references and only making (or destroying) a copy when you actually want to own and access the object, e.g. when transferring it across thread or module boundaries or when taking ownership of the object to use it.

In general, intrusive reference counts outperform external reference counts (see Smart Pointer Timings) precisely because they're on a single cache line and so you don't need to use up two precious cache lines for the object and its refcount. Remember that if you've used up an extra cache line that's one less cache line for everything else, and something will get evicted and you'll get a cache miss when that is next needed.

Jonathan Wakely
  • "If the threads making all those reference-count updates don't care about the object, why are they bothering to share in ownership of it?" Maybe that thread cares about reading the object but not writing it? I think the core issue is that copying a `shared_ptr` is a store to the control block. So you can have a situation where with `new` you have a cache line (the object) that's read from many threads and another cache line (the control block) written from one thread, and no contention. With `make_shared` you have false sharing. Of course I agree that `make_shared` tends to be worthwhile. – Steve Jessop Jan 16 '13 at 10:04
  • Even if some thread wants to only read the object, how many times does it need to touch the refcount? My point is firstly that in cases where `make_shared` gives false sharing, if the ref-count updates are causing too much contention then maybe you're updating the ref-counts more often than necessary, and secondly that in the majority of cases the sharing isn't false and is desirable. – Jonathan Wakely Jan 16 '13 at 12:59