lock contention in memory allocation - multi-threaded vs. multi-process

Question

We have developed a big C++ application that is running satisfactorily at several sites on big Linux and Solaris boxes (up to 160 CPU cores or even more). It's a heavily multi-threaded (1000+ threads), single-process architecture, consuming huge amounts of memory (200 GB+). We are LD_PRELOADing Google Perftools' tcmalloc (or libumem/mtmalloc on Solaris) to avoid memory allocation performance bottlenecks, with generally good results. However, we are starting to see adverse effects of lock contention during memory allocation/deallocation on some bigger installations, especially after the process has been running for a while (which hints at aging/fragmentation effects of the allocator).

We are considering changing to a multi-process/shared memory architecture (the heavy allocation/deallocation will not happen in shared memory, rather on the regular heap).

So, finally, here's our question: can we assume that the virtual memory manager of modern Linux kernels is capable of efficiently handing out memory to hundreds of concurrent processes? Or do we have to expect running into the same kind of problems with memory allocation contention that we see in our single-process/multi-threading environment? I tend to hope for a better overall system performance, as we would no longer be limited to a single address space, and that having several independent address spaces would require less locking on the part of the virtual memory manager. Does anyone have any actual experience or performance data comparing multi-threaded vs. multi-process memory allocation?

Hint: use vertical spacing (also called paragraphs); helps a lot with text readability. I hope you write better code than normal text. — GhostCat, Sep 15 '16 at 11:18

score 1 · Answer 1 · answered Sep 15 '16 at 11:17

> I tend to hope for a better overall system performance, as we would no longer be limited to a single address space, and that having several independent address spaces would require less locking on the part of the virtual memory manager.

There is no reason to expect this. Unless your code is so badly designed that it constantly goes back to the OS to allocate memory, it won't make any significant difference. Your application should only need to go back to the OS's virtual memory manager when it needs more virtual memory, which should not occur significantly once the process reaches its stable size.

If you are constantly allocating and freeing all the way back to the OS, you should stop doing that. If you're not, then you can keep multiple pools of already-allocated memory that can be used by multiple threads without contention. And, as a bonus of staying multi-threaded, context switches between threads of the same process are cheaper because the TLB doesn't have to be flushed.
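
To illustrate the "multiple pools" idea, here is a minimal sketch of a fixed-size block pool with one instance per thread. The names `BlockPool` and `local_pool`, the 64-byte block size, and the chunk size are all hypothetical choices for illustration, not anything from the application in the question. The sketch assumes that every block is returned on the thread that allocated it and does not outlive that thread.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>
#include <vector>

// Hypothetical fixed-size block pool. Each thread owns its own instance,
// so allocate() and deallocate() need no locks.
class BlockPool {
public:
    explicit BlockPool(std::size_t block_size, std::size_t blocks_per_chunk = 1024)
        : block_size_(round_up(block_size)), blocks_per_chunk_(blocks_per_chunk) {
        assert(blocks_per_chunk_ > 0);
    }

    // Chunks go back to malloc only when the pool (i.e. the thread) dies.
    ~BlockPool() {
        for (void* chunk : chunks_) std::free(chunk);
    }

    BlockPool(const BlockPool&) = delete;
    BlockPool& operator=(const BlockPool&) = delete;

    // Pop a block from the free list; call malloc only when the list is empty.
    void* allocate() {
        if (free_list_ == nullptr) refill();
        Node* node = free_list_;
        free_list_ = node->next;
        return node;
    }

    // Push a block back onto the free list for later reuse.
    void deallocate(void* p) noexcept {
        assert(p != nullptr);
        free_list_ = new (p) Node{free_list_};
    }

    std::size_t chunk_count() const noexcept { return chunks_.size(); }

private:
    struct Node { Node* next; };

    // Blocks must hold a free-list link and stay suitably aligned.
    static std::size_t round_up(std::size_t n) {
        const std::size_t align = alignof(std::max_align_t);
        if (n < sizeof(Node)) n = sizeof(Node);
        return (n + align - 1) / align * align;
    }

    // Grab one large chunk and carve it into blocks.
    void refill() {
        chunks_.reserve(chunks_.size() + 1);  // so push_back below cannot throw
        char* chunk = static_cast<char*>(std::malloc(block_size_ * blocks_per_chunk_));
        if (chunk == nullptr) throw std::bad_alloc();
        chunks_.push_back(chunk);
        for (std::size_t i = 0; i < blocks_per_chunk_; ++i)
            deallocate(chunk + i * block_size_);
    }

    std::size_t block_size_;
    std::size_t blocks_per_chunk_;
    Node* free_list_ = nullptr;
    std::vector<void*> chunks_;
};

// One pool per thread for 64-byte blocks (the size is an arbitrary example).
inline BlockPool& local_pool() {
    thread_local BlockPool pool(64);
    return pool;
}
```

A thread would call `local_pool().allocate()` and `local_pool().deallocate(p)` on its hot path. With threads that live only a few minutes, the same-thread assumption matters: an object handed to another thread, or one that outlives its creator, must not come from a pool like this.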

Only if you can't reduce the frequency of address space changes (for example, if you must map and unmap files) or if you have to change other shared resources (like file descriptors) should you look at multiprocess options.

So you're saying that it's not a question of multi-threading vs. multi-processing but rather one of a prudent memory economy, right? Some more background: our application constantly creates new threads on demand that only live several minutes; there is no thread pooling. Memory allocation/deallocation is by means of plain C++ new and delete (which map to plain malloc/free, which in turn get redirected to tcmalloc). Are you suggesting that we should consider implementing our own C++ memory management layer, i.e. overloading operator new? — Paul T., Sep 16 '16 at 06:50
@PaulT. If you're already using an excellent `malloc` implementation that's optimized for your use case and you're correct that it's allocation performance that's a bottleneck, that would be a possible next step. (Confirm those first two things first though.) — David Schwartz, Sep 16 '16 at 16:22
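
For reference, a rough sketch of what "overloading operator new" could look like when scoped to a single class instead of replacing the global operators. `Request`, `FreeCache`, the 56-byte payload, and the cache cap of 4096 are hypothetical names and numbers chosen for illustration. Note that tcmalloc already keeps per-thread caches of its own, so a layer like this only makes sense after profiling shows it helps.

```cpp
#include <cassert>
#include <cstddef>
#include <new>

// Hypothetical per-thread cache of freed, equally sized blocks.
class FreeCache {
public:
    FreeCache() = default;
    FreeCache(const FreeCache&) = delete;
    FreeCache& operator=(const FreeCache&) = delete;

    // Hand any cached blocks back to the global allocator at thread exit.
    ~FreeCache() {
        while (head_ != nullptr) {
            Node* node = head_;
            head_ = node->next;
            ::operator delete(node);
        }
    }

    // Returns a cached block, or nullptr if the cache is empty.
    void* pop() noexcept {
        Node* node = head_;
        if (node != nullptr) { head_ = node->next; --size_; }
        return node;
    }

    // Returns false when the cache is full and the caller must free p itself.
    bool push(void* p) noexcept {
        assert(p != nullptr);
        if (size_ >= kMaxCached) return false;
        head_ = new (p) Node{head_};
        ++size_;
        return true;
    }

private:
    struct Node { Node* next; };
    static constexpr std::size_t kMaxCached = 4096;  // arbitrary cap
    Node* head_ = nullptr;
    std::size_t size_ = 0;
};

// Hypothetical frequently allocated type.
class Request {
public:
    // Reuse a block cached by this thread; otherwise fall through to the
    // global operator new (and thus to malloc/tcmalloc).
    static void* operator new(std::size_t size) {
        if (size == sizeof(Request)) {
            if (void* p = cache().pop()) return p;
        }
        return ::operator new(size);
    }

    // Keep the block in this thread's cache if there is room.
    static void operator delete(void* p, std::size_t size) noexcept {
        if (p == nullptr) return;
        if (size == sizeof(Request) && cache().push(p)) return;
        ::operator delete(p);
    }

    char payload[56] = {};

private:
    static FreeCache& cache() {
        thread_local FreeCache c;
        return c;
    }
};

static_assert(sizeof(Request) >= sizeof(void*), "block must hold a free-list link");
```

Call sites stay unchanged: `new Request` and `delete r` pick up the class-scoped operators automatically. Because every block originally comes from the global `operator new`, a `Request` freed on a different thread simply lands in that thread's cache. The sketch does not handle a `Request` being deleted during thread shutdown after the cache itself has been destroyed.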
