tcmalloc huge performance variance

Question

Our multi-threaded server has hundreds of connection threads that are responsible for I/O handling and replying to incoming requests.

There is another asynchronous thread that runs relatively heavy tasks with many allocations from time to time (say every few seconds).

Once I converted that thread to a small thread pool (i.e. those tasks now run on a different thread each time), our server usually has the same CPU usage, but it can suddenly reach a state where allocations across all operations take much more time and the overall CPU usage of the server almost doubles, from 2 cores to 3.7 cores.

My main theory so far is that I somehow changed the access pattern for the tcmalloc library and that this causes random CPU spikes. What should I look at in the tcmalloc stats in order to confirm this theory? Can it be that the same code, now running from different threads (but not simultaneously), causes tcmalloc to allocate from the central cache more often than from the thread cache?

I would think false sharing or some other form of contention between the threads in the new thread pool is a more likely explanation. — David Schwartz, Aug 19 '15 at 08:06
What's this false sharing? Tasks do not have contention between them and the thread pool is lock-free. Also, I see in the profiler that the tcmalloc portion of the CPU grows significantly compared to the normal (non-spiking) CPU state of the server. — Roman, Aug 19 '15 at 08:07
False sharing is when two threads share data in the same cache line. And I agree that's quite likely the problem. — Mats Petersson, Aug 19 '15 at 08:08
False sharing is when two threads run concurrently and access resources that appear to be separate but actually cause contention -- for example because they share a cache line. — David Schwartz, Aug 19 '15 at 08:08
Thanks, that's a very interesting idea. Can I confirm it by looking at the gperftools profiler graph? — Roman, Aug 19 '15 at 08:14

score 1 · Answer 1 · answered Sep 25 '15 at 12:12

As several commenters have suggested, false sharing might be the problem. Finding false sharing is difficult and not well supported by current tools. My research group has published research papers on the topic; at a minimum, they provide an excellent introduction to the problem of false sharing and why it is so insidious.

The tools corresponding to these research papers are available on GitHub: Sheriff, Predator.

While you could try to use one of these tools to find the problem, the easiest thing would be to give Hoard a try. Hoard is a fast, scalable malloc replacement whose design reduces the risk of allocator-induced false sharing. If replacing tcmalloc with Hoard doesn't solve your problem, then it might make sense to pursue other avenues.
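To make the false sharing discussed above concrete, here is a minimal sketch (not taken from the papers or tools mentioned). It assumes a 64-byte cache line, which is common on x86-64 but not universal, and all names in it (`PackedCounters`, `PaddedCounters`, `run_two_writers`) are made up for illustration. Two threads each increment their own counter; in the packed layout the counters most likely share a cache line, in the padded layout they cannot.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <thread>

// Assumption: 64-byte cache lines. Adjust for your hardware.
constexpr std::size_t kCacheLine = 64;

// Two logically independent counters that most likely share one cache line.
struct PackedCounters {
    std::atomic<std::uint64_t> a{0};
    std::atomic<std::uint64_t> b{0};
};

// The same counters, each forced onto its own cache line.
struct PaddedCounters {
    alignas(kCacheLine) std::atomic<std::uint64_t> a{0};
    alignas(kCacheLine) std::atomic<std::uint64_t> b{0};
};

// Starts two threads; one increments c.a, the other c.b, `iters` times each.
// Returns the elapsed wall-clock time in milliseconds.
template <typename Counters>
double run_two_writers(Counters& c, std::uint64_t iters) {
    auto bump = [iters](std::atomic<std::uint64_t>& x) {
        for (std::uint64_t i = 0; i < iters; ++i) {
            x.fetch_add(1, std::memory_order_relaxed);
        }
    };
    const auto start = std::chrono::steady_clock::now();
    std::thread t1(bump, std::ref(c.a));
    std::thread t2(bump, std::ref(c.b));
    t1.join();
    t2.join();
    const std::chrono::duration<double, std::milli> elapsed =
        std::chrono::steady_clock::now() - start;
    // Neither thread touched the other's counter, so both totals are exact.
    assert(c.a.load() == iters && c.b.load() == iters);
    return elapsed.count();
}
```

Calling `run_two_writers` on a `PackedCounters` and then on a `PaddedCounters` with a large iteration count will often show the packed version running noticeably slower on a multi-core machine, even though the threads never touch each other's data. The size of the gap depends on the hardware and on how the threads are scheduled, so treat any single measurement with caution.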

Hoard is generally faster and more space-efficient. For example, here's a microbenchmark that repeatedly allocates and frees memory in different threads. Lower times are better (obviously!). `tcmalloc`: 6.09s (1 thread), 4.03s (4 threads) `Hoard`: 3.69s (1 thread), 1.73s (2 threads) (tested on a MacBook Air, latest builds of each) — EmeryBerger, Sep 27 '15 at 15:24
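For readers who want to try a comparison of this kind themselves, the following is a rough sketch of a microbenchmark that repeatedly allocates and frees memory in several threads. It is not the benchmark used for the numbers above; the function names, batch sizes, and block sizes are arbitrary choices made for illustration.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdlib>
#include <thread>
#include <vector>

// One worker: `rounds` times, allocate `batch` blocks and then free them all.
// Returns how many allocations succeeded.
std::size_t alloc_free_worker(std::size_t rounds, std::size_t batch,
                              std::size_t block_size) {
    std::vector<void*> blocks(batch, nullptr);
    std::size_t ok = 0;
    for (std::size_t r = 0; r < rounds; ++r) {
        for (std::size_t i = 0; i < batch; ++i) {
            blocks[i] = std::malloc(block_size);
            if (blocks[i] != nullptr) {
                static_cast<char*>(blocks[i])[0] = 1;  // touch the memory
                ++ok;
            }
        }
        for (std::size_t i = 0; i < batch; ++i) {
            std::free(blocks[i]);  // free(nullptr) is a no-op
        }
    }
    return ok;
}

struct BenchResult {
    double seconds;           // wall-clock time for all threads to finish
    std::size_t allocations;  // total successful allocations
};

// Runs the worker on `num_threads` threads concurrently and times the run.
BenchResult run_alloc_bench(unsigned num_threads, std::size_t rounds,
                            std::size_t batch, std::size_t block_size) {
    std::vector<std::size_t> counts(num_threads, 0);
    std::vector<std::thread> threads;
    const auto start = std::chrono::steady_clock::now();
    for (unsigned t = 0; t < num_threads; ++t) {
        threads.emplace_back([&counts, t, rounds, batch, block_size] {
            counts[t] = alloc_free_worker(rounds, batch, block_size);
        });
    }
    for (auto& th : threads) th.join();
    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    std::size_t total = 0;
    for (std::size_t c : counts) total += c;
    return {elapsed.count(), total};
}
```

Because the code only calls `std::malloc` and `std::free`, the same binary can be timed against different allocators by linking or preloading the allocator under test (on Linux, typically via `LD_PRELOAD`). A pattern like this exercises only same-thread alloc/free; the workload in the question, where tasks hop between pool threads, may behave differently.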
