
I am trying to profile multi-threaded workloads with gperftools, but I am having difficulty interpreting the output. I wrote a simple program that launches two threads with identical workloads and profiled it with the gperftools CPU profiler. The output shows the two thread functions, but the overhead attributed to each varies hugely from one run to the next. Since the workloads are identical, I would expect both functions to show roughly equal results; in fact, one might be 90% and the other 10%, or sometimes 80%/20%, 95%/5%, etc. I don't understand why the functions show different overhead, or why the split changes so much between runs. The benchmark runs for about 5 seconds and collects about 1600 samples, so it should be stable.

Is there any documentation that explains how profiling works for multi-threaded workloads, and how to interpret the output? For example, does the profiler take a backtrace of every thread on every sample, and if not, what is it doing?

#include <vector>
#include <cstdlib>
#include <thread>
using namespace std;

// Workload: 100,000 random erase/insert pairs on a 500,000-element vector.
void thread_func() {
    const int size = 500000;
    vector<int> V(size);
    for (int i = 0; i < 100000; i++) {
        V.erase(V.begin() + (rand() % size));
        V.insert(V.begin() + (rand() % size), rand() % 10);
    }
}

// Identical workload to thread_func, under a second name so each thread
// shows up separately in the profile.
void thread_func2() {
    const int size = 500000;
    vector<int> V(size);
    for (int i = 0; i < 100000; i++) {
        V.erase(V.begin() + (rand() % size));
        V.insert(V.begin() + (rand() % size), rand() % 10);
    }
}

int main() {
    srand(1234);
    thread t1(thread_func);
    thread t2(thread_func2);
    t1.join();
    t2.join();
    return 0;
}

Sample output:

0   0.0% 100.0%     1429  89.3% thread_func
0   0.0% 100.0%      172  10.7% thread_func2

Where do the 89.3% and 10.7% come from? (These are the percentages of total samples in the function and its callees.)

[Image: a portion of the profile graph; the numbers differ slightly from the output above because it is from a different run.]

  • What do the columns mean in your output? I think this is especially useful to know for the columns which differ. Also, if you visualized the results a bit more, it might already shed some light. – hakre Sep 15 '17 at 15:24
  • The interesting columns are the 4th and 5th: the number of samples in this function and its callees, and the percentage of total samples in this function and its callees. The first two columns relate to samples inside the listed function itself, and the 3rd column is the cumulative percentage of all functions printed so far in the output. – Niall Murphy Sep 15 '17 at 15:33
  • I guess what I would like to know is, for each sample, is the profiler doing a backtrace of every running thread? And if not, what is it doing? – Niall Murphy Sep 15 '17 at 15:35
  • Sounds more like guessing than ... anyway, I'm no gperftools pro, so I can't tell you that; I was just commenting since it's your first post. – hakre Sep 15 '17 at 15:44

1 Answer


This is a known issue with SIGPROF signal delivery. See https://github.com/golang/go/issues/14434 for some details.

Gperftools actually has a "fix" for the skew (as mentioned in that issue). You just need to set CPUPROFILE_PER_THREAD_TIMERS=t and ensure that librt and libpthread are linked. You'll also need to "register" your threads, or LD_PRELOAD https://github.com/alk/gperf-all-threads
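A minimal sketch of what that setup might look like for the program in the question — the file names are hypothetical, and this assumes gperftools, librt, and libpthread are available:

```shell
# Link librt and libpthread explicitly so per-thread timers are available.
g++ -O2 -g -pthread bench.cc -lprofiler -lrt -o bench

# Enable per-thread interval timers so samples are charged to the thread
# that actually consumed the CPU time, rather than skewed toward one thread.
CPUPROFILE=bench.prof CPUPROFILE_PER_THREAD_TIMERS=t ./bench

# Inspect the sample split between the two thread functions.
pprof --text ./bench bench.prof
```

Note that with per-thread timers each thread must be registered with the profiler — either by calling ProfilerRegisterThread() from the thread, or via the LD_PRELOAD shim linked above, which does it for you.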

  • @Aliaksei Kandratsenka, thanks for the nice answer. I have a follow up question and I am not sure if it deserves a separate post: with gperftools' multi-thread support turned on, how do I output profiling results for each thread separately? With your setup it seems to me all results are collected in a single output file specified by CPUPROFILE. – Leo Fang Feb 22 '18 at 20:50
  • There is no way to track thread-specific profiles. There is bare-bones support for profiling just a specific subset of threads; see the ProfileOptions struct. Profiling specific threads is occasionally quite useful. – Aliaksei Kandratsenka Feb 25 '18 at 02:01