Finding threading bottlenecks and optimizing for wall-time with perf

Question

Sampling cpu-cycles with perf record is useful for finding optimization candidates if core utilization is roughly constant. But for code that has multiple phases differing in parallelism, counting cpu-cycles will emphasize heavily parallel phases while under-emphasizing sequential or limited-parallelism phases that impact wall-time. In short, naïve perf use may highlight the wrong limb of Amdahl's law.

So the question is how to get perf record/perf report to find optimization candidates for reducing wall-time, which could be anything from the hottest loop in consistently parallel code, through a moderately parallel bottleneck, to a long single-threaded phase.
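To make the skew concrete, here is a small arithmetic sketch. The two phases, their durations, and their thread counts are made-up numbers, and the model assumes that cpu-cycle samples are proportional to busy CPU time (threads fully busy, constant clock frequency), which real workloads only approximate.

```python
# Hypothetical two-phase workload: (name, wall-clock seconds, busy threads).
# All numbers are illustrative assumptions, not measurements.
phases = [
    ("sequential setup", 10.0, 1),
    ("parallel compute", 10.0, 16),
]

def shares(phases):
    """Return {name: (wall_share, cpu_share)} for each phase.

    cpu_share approximates the fraction of cpu-cycle samples a phase
    receives, assuming samples scale with wall time * busy threads.
    """
    total_wall = sum(wall for _, wall, _ in phases)
    total_cpu = sum(wall * threads for _, wall, threads in phases)
    return {
        name: (wall / total_wall, wall * threads / total_cpu)
        for name, wall, threads in phases
    }

result = shares(phases)
for name, (wall_share, cpu_share) in result.items():
    print(f"{name}: {wall_share:.1%} of wall-time, "
          f"{cpu_share:.1%} of cpu-cycle samples")
```

Under these assumptions the sequential phase accounts for half of the wall-time but only about 6% of the samples, so a report sorted by sample count would rank it far below the parallel phase.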

Known workarounds that leave something to be desired:

executing the workload on a single core so that wall-time ≅ cpu-cycles
profiling individual components separately

(meta: this is a perf-specific follow-up to a more general question)

Insofar as you have asked five questions and given 1000 answers, today's question is a rare event, isn't it? — thb, Mar 13 '19 at 00:12
If your parallel program has something like OpenMP or MPI parallelism, there is no oversubscription, and threads are bound to cores (OMP_PROC_BIND, affinity), you can profile only the CPU core running the main thread (`perf record -C 0 ./omp_program` or `perf report -C 0`) - it will partially remove the wrong limb. Second idea - do a diff between the main thread and a worker thread (`-C 1`). Third idea: add signalling using trace events to your parallel library and try to use `--switch-on`/`--switch-off` of [perf-report](http://man7.org/linux/man-pages/man1/perf-report.1.html). Could you add an example? — osgx, Feb 25 '20 at 10:33

score 2 · Accepted Answer · answered Mar 27 '21 at 14:39


KDAB Hotspot is a GUI that can analyze perf record output and also show context switches and core utilization if the profiles have been recorded with `-e sched:sched_switch --switch-events --sample-cpu`.
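As a rough sketch of how such a recording might be launched, the helper below assembles a perf record command line containing the flags mentioned above. The helper name `hotspot_record_cmd`, the workload `./my_workload --threads 8`, and the extra options `-e cycles`, `--call-graph dwarf` and `-o perf.data` are illustrative assumptions rather than part of the answer; recording the `sched:sched_switch` tracepoint typically also needs elevated privileges or a relaxed `perf_event_paranoid` setting.

```python
import shlex

def hotspot_record_cmd(workload, output="perf.data"):
    """Build a perf record argv list (hypothetical helper).

    Only the sched:sched_switch, --switch-events and --sample-cpu flags
    come from the answer; the remaining options are assumptions.
    """
    return [
        "perf", "record",
        "-e", "cycles",              # assumption: regular cycle sampling
        "-e", "sched:sched_switch",  # context-switch tracepoint
        "--switch-events",           # record context-switch events
        "--sample-cpu",              # record the CPU of each sample
        "--call-graph", "dwarf",     # assumption: DWARF-based call stacks
        "-o", output,                # assumption: output file name
        "--", *workload,
    ]

# Hypothetical workload binary and arguments.
cmd = hotspot_record_cmd(["./my_workload", "--threads", "8"])
print(shlex.join(cmd))
```

The list can be passed to `subprocess.run(cmd, check=True)`, and the resulting file should then open in Hotspot with `hotspot perf.data`.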


the8472


no, and I don't think comments are appropriate for that kind of troubleshooting. – the8472 Oct 15 '21 at 18:43
