I'm trying to benchmark a piece of DSP code on a Raspberry Pi 4 using std::chrono::steady_clock, but the results I'm getting are peculiar. Because GNU profiling tools don't work on the Raspberry Pi, I'm stuck with benchmarking to evaluate code optimizations, so this is rather a big deal.
What would cause performance to vary by 10% between executions of the benchmark program, while remaining consistent to within +/- 1% when the same test is run multiple times within the same execution of the program?
Results for a ~6-second benchmark vary by ~10%, but the peculiar thing is that the variance seems to be sticky for a particular execution of the benchmark. I run the benchmark three times in a row each time the program is run and get roughly the same results, +/- 1%. But when I re-run the program, the results of the three benchmarks vary by up to +/- 10% from the previous run, while the three results within the new run again agree to +/- 1%.
e.g.:
Run 1:
9:21:37. Performance: 0.0912333 x realtime
9:21:42. Performance: 0.0910667 x realtime
9:21:47. Performance: 0.0910667 x realtime
Run 2:
9:20:15. Performance: 0.106667 x realtime
9:20:21. Performance: 0.1062 x realtime
9:20:28. Performance: 0.106117 x realtime
Results vary randomly roughly between those two extremes for each run. But the peculiar thing here is that the results are consistent to +/- 1% between the three tests performed each time the program is run.
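The measurement itself is nothing exotic. Here's a minimal sketch of its shape (renderBlock, the block size, and the sample counts are illustrative stand-ins, not my actual harness):

#include <chrono>
#include <cstddef>
#include <iostream>

// Stand-in for the real LSTM->Dense render call.
static void renderBlock(std::size_t frames)
{
    (void)frames;   // ...the NEON matrix/vector arithmetic would go here...
}

// Returns the "x realtime" figure: CPU seconds spent per second of audio produced.
static double benchmarkOnce(double sampleRate, std::size_t totalFrames)
{
    using clock = std::chrono::steady_clock;
    constexpr std::size_t kBlockSize = 64;          // illustrative block size

    auto start = clock::now();
    for (std::size_t i = 0; i < totalFrames; i += kBlockSize)
        renderBlock(kBlockSize);
    auto stop = clock::now();

    double elapsedSeconds = std::chrono::duration<double>(stop - start).count();
    double audioSeconds   = static_cast<double>(totalFrames) / sampleRate;
    return elapsedSeconds / audioSeconds;           // e.g. 0.0912 x realtime
}

int main()
{
    for (int run = 0; run < 3; ++run)
        std::cout << "Performance: " << benchmarkOnce(44100.0, 44100u * 60u)
                  << " x realtime\n";
}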
I'm an experienced programmer, so I get that benchmarks will vary somewhat. But the ~10% variance is unworkable for what I'm trying to do, and I am unable to come up with a reasonable theory as to why the variance changes from invocation to invocation.
The code under test is a machine-learning algorithm (LSTM->Dense) used to generate real-time audio. The bulk of the execution (~90%) is matrix and vector arithmetic using hand-optimized NEON intrinsics. The data footprint is about 13 kB, which fits comfortably in the L1 d-cache. The code footprint is unknown and may not fit in the L1 i-cache. Most of the code pipelines beautifully, so it may be running close to the L1-cache bandwidth limit. So far, optimization has improved performance from ~0.18 x realtime to ~0.093 x realtime. I think there's probably another ~15% improvement available, but the timing inaccuracies are getting in the way at this point. The code under test gets executed three times (~0.3 x realtime in total), so further optimization is in fact critical.
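To give a sense of the inner loops involved, here's a simplified sketch of the kind of matrix-vector kernel in question (illustrative only, not the actual kernel):

#include <arm_neon.h>
#include <cstddef>

// Simplified sketch: y = W * x for a row-major matrix whose rows and input
// vector are 16-byte aligned, with cols a multiple of 4.
void matVec(const float* __restrict W, const float* __restrict x,
            float* __restrict y, std::size_t rows, std::size_t cols)
{
    for (std::size_t r = 0; r < rows; ++r) {
        const float* row = W + r * cols;
        float32x4_t acc = vdupq_n_f32(0.0f);
        for (std::size_t c = 0; c < cols; c += 4)
            acc = vfmaq_f32(acc, vld1q_f32(row + c), vld1q_f32(x + c));
        y[r] = vaddvq_f32(acc);     // horizontal add (AArch64)
    }
}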
Things that have been checked:
Not a NEON alignment problem. All matrices, matrix rows, and vectors are 16-byte aligned (checked with asserts in debug compiles).
Not a CPU frequency issue. The CPU scaling governors have been set to performance, and all CPUs are running at 1.8 GHz.
I don't think it's related to cache competition between processes. HTOP indicates ~6% CPU use at idle when connected by VNC, and about 0.3% (wifi supplicant) when connected via ssh. The pattern doesn't change significantly when connected via SSH.
I don't think it varies depending on which CPU core the code is running on -- although I'm only able to determine which core the code is running on in a particular run using HTOP, which isn't completely definitive (a sketch of checking this directly with sched_getcpu() follows this list). Test runs seem to occasionally get shifted to a different CPU core, but for the most part they seem to run on a single, randomly selected core for the duration of the three tests per execution run.
I don't think it's heat throttling. The CPU temperature is a very modest 47C, and I don't think Raspberry Pi 4s heat-throttle until they reach 80C.
Vector operations rely on GCC auto-vectorization; the code has been properly annotated with restrict declarations and verified to produce optimal NEON vectorization (with better instruction scheduling than I could produce with NEON intrinsics).
Not a timer resolution problem. Consecutive calls to std::chrono::steady_clock::now() produce increments of between 37 and 56 ns.
Not a problem with choice of clock. steady_clock, system_clock, and high_resolution_clock all exhibit the same behaviour.
Verified CPU frequency:
$ cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
performance
performance
performance
performance
$ cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq
1800000
1800000
1800000
1800000
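As referenced above, here's a sketch of how I could record the core directly rather than watching HTOP (assumes glibc's sched_getcpu(), which g++ exposes by default on Linux):

#include <sched.h>     // sched_getcpu()
#include <cstdio>

// Record which core the benchmark thread is on; calling this before and
// after each test would show any mid-test migration.
static void reportCore(const char* when)
{
    std::printf("%s: core %d\n", when, sched_getcpu());
}

int main()
{
    reportCore("before test");
    // runBenchmarkOnce();   // the actual test would go here
    reportCore("after test");
}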
Things that I don't know that you may be able to help with:
How the std::chrono::steady_clock is implemented on Raspberry Pi. Is it based on a CPU clock counter? Any details appreciated.
Whether heat throttling is reflected in /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq. I think it is, but I'm not certain (a sketch of what I could log during a run follows).
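Paths below assume Raspberry Pi OS: scaling_cur_freq is in kHz, and thermal_zone0/temp reports the SoC temperature in millidegrees C.

#include <fstream>
#include <iostream>
#include <string>

// Read a single integer from a sysfs node; returns -1 on failure.
static long readSysfsLong(const std::string& path)
{
    std::ifstream f(path);
    long value = -1;
    f >> value;
    return value;
}

int main()
{
    long khz    = readSysfsLong("/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq");
    long milliC = readSysfsLong("/sys/class/thermal/thermal_zone0/temp");
    std::cout << "cpu0 freq: " << khz << " kHz, SoC temp: "
              << (milliC / 1000.0) << " C\n";
}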
I'm clearly missing something important.
Technical Details:
- Raspberry Pi 4b 8GB
- Linux raspberrypi 5.15.61-v8+ #1579 SMP PREEMPT Fri Aug 26 11:16:44 BST 2022 aarch64 GNU/Linux
- gcc version 10.2.1 20210110 (Debian 10.2.1-6)
- Tests run under the catch2 test framework.