
I'm trying to compare the efficiency of shared pointers created with the std::shared_ptr constructor versus std::make_shared.

I have the following test code:

#include <cstdlib>
#include <iostream>
#include <memory>
#include <vector>


struct TestClass {
    TestClass(int _i) : i(_i) {}
    int i = 1;
};

void sum(const std::vector<std::shared_ptr<TestClass>>& v) {
    unsigned long long s = 0u;
    for(size_t i = 0; i < v.size() - 1; ++i) {
        s += v[i]->i * v[i + 1]->i;
    }
    std::cout << s << '\n';
}

void test_shared_ptr(size_t n) {
    std::cout << __FUNCTION__ << "\n";
    std::vector<std::shared_ptr<TestClass>> v;
    v.reserve(n);
    for(size_t i = 0u; i < n; ++i) {
        v.push_back(std::shared_ptr<TestClass>(new TestClass(i)));
    }
    sum(v);
}

void test_make_shared(size_t n) {
    std::cout << __FUNCTION__ << "\n";
    std::vector<std::shared_ptr<TestClass>> v;
    v.reserve(n);
    for(size_t i = 0u; i < n; ++i) {
        v.push_back(std::make_shared<TestClass>(i));
    }
    sum(v);
}

int main(int argc, char *argv[]) {
    if(argc < 2) {
        std::cerr << "usage: " << argv[0] << " <1|2> [n]\n";
        return 1;
    }
    size_t n = (argc == 3) ? atoi(argv[2]) : 100;
    if(atoi(argv[1]) == 1) {
        test_shared_ptr(n);
    } else {
        test_make_shared(n);
    }
    return 0;
}

Compiled with g++ -W -Wall -O2 -g -std=c++14 main.cpp -o cache_misses.bin

I run the test with the std::shared_ptr constructor and check the results with valgrind:

valgrind --tool=cachegrind --branch-sim=yes ./cache_misses.bin 1 100000
==2005== Cachegrind, a cache and branch-prediction profiler
==2005== Copyright (C) 2002-2017, and GNU GPL'd, by Nicholas Nethercote et al.
==2005== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright info
==2005== Command: ./cache_misses.bin 1 100000
==2005==
--2005-- warning: L3 cache found, using its data for the LL simulation.
--2005-- warning: specified LL cache: line_size 64  assoc 12  total_size 9,437,184
--2005-- warning: simulated LL cache: line_size 64  assoc 18  total_size 9,437,184
test_shared_ptr
18107093611968
==2005==
==2005== I   refs:      74,188,102
==2005== I1  misses:         1,806
==2005== LLi misses:         1,696
==2005== I1  miss rate:       0.00%
==2005== LLi miss rate:       0.00%
==2005==
==2005== D   refs:      26,099,141  (15,735,722 rd   + 10,363,419 wr)
==2005== D1  misses:       392,064  (   264,583 rd   +    127,481 wr)
==2005== LLd misses:       134,416  (     7,947 rd   +    126,469 wr)
==2005== D1  miss rate:        1.5% (       1.7%     +        1.2%  )
==2005== LLd miss rate:        0.5% (       0.1%     +        1.2%  )
==2005==
==2005== LL refs:          393,870  (   266,389 rd   +    127,481 wr)
==2005== LL misses:        136,112  (     9,643 rd   +    126,469 wr)
==2005== LL miss rate:         0.1% (       0.0%     +        1.2%  )
==2005==
==2005== Branches:      12,732,402  (11,526,905 cond +  1,205,497 ind)
==2005== Mispredicts:       16,055  (    15,481 cond +        574 ind)
==2005== Mispred rate:         0.1% (       0.1%     +        0.0%   )

And with std::make_shared:

valgrind --tool=cachegrind --branch-sim=yes ./cache_misses.bin 2 100000
==2014== Cachegrind, a cache and branch-prediction profiler
==2014== Copyright (C) 2002-2017, and GNU GPL'd, by Nicholas Nethercote et al.
==2014== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright info
==2014== Command: ./cache_misses.bin 2 100000
==2014==
--2014-- warning: L3 cache found, using its data for the LL simulation.
--2014-- warning: specified LL cache: line_size 64  assoc 12  total_size 9,437,184
--2014-- warning: simulated LL cache: line_size 64  assoc 18  total_size 9,437,184
test_make_shared
18107093611968
==2014==
==2014== I   refs:      41,283,983
==2014== I1  misses:         1,805
==2014== LLi misses:         1,696
==2014== I1  miss rate:       0.00%
==2014== LLi miss rate:       0.00%
==2014==
==2014== D   refs:      14,997,474  (8,834,690 rd   + 6,162,784 wr)
==2014== D1  misses:       241,781  (  164,368 rd   +    77,413 wr)
==2014== LLd misses:        84,413  (    7,943 rd   +    76,470 wr)
==2014== D1  miss rate:        1.6% (      1.9%     +       1.3%  )
==2014== LLd miss rate:        0.6% (      0.1%     +       1.2%  )
==2014==
==2014== LL refs:          243,586  (  166,173 rd   +    77,413 wr)
==2014== LL misses:         86,109  (    9,639 rd   +    76,470 wr)
==2014== LL miss rate:         0.2% (      0.0%     +       1.2%  )
==2014==
==2014== Branches:       7,031,695  (6,426,222 cond +   605,473 ind)
==2014== Mispredicts:      216,010  (   15,442 cond +   200,568 ind)
==2014== Mispred rate:         3.1% (      0.2%     +      33.1%   )

As you can see, the cache-miss and branch-misprediction rates are higher when I use std::make_shared. I'd expect std::make_shared to be more efficient because both the stored object and the control block are located in the same memory block, or at least that the performance would be the same.

What am I missing?

Environment details:

$ g++ --version
g++ (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
fsquirrel
  • Do you get consistent results if you run the test several times? – François Andrieux Jan 08 '20 at 20:54
  • It's implementation-dependent. Try testing with other compilers. – Michael Chourdakis Jan 08 '20 at 20:54
  • @FrançoisAndrieux, yes I do – fsquirrel Jan 08 '20 at 21:06
  • 1
    Isn't cachegrind just *simulating*, not measuring? https://valgrind.org/docs/manual/cg-manual.html#branch-sim says *Conditional branches are predicted using an array of 16384 2-bit saturating counters.* and that it's supposed to represent a typical desktop/server from 2004. That simplistic branch prediction is a joke by modern standards, and was overly simplistic for high-performance x86 even in 2004; see https://danluu.com/branch-prediction/. Intel since Haswell uses IT-TAGE. – Peter Cordes Jan 08 '20 at 21:08
  • What happens if you use `perf` instead to check the actual performance? – NathanOliver Jan 08 '20 at 21:14
  • 1
    For real-life performance measurements, you might like to take a look at this: https://developer.amd.com/amd-uprof/ – Paul Sanders Jan 08 '20 at 21:30

1 Answer


Isn't cachegrind just simulating, not measuring? https://valgrind.org/docs/manual/cg-manual.html#branch-sim says Conditional branches are predicted using an array of 16384 2-bit saturating counters. and that it's supposed to represent a typical desktop/server from 2004.

That simplistic branch prediction with 2-bit saturating counters is a joke by modern standards, and was overly simplistic for high-performance CPUs even in 2004; Pentium II/III had a 2-level adaptive local/global predictor with 4 bits per entry of local history, according to https://danluu.com/branch-prediction/. See also https://agner.org/optimize/; Agner's microarch PDF has a chapter on branch prediction near the start.

Intel since Haswell uses IT-TAGE, and modern AMD also uses advanced branch prediction techniques.

I wouldn't be surprised if you have a couple branches that happen to alias each other in valgrind's simulation, leading to mispredicts for the one that runs less frequently.


Have you tried using real HW perf counters? e.g. on Linux:
perf stat -d ./cache_misses.bin 2 100000 should give you a more realistic picture of real hardware, including the real L1d miss rate and branch misprediction rate. perf events like branches and branch-misses map to specific HW counters depending on the CPU microarchitecture; perf list will show you the events available.

I often use taskset -c 3 perf stat -etask-clock:u,context-switches,cpu-migrations,page-faults,cycles:u,instructions:u,branches:u,branch-misses:u,uops_issued.any:u,uops_executed.thread:u -r 2 ./program_under_test on my Skylake CPU.

(Actually I usually leave out branch-misses because I'm often tuning a SIMD loop that doesn't have unpredictable branches, and there are a limited number of HW counters that can be programmed to count different events.)

Peter Cordes