I'm trying to measure the efficiency of pointers created with the std::shared_ptr constructor and with std::make_shared.
I have the following test code:
    #include <cstdlib>   // atoi
    #include <iostream>
    #include <memory>
    #include <vector>

    struct TestClass {
        TestClass(int _i) : i(_i) {}
        int i = 1;
    };

    // Walks the vector once and dereferences every pointer, so the objects
    // (not just the shared_ptr handles) are actually touched.
    void sum(const std::vector<std::shared_ptr<TestClass>>& v) {
        unsigned long long s = 0u;
        for(size_t i = 0; i < v.size() - 1; ++i) {
            s += v[i]->i * v[i + 1]->i;
        }
        std::cout << s << '\n';
    }

    // Object and control block are allocated separately.
    void test_shared_ptr(size_t n) {
        std::cout << __FUNCTION__ << "\n";
        std::vector<std::shared_ptr<TestClass>> v;
        v.reserve(n);
        for(size_t i = 0u; i < n; ++i) {
            v.push_back(std::shared_ptr<TestClass>(new TestClass(i)));
        }
        sum(v);
    }

    // Object and control block are created by std::make_shared.
    void test_make_shared(size_t n) {
        std::cout << __FUNCTION__ << "\n";
        std::vector<std::shared_ptr<TestClass>> v;
        v.reserve(n);
        for(size_t i = 0u; i < n; ++i) {
            v.push_back(std::make_shared<TestClass>(i));
        }
        sum(v);
    }

    // Usage: ./cache_misses.bin <1|2> [n]
    //   1 selects the std::shared_ptr constructor, anything else std::make_shared.
    int main(int argc, char *argv[]) {
        size_t n = (argc == 3 ) ? atoi(argv[2]) : 100;
        if(atoi(argv[1]) == 1) {
            test_shared_ptr(n);
        } else {
            test_make_shared(n);
        }
        return 0;
    }
Compiled with:
g++ -W -Wall -O2 -g -std=c++14 main.cpp -o cache_misses.bin
I run the test with the std::shared_ptr constructor and check the results with valgrind:
valgrind --tool=cachegrind --branch-sim=yes ./cache_misses.bin 1 100000
==2005== Cachegrind, a cache and branch-prediction profiler
==2005== Copyright (C) 2002-2017, and GNU GPL'd, by Nicholas Nethercote et al.
==2005== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright info
==2005== Command: ./cache_misses.bin 1 100000
==2005==
--2005-- warning: L3 cache found, using its data for the LL simulation.
--2005-- warning: specified LL cache: line_size 64 assoc 12 total_size 9,437,184
--2005-- warning: simulated LL cache: line_size 64 assoc 18 total_size 9,437,184
test_shared_ptr
18107093611968
==2005==
==2005== I refs: 74,188,102
==2005== I1 misses: 1,806
==2005== LLi misses: 1,696
==2005== I1 miss rate: 0.00%
==2005== LLi miss rate: 0.00%
==2005==
==2005== D refs: 26,099,141 (15,735,722 rd + 10,363,419 wr)
==2005== D1 misses: 392,064 ( 264,583 rd + 127,481 wr)
==2005== LLd misses: 134,416 ( 7,947 rd + 126,469 wr)
==2005== D1 miss rate: 1.5% ( 1.7% + 1.2% )
==2005== LLd miss rate: 0.5% ( 0.1% + 1.2% )
==2005==
==2005== LL refs: 393,870 ( 266,389 rd + 127,481 wr)
==2005== LL misses: 136,112 ( 9,643 rd + 126,469 wr)
==2005== LL miss rate: 0.1% ( 0.0% + 1.2% )
==2005==
==2005== Branches: 12,732,402 (11,526,905 cond + 1,205,497 ind)
==2005== Mispredicts: 16,055 ( 15,481 cond + 574 ind)
==2005== Mispred rate: 0.1% ( 0.1% + 0.0% )
And with std::make_shared:
valgrind --tool=cachegrind --branch-sim=yes ./cache_misses.bin 2 100000
==2014== Cachegrind, a cache and branch-prediction profiler
==2014== Copyright (C) 2002-2017, and GNU GPL'd, by Nicholas Nethercote et al.
==2014== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright info
==2014== Command: ./cache_misses.bin 2 100000
==2014==
--2014-- warning: L3 cache found, using its data for the LL simulation.
--2014-- warning: specified LL cache: line_size 64 assoc 12 total_size 9,437,184
--2014-- warning: simulated LL cache: line_size 64 assoc 18 total_size 9,437,184
test_make_shared
18107093611968
==2014==
==2014== I refs: 41,283,983
==2014== I1 misses: 1,805
==2014== LLi misses: 1,696
==2014== I1 miss rate: 0.00%
==2014== LLi miss rate: 0.00%
==2014==
==2014== D refs: 14,997,474 (8,834,690 rd + 6,162,784 wr)
==2014== D1 misses: 241,781 ( 164,368 rd + 77,413 wr)
==2014== LLd misses: 84,413 ( 7,943 rd + 76,470 wr)
==2014== D1 miss rate: 1.6% ( 1.9% + 1.3% )
==2014== LLd miss rate: 0.6% ( 0.1% + 1.2% )
==2014==
==2014== LL refs: 243,586 ( 166,173 rd + 77,413 wr)
==2014== LL misses: 86,109 ( 9,639 rd + 76,470 wr)
==2014== LL miss rate: 0.2% ( 0.0% + 1.2% )
==2014==
==2014== Branches: 7,031,695 (6,426,222 cond + 605,473 ind)
==2014== Mispredicts: 216,010 ( 15,442 cond + 200,568 ind)
==2014== Mispred rate: 3.1% ( 0.2% + 33.1% )
As you can see, the cache miss and branch misprediction rates are higher when I use std::make_shared.
I'd expect std::make_shared to be more efficient, because the stored object and the control block are located in the same memory block. At the very least, I'd expect the performance to be the same.
What am I missing?
Environment details:
$ g++ --version
g++ (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.