Working on a custom allocator for use with STL, I discovered that there are some scenarios where std::allocator substantially outperforms any custom allocator I've tried on some Linux platforms.
On Windows, using both VC++ 2019 and clang, I do not see any appreciable differences in speed.
What I cannot understand is why I see such vastly different results on Linux (I have tried both Fedora and Ubuntu). I created this program to demonstrate what I'm seeing, borrowing the SimpleAllocator presented in a separate question (referenced in the code):
#include <chrono>
#include <iostream>
#include <string>
#include <vector>
// SimpleAllocator code from:
// https://stackoverflow.com/questions/22487267/unable-to-use-custom-allocator-with-allocate-shared-make-shared
template <class Tp>
struct SimpleAllocator
{
    typedef Tp value_type;

    SimpleAllocator() {}

    template <class T>
    SimpleAllocator(const SimpleAllocator<T>& other) {}

    Tp* allocate(std::size_t n)
    {
        return static_cast<Tp*>(::operator new(n * sizeof(Tp)));
    }

    void deallocate(Tp* p, std::size_t n)
    {
        ::operator delete(p);
    }
};
template <class T, class U>
bool operator==(const SimpleAllocator<T>&, const SimpleAllocator<U>&) { return true; }
template <class T, class U>
bool operator!=(const SimpleAllocator<T>&, const SimpleAllocator<U>&) { return false; }
template <typename T>
void TimeInsertions(T& vec, const std::string& alloc_name)
{
    auto start_time = std::chrono::steady_clock::now();
    for (int i = 0; i <= 100000000; i++)
    {
        vec.push_back(i);
    }
    auto end_time = std::chrono::steady_clock::now();
    std::cout << "Time using " << alloc_name << ": "
              << std::chrono::duration_cast<std::chrono::milliseconds>(end_time - start_time).count()
              << "ms" << std::endl;
}
int main()
{
    {
        std::vector<int, SimpleAllocator<int>> vec;
        TimeInsertions(vec, "SimpleAllocator");
    }
    {
        std::vector<int> vec;
        TimeInsertions(vec, "std::allocator");
    }
}
Given this basic example, I expected SimpleAllocator to perform about the same as std::allocator, but what I actually see are results like this:
$ ./sample
Time using SimpleAllocator: 5283ms
Time using std::allocator: 1485ms
These results vary by machine, of course, but I see similarly large gaps on every Linux machine I've tried. That leads me to believe there is some magic in g++ or Linux that I do not fully understand. Can anyone provide insight into what I'm seeing?
EDIT
Coming back to this today, I suspected this might have something to do with compiler optimizations. I recompiled the code on Linux with g++'s -O3 flag and got very different (and much closer) results:
$ ./sample
Time using SimpleAllocator: 341ms
Time using std::allocator: 479ms
So perhaps this just comes down to how the STL code is compiled, and not to any platform-specific optimizations.