Working on a custom allocator for use with STL, I discovered that there are some scenarios where std::allocator substantially outperforms any custom allocator I've tried on some Linux platforms.

On Windows, using both VC++ 2019 and clang, I do not see any appreciable differences in speed.

What I cannot understand is why I am seeing these vastly different results on Linux platforms (I have tried both Fedora and Ubuntu). I created the program below to demonstrate what I'm seeing, borrowing the SimpleAllocator presented in a separate question (referenced in the code):

#include <vector>
#include <chrono>
#include <iostream>
#include <string>

// SimpleAllocator code from:
// https://stackoverflow.com/questions/22487267/unable-to-use-custom-allocator-with-allocate-shared-make-shared
template <class Tp>
struct SimpleAllocator
{
  typedef Tp value_type;
  SimpleAllocator() {}
  template <class T> SimpleAllocator(const SimpleAllocator<T>& other) {}
  Tp* allocate(std::size_t n) { return static_cast<Tp*>(::operator new(n * sizeof(Tp))); }
  void deallocate(Tp* p, std::size_t n) { ::operator delete(p); }
};
template <class T, class U>
bool operator==(const SimpleAllocator<T>&, const SimpleAllocator<U>&) { return true; }
template <class T, class U>
bool operator!=(const SimpleAllocator<T>&, const SimpleAllocator<U>&) { return false; }

template <typename T> void TimeInsertions(T &vec, const std::string &alloc_name)
{
    auto start_time = std::chrono::steady_clock::now();
    for (int i = 0; i <= 100000000; i++)
    {
        vec.push_back(i);
    }
    auto end_time = std::chrono::steady_clock::now();

    std::cout << "Time using " << alloc_name << ": "
              << std::chrono::duration_cast<std::chrono::milliseconds>(end_time - start_time).count()
              << "ms" << std::endl;
}

int main()
{
    {
        std::vector<int, SimpleAllocator<int>> vec;
        TimeInsertions(vec, "SimpleAllocator");
    }

    {
        std::vector<int> vec;
        TimeInsertions(vec, "std::allocator");
    }
}

Given this basic example, I expected to see the referenced SimpleAllocator perform about the same as std::allocator, but what I actually see are results like this:

$ ./sample
Time using SimpleAllocator: 5283ms
Time using std::allocator: 1485ms

These results vary by machine, of course, but the gap is similarly large on every Linux machine I have tried. That leads me to believe there is some magic in g++ or Linux that I do not fully understand. Can anyone provide any insight to help me understand what I'm seeing?

EDIT

Coming back to this today, I suspect it might have something to do with compiler optimizations. I re-compiled the code on Linux with gcc's -O3 flag and got very different (and much closer) results:

$ ./sample
Time using SimpleAllocator: 341ms
Time using std::allocator: 479ms

So perhaps this just has to do with how the STL code is compiled, rather than with any platform-specific optimizations.
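
For reference, the two builds differ only in the optimization level, roughly like this (the source file name sample.cpp is illustrative):

$ g++ -o sample sample.cpp
$ g++ -O3 -o sample sample.cpp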

paulej
    _"That leads me to believe there is some magic in g++ or Linux that I do not fully understand."_ _Magic_ is the least thing that will be involved. – πάντα ῥεῖ Jun 15 '19 at 03:59
  • The standard allocator is designed and written to be good enough in almost all situations. You should only use a custom allocator in very narrow and specific cases where the standard allocator *isn't* good enough for your specific use case. – Some programmer dude Jun 15 '19 at 04:00
  • I can't tell you *why*, but the assembly from g++ [here](https://godbolt.org/z/eejc06) shows (to my naive eye) that it is able to optimize the standard allocator version much more than the custom one. My guess is that it can make many more assumptions about the std::vector + std::allocator scenario. – kmdreko Jun 15 '19 at 06:38
  • I voted to close as "too broad", since the only way to answer would be to examine inner workings of the compilers, code in their associated standard libraries, and any interaction (implementation-defined or compiler specific magic) between them. A compiler can certainly treat standard library types differently than user-defined types, since compiler developers have visibility of the specification of the standard library. You could always examine source of g++ and clang compilers to see what they do differently. Convincing Microsoft to give source for their compiler might be more difficult. – Peter Jun 15 '19 at 09:21
  • What is your actual question now, following the edit to your post which now shows times which are not "vastly different"? – skomisa Jun 16 '19 at 04:05
  • The question should now be: how come SampleAllocator gives better performance than std::allocator? – Marc Glisse Jun 16 '19 at 15:35
  • And the things slowing std::allocator down seem to be: checking for overflow, and sized deallocation. If I remove both, the code is still very different (memcpy vs vectorized code), but the perf is similar. – Marc Glisse Jun 16 '19 at 16:26
  • @skomisa I think I have the answer now, which is that the differences were largely due to compiler optimizations. I'm still curious why the unoptimized builds produced such different results on Linux. – paulej Jun 17 '19 at 13:13
  • @paulej Perhaps some parts used by `std::allocator` are already compiled into the C++ library (with optimizations, of course). It's rarely meaningful to compare unoptimized performance. – Ted Lyngmo Jun 19 '19 at 17:52
  • @TedLyngmo Yeah, I think that is the most likely explanation. When I explicitly specified the optimization flags, the results were more consistent. – paulej Jun 20 '19 at 18:40
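
Following up on Marc Glisse's comment about overflow checking and sized deallocation: below is a minimal sketch of SimpleAllocator with just those two pieces of work added. The CheckedAllocator name and the exact form of the check are illustrative only; this approximates, rather than reproduces, what libstdc++'s std::allocator actually does.

#include <cstddef>
#include <limits>
#include <new>

// Sketch only: SimpleAllocator plus the two extras mentioned in the comments,
// an overflow check in allocate() and sized deallocation in deallocate().
// This approximates, but is not, the real libstdc++ std::allocator.
template <class Tp>
struct CheckedAllocator
{
  typedef Tp value_type;
  CheckedAllocator() {}
  template <class T> CheckedAllocator(const CheckedAllocator<T>& other) {}
  Tp* allocate(std::size_t n)
  {
    // Reject requests whose byte count would overflow std::size_t.
    if (n > std::numeric_limits<std::size_t>::max() / sizeof(Tp))
      throw std::bad_alloc();
    return static_cast<Tp*>(::operator new(n * sizeof(Tp)));
  }
  void deallocate(Tp* p, std::size_t n)
  {
    // Sized deallocation (C++14): pass the allocation size back to operator delete.
    ::operator delete(p, n * sizeof(Tp));
  }
};
template <class T, class U>
bool operator==(const CheckedAllocator<T>&, const CheckedAllocator<U>&) { return true; }
template <class T, class U>
bool operator!=(const CheckedAllocator<T>&, const CheckedAllocator<U>&) { return false; }

Timing a std::vector<int, CheckedAllocator<int>> with the same TimeInsertions loop should show how much of the gap on a given machine comes from those two checks; per the comment above, removing them makes the performance similar even though the generated code still differs.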
