I tried to compare the overhead of std::visit (std::variant polymorphism) and virtual functions (std::unique_ptr polymorphism). (Please note my question is not about overhead or performance, but about optimization.) Here is my code: https://quick-bench.com/q/pJWzmPlLdpjS5BvrtMb5hUWaPf0

#include <memory>
#include <variant>

struct Base
{
  virtual void Process() = 0;
};

struct Derived : public Base
{
  void Process() { ++a; }
  int a = 0;
};

struct VarDerived
{
  void Process() { ++a; }
  int a = 0;
};

static std::unique_ptr<Base> ptr;
static std::variant<VarDerived> var;

static void PointerPolyMorphism(benchmark::State& state)
{
  ptr = std::make_unique<Derived>();
  for (auto _ : state)
  {
    for(int i = 0; i < 1000000; ++i)
      ptr->Process();
  }
}
BENCHMARK(PointerPolyMorphism);

static void VariantPolyMorphism(benchmark::State& state)
{
  var.emplace<VarDerived>();
  for (auto _ : state)
  {
    for(int i = 0; i < 1000000; ++i)
      std::visit([](auto&& x) { x.Process();}, var);
  }
}
BENCHMARK(VariantPolyMorphism);

I know it's not a good benchmark; it was only a draft during my testing. But I was surprised at the result. The std::visit benchmark was high (meaning slow) without any optimization. But when I turn on optimization (higher than O2), the std::visit benchmark is extremely low (meaning extremely fast) while the std::unique_ptr one isn't. I'm wondering why the same optimization can't be applied to the std::unique_ptr polymorphism?

Dickless
  • you are comparing two codes that do different things and wonder why they perform differently? – 463035818_is_not_an_ai Oct 16 '20 at 13:14
  • Are you comparing apples with wheelchairs? I have absolutely no clue what you're talking about, or what you expect. – πάντα ῥεῖ Oct 16 '20 at 13:14
  • `std::variant` and virtual functions work in completely different ways. It's unclear what kind of optimizations you are expecting that would apply to both. – UnholySheep Oct 16 '20 at 13:16
  • btw the name `RuntimePolyMorphism` is a bit misleading, because your `PointerPolyMorphism` is runtime polymorphism – 463035818_is_not_an_ai Oct 16 '20 at 13:17
  • Perhaps your question would be clearer if you wrote out for yourself (not as part of this question) an argument why the same optimization *can* be applied? The process should help you clarify and formalize your thoughts, which could then help you explain to us (as part of this question) what you are trying to ask. – JaMiT Oct 16 '20 at 13:27
  • variants are not polymorphic. They contain only one object at a time. You access `VarDerived` through static typing, not dynamic typing. – Raildex Oct 16 '20 at 13:35
  • `std::variant` inlines to a union and a jump table. You're only using one type, so I wouldn't be surprised if the whole branch could be elided. This example is _so simple_ that for me, gcc optimized the entire loop and just assigned 1,000,000 to the member. https://godbolt.org/z/dhqWY5 – parktomatomi Oct 16 '20 at 13:39
  • @JaMiT I don't mean anything about what it processes; all I want to know is what information is important for the compiler to optimize. Both Derived and VarDerived have the same interface and do the same work, but the compiler couldn't optimize the std::unique_ptr one. That's my question. Hope it makes sense! – Dickless Oct 16 '20 at 13:39
  • @Dickless Sorry, no that comment is less clear / makes less sense than your question. – JaMiT Oct 16 '20 at 13:41
  • @parktomatomi You are right. I changed the `static std::variant var;` declaration to also hold `AnotherVarDerived`, which has the same interface as `VarDerived`, and the `VariantPolyMorphism` optimization no longer happened. If the number of "derived" classes is what matters, there was only one class that inherits from `Base`, so the `PointerPolyMorphism` one could also have been optimized, but it wasn't. Why? As others said, are they optimized differently because of static typing versus dynamic typing? I don't understand what these two kinds of typing are. Could anyone explain them in more depth? – Dickless Oct 16 '20 at 14:07
  • C++ is statically typed, always. Both variants and virtual polymorphism are methods of dynamic _dispatch_. Virtual polymorphism, beneath the `virtual` concept and `vtable`, uses indirection for dispatch. The address of a virtual method is stored in a variable. When you call it, that address is read, and the CPU jumps to that location. `std::variant` uses branching: the type is stored in a variable, and when you call `std::visit`, it generates the equivalent of a switch block to call the right function. That's easier for a compiler to optimize, but with many types it generates much more code. – parktomatomi Oct 16 '20 at 15:18
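
To make the two dispatch strategies described in the last comment concrete, here is a deliberately simplified sketch (the type and function names are illustrative only; this is not what compilers actually emit):

#include <variant>

struct A { void Process() { ++a; } int a = 0; };
struct B { void Process() { --a; } int a = 0; };

// std::visit dispatch: roughly a switch on the stored alternative index,
// so each branch is a direct, inlinable call.
void VisitLikeDispatch(std::variant<A, B>& v)
{
  switch (v.index())
  {
    case 0: std::get<0>(v).Process(); break;
    case 1: std::get<1>(v).Process(); break;
  }
}

// Virtual dispatch: the compiler emits a load of the vtable pointer, a load
// of the function pointer from a slot, and an indirect call whose target is
// only known at run-time.
struct VBase { virtual void Process() = 0; };
struct VImpl : VBase { void Process() override { ++a; } int a = 0; };

void VirtualDispatch(VBase& b)
{
  b.Process();  // indirect call through the vtable
}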

2 Answers


I've compiled your code (without your benchmarking) with Clang++ to LLVM IR at -Ofast. Here's what you get for VariantPolyMorphism, unsurprisingly:

define void @_Z19VariantPolyMorphismv() local_unnamed_addr #2 {
  ret void
}

On the other hand, PointerPolyMorphism really does execute the loop and all the calls:

define void @_Z19PointerPolyMorphismv() local_unnamed_addr #2 personality i32 (...)* @__gxx_personality_v0 {
  %1 = tail call dereferenceable(16) i8* @_Znwm(i64 16) #8, !noalias !8
  tail call void @llvm.memset.p0i8.i64(i8* nonnull align 16 dereferenceable(16) %1, i8 0, i64 16, i1 false), !noalias !8
  %2 = bitcast i8* %1 to i32 (...)***
  store i32 (...)** bitcast (i8** getelementptr inbounds ({ [3 x i8*] }, { [3 x i8*] }* @_ZTV7Derived, i64 0, inrange i32 0, i64 2) to i32 (...)**), i32 (...)*** %2, align 8, !tbaa !11, !noalias !8
  %3 = getelementptr inbounds i8, i8* %1, i64 8
  %4 = bitcast i8* %3 to i32*
  store i32 0, i32* %4, align 8, !tbaa !13, !noalias !8
  %5 = load %struct.Base*, %struct.Base** getelementptr inbounds ({ { %struct.Base* } }, { { %struct.Base* } }* @_ZL3ptr, i64 0, i32 0, i32 0), align 8, !tbaa !4
  store i8* %1, i8** bitcast ({ { %struct.Base* } }* @_ZL3ptr to i8**), align 8, !tbaa !4
  %6 = icmp eq %struct.Base* %5, null
  br i1 %6, label %7, label %8

7:                                                ; preds = %8, %0
  br label %11

8:                                                ; preds = %0
  %9 = bitcast %struct.Base* %5 to i8*
  tail call void @_ZdlPv(i8* %9) #7
  br label %7

10:                                               ; preds = %11
  ret void

11:                                               ; preds = %7, %11
  %12 = phi i32 [ %17, %11 ], [ 0, %7 ]
  %13 = load %struct.Base*, %struct.Base** getelementptr inbounds ({ { %struct.Base* } }, { { %struct.Base* } }* @_ZL3ptr, i64 0, i32 0, i32 0), align 8, !tbaa !4
  %14 = bitcast %struct.Base* %13 to void (%struct.Base*)***
  %15 = load void (%struct.Base*)**, void (%struct.Base*)*** %14, align 8, !tbaa !11
  %16 = load void (%struct.Base*)*, void (%struct.Base*)** %15, align 8
  tail call void %16(%struct.Base* %13)
  %17 = add nuw nsw i32 %12, 1
  %18 = icmp eq i32 %17, 1000000
  br i1 %18, label %10, label %11
}

The reason for this is that both your variables are static. This allows the compiler to infer that no code outside the translation unit has access to your variant instance. Therefore your loop doesn't have any visible effect and can safely be removed. However, although your smart pointer is static, the memory it points to could still change (as a side effect of the call to Process, for example). The compiler therefore cannot easily prove that it is safe to remove the loop, and doesn't.
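
For contrast, here is a variation I added for illustration (it is not part of the original benchmark): when the object's dynamic type is visible to the compiler, it can typically devirtualize the call and often fold the loop away as well, although this is compiler-dependent.

static void LocalObjectPolyMorphism(benchmark::State& state)
{
  // The object lives on this function's stack, so the compiler can see that
  // the dynamic type behind the Base reference is always Derived.
  Derived d;
  Base& b = d;
  for (auto _ : state)
  {
    for (int i = 0; i < 1000000; ++i)
      b.Process();  // candidate for devirtualization and inlining
  }
}
BENCHMARK(LocalObjectPolyMorphism);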

If you remove the static from both variables, VariantPolyMorphism becomes:

define void @_Z19VariantPolyMorphismv() local_unnamed_addr #2 {
  store i32 0, i32* getelementptr inbounds ({ { %"union.std::__1::__variant_detail::__union", i32 } }, { { %"union.std::__1::__variant_detail::__union", i32 } }* @var, i64 0, i32 0, i32 1), align 4, !tbaa !16
  store i32 1000000, i32* getelementptr inbounds ({ { %"union.std::__1::__variant_detail::__union", i32 } }, { { %"union.std::__1::__variant_detail::__union", i32 } }* @var, i64 0, i32 0, i32 0, i32 0, i32 0, i32 0), align 4, !tbaa !18
  ret void
}

Which, once again, isn't surprising. The variant can only contain a VarDerived, so nothing needs to be computed at run-time: the final state of the variant can already be determined at compile-time. The difference now, though, is that some other translation unit might access the value of var later on, so that final value must actually be written.
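
In C++ terms, the optimized function now behaves roughly like the following hand-written equivalent (my paraphrase of the IR above, not compiler output):

void VariantPolyMorphismFolded()
{
  var.emplace<VarDerived>();              // the store of the active index
  std::get<VarDerived>(var).a = 1000000;  // the million increments folded into a single store
}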

idmean

  1. Your variant can store only a single type, so it is effectively the same as a single regular variable (it works more like an optional).
  2. You are running the test without optimizations enabled.
  3. The result is not protected from the optimizer, so it can throw your code away (see the sketch after this list).
  4. Your code does not actually use polymorphism; some compilers are able to figure out that there is only one implementation of the Base class and drop the virtual calls.
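
For point 3, here is a minimal sketch of one way to keep the optimizer from discarding the work, using Google Benchmark's benchmark::DoNotOptimize (this is my own variation, not the code behind the "ver 1"/"ver 2" links):

static void VariantPolyMorphismGuarded(benchmark::State& state)
{
  var.emplace<VarDerived>();
  for (auto _ : state)
  {
    for (int i = 0; i < 1000000; ++i)
      std::visit([](auto&& x) { x.Process(); }, var);
    // Mark the result as observed so the loop cannot be removed as dead code.
    benchmark::DoNotOptimize(std::get<VarDerived>(var).a);
  }
}
BENCHMARK(VariantPolyMorphismGuarded);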

These are better but still not fully trustworthy: ver 1, ver 2 with arrays.

Yes, polymorphism can be expensive when used in tight loops.

Writing benchmarks for such small, extremely fast features is hard and full of pitfalls, so it must be approached with extreme caution, because you are reaching the limitations of the benchmark tool.

Marek R