13

Consider the following class, created mainly for benchmarking purposes:

class String {
  char* data_; 
public:
  String(const char* arg = "") : data_(new char[strlen(arg) + 1]) { strcpy(data_, arg); }
  String(const String& other) : String(other.data_) { }
  String(String&& other) noexcept : data_(other.data_) { other.data_ = nullptr; }
  String& operator=(String other) noexcept { swap(other); return *this; }  
  ~String() { delete[] data_; }
  void swap(String& rhs) noexcept { std::swap(data_, rhs.data_); }
  const char* data() const { return data_; }   
};
void swap(String& lhs, String& rhs) noexcept { lhs.swap(rhs); }

I am trying to compare the efficiency of swapping of two of its instances with custom swap and std::swap. For custom swap, GCC 8.2 (-O2) generates the following x86_64 assembly:

mov     rax, QWORD PTR [rdi]
mov     rdx, QWORD PTR [rsi]
mov     QWORD PTR [rdi], rdx
mov     QWORD PTR [rsi], rax
ret

which exactly matches swapping of two pointers. However, for std::swap, the generated assembly is:

  mov     rdx, QWORD PTR [rsi]
  mov     QWORD PTR [rsi], 0    // (A)
  mov     rax, QWORD PTR [rdi]
  mov     QWORD PTR [rdi], 0    // (1)
  mov     QWORD PTR [rsi], rax  // (B)
  mov     rax, QWORD PTR [rdi]  // (2)
  mov     QWORD PTR [rdi], rdx
  test    rax, rax              // (3)
  je      .L3
  mov     rdi, rax
  jmp     operator delete[](void*) 
.L3:
  ret

What I am curious about is why GCC generates such inefficient code. The instruction (1) sets [rdi] to zero. That zero is then loaded into rax (2). And then, rax is tested (3) whether or not operator delete should be called.

Why GCC tests rax if it is guaranteed to be zero? It seems to be a pretty simple case for the optimizer to avoid this test.

Godbolt demo: https://godbolt.org/z/WNm2if


Another source of inefficiency is that 0 is written to [rsi] first (A) and then it is overwritten with another value (B).

Bottom line: I would expect a compiler to generate the same machine code for std::swap as well as for custom swap, which does not happen. This indicates that writing custom swapping functions makes sense even for classes that support move semantics.

Daniel Langr
  • 22,196
  • 3
  • 50
  • 93
  • At first glance, if rsi == rdi it's not guaranteed to be zero. But I can't explain what it's trying to do. – Rup Oct 15 '18 at 10:21
  • @Rup `rsi` and `rdi` point to swapped objects. Generally, they are different of course. But it's not relevant to my question. – Daniel Langr Oct 15 '18 at 10:27
  • Sorry, yes. I meant I can't explain why it would be testing equality at all, or testing it like that or calling delete if they are equal. It's relevant to your question "Why GCC tests rax if it is guaranteed to be zero" because rax is not guaranteed to be zero when rdi == rsi. Yes rdi and rsi ought to be different, but short of some guarantee they point to different objects how is the compiler to say that? – Rup Oct 15 '18 at 10:29
  • Perhaps the compiler does it whatever was the original size of the type. If it was eax instead of rax then it would be a good thing to set it to 0 because you would not replace the 32 bits on the left. That's just my guess though and I also add very poor code generated with Godbolt gcc compiler for c++ like: mov QWORD PTR [rbp-8], rdi mov QWORD PTR [rbp-16], rsi mov rdx, QWORD PTR [rbp-8] mov rax, QWORD PTR [rbp-16] – Antonin GAVREL Oct 15 '18 at 10:29
  • Changing the copy-assignment to `String& operator=(String&& other)` fixes it.Your version creates an extra object for each such call, and `std::swap` works by a series of 3 `operator=` calls if no specialized version exists – M.M Oct 15 '18 at 10:36
  • 4
    [Scott Meyers wrote](http://scottmeyers.blogspot.com/2014/06/the-drawbacks-of-implementing-move.html) about the drawbacks of copy and swap in the world of move semantics. That's one of the drawbacks. – StoryTeller - Unslander Monica Oct 15 '18 at 10:42
  • 1
    Even in OP's original code I don't see why the compiler fails to optimize out `test rax, rax` (since it just loaded `rax` with `0`). Could be an optimization bug – M.M Oct 15 '18 at 10:49
  • By the way, the standard library tries ADL swap before `std::swap`, so if your code provides `swap` member function then standard library ops won't run into the problem (and nor will well-written user code that does ADL swap) – M.M Oct 15 '18 at 10:51
  • 1
    @M.M In the asm code, if `rsi` and `rdi` point to two 8byte memory with overlap but not identical, the `test` can't be optimized away. – llllllllll Oct 15 '18 at 11:00
  • @liliscent This case didn't come to my mind. But it's likely the reason and I would accept it as an answer. This also explains why in [this benchmark](http://quick-bench.com/gQNlsAsBUx9Orlnkuqw_ogqapnE) both `std::swap` and custom `swap` yield the same performance, since a compiler is likely able to recognize that both swapped pointers do not overlap / are different. – Daniel Langr Oct 15 '18 at 11:10
  • If the two data members partially overlap that way then at least one of them is misaligned which the compiler is normally allowed to assumed is impossible – harold Oct 15 '18 at 11:26
  • @harold Yes, that's why I put that as a comment. I'm not sure whether this should be considered as a optimization bug. But the situation does seem to be a little complex for the compiler to figure out. – llllllllll Oct 15 '18 at 11:29
  • 5
    Add `__restrict` to the 2 arguments, and notice the difference. – Marc Glisse Oct 15 '18 at 12:10
  • @MarcGlisse Great, it works: https://godbolt.org/z/W_t4ZK. I didn't know GCC supported `__restrict` for references. Moreover, it works for Clang as well. – Daniel Langr Oct 15 '18 at 12:18
  • 5
    There is no issue of partial overlap, the compiler has trouble untangling the case where lhs and rhs are the same. To see that rax contains 0: it is read from rdi, which if rdi!=rsi had 0 last written to it, but otherwise had rax which itself had [rdi] which itself had 0 because rdi==rsi. Not trivial. (btw, the file created by -fdump-tree-optimized is more readable than asm) – Marc Glisse Oct 15 '18 at 12:21
  • 4
    I [reported this](https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80617) once. It used to have more branches! – Davis Herring Oct 15 '18 at 13:26
  • @DavisHerring that sounds like an answer. – Shafik Yaghmour Oct 16 '18 at 01:52

2 Answers2

2

The instruction (1) sets [rdi] to zero. That zero is then loaded into rax (2).

This isn't always the case, because rax is copied to [rsi] first, and there is no guarantee that [rsi] and [rdi] are different. So gcc cannot make this assumption, and it doesn't. Neither does any other compiler.

Edit (extracted from the comments): If [rsi] and [rdi] are the same, then [rdi] contains zero anyway (a "different" one that comes from Instruction A) and if they partially overlap, that's undefined behaviour and the compiler can assume it will never happen. But the optimiser doesn't see this far.

This indicates that writing custom swapping functions makes sense even for classes that support move semantics.

Write separate copy assignment and move assignment operators, and the problem goes away. This newfangled pass-by-copy assignment operator looks like a brilliant idea, but as we can see, compilers may have some trouble optimising it away---and other compilers have a lot of trouble. Just look at the assembly generated by MSVC or clang.

n. m. could be an AI
  • 112,515
  • 14
  • 128
  • 243
-1

The reasoning is most likely that while the first code sample generates efficient code because there are no abstractions between the program and the desired executable (you are programming the swap).

In the second case, swap is a member of the standard library. which might have this check as a safety measure, which GCC simply did not optimize out.

Basically, I believe the standard library must have such safety checks. Writing code for a specific task will yield code optimized to the specific task. You do not need to cover all the possible cases that it might make sense to use swap.

Caribou
  • 2,070
  • 13
  • 29
wizard7377
  • 11
  • 3
  • 1
    In the source code of the primary template of `std::swap` provided by libstdc++, there are no safety checks: https://github.com/gcc-mirror/gcc/blob/master/libstdc%2B%2B-v3/include/bits/move.h#L196. The code is minimal. – Daniel Langr Apr 17 '23 at 13:35
  • While you are correct about the optimizer, I was making two separate points. 1. The standard library, by its nature, is not optimized to a specific use case, and 2. The optimizer is not able to optimize them out. These are two separate points, sorry if I didn’t make that clear. – wizard7377 Apr 17 '23 at 14:15
  • As the comments above suggest your answer may not be quite correct think about correcting it, also this question is 4 years old and probably didn't need a new answer. Consider the layout and wording when you compose a response as yours doesn't parse too well. – Caribou Apr 21 '23 at 14:00