Consider the following class, created mainly for benchmarking purposes:
#include <cstring>   // std::strlen, std::strcpy
#include <utility>   // std::swap

class String {
  char* data_;
public:
  String(const char* arg = "") : data_(new char[std::strlen(arg) + 1]) { std::strcpy(data_, arg); }
  String(const String& other) : String(other.data_) { }
  String(String&& other) noexcept : data_(other.data_) { other.data_ = nullptr; }
  String& operator=(String other) noexcept { swap(other); return *this; }
  ~String() { delete[] data_; }
  void swap(String& rhs) noexcept { std::swap(data_, rhs.data_); }
  const char* data() const { return data_; }
};

void swap(String& lhs, String& rhs) noexcept { lhs.swap(rhs); }
I am trying to compare the efficiency of swapping two of its instances with the custom swap and with std::swap. For the custom swap, GCC 8.2 (-O2) generates the following x86_64 assembly:
mov rax, QWORD PTR [rdi]
mov rdx, QWORD PTR [rsi]
mov QWORD PTR [rdi], rdx
mov QWORD PTR [rsi], rax
ret
which exactly matches swapping two pointers. However, for std::swap, the generated assembly is:
mov rdx, QWORD PTR [rsi]
mov QWORD PTR [rsi], 0 // (A)
mov rax, QWORD PTR [rdi]
mov QWORD PTR [rdi], 0 // (1)
mov QWORD PTR [rsi], rax // (B)
mov rax, QWORD PTR [rdi] // (2)
mov QWORD PTR [rdi], rdx
test rax, rax // (3)
je .L3
mov rdi, rax
jmp operator delete[](void*)
.L3:
ret
What I am curious about is why GCC generates such inefficient code. Instruction (1) sets [rdi] to zero. That zero is then loaded into rax (2), and rax is then tested (3) to decide whether operator delete[] should be called.
Why does GCC test rax if it is guaranteed to be zero? It seems to be a pretty simple case for the optimizer to avoid this test.
Godbolt demo: https://godbolt.org/z/WNm2if
Another source of inefficiency is that 0 is first written to [rsi] (A) and then overwritten with another value (B).
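To make it easier to see where the extra stores and the delete[] call come from, here is a rough sketch of what the generic std::swap amounts to for this class: one move construction followed by two assignments, each of which goes through the by-value operator=. The function name expanded_swap and the local name param are made up for illustration; this is not the library's actual code, only a hand-written equivalent under that assumption.

```cpp
#include <cstring>
#include <utility>

class String {
  char* data_;
public:
  String(const char* arg = "") : data_(new char[std::strlen(arg) + 1]) { std::strcpy(data_, arg); }
  String(const String& other) : String(other.data_) { }
  String(String&& other) noexcept : data_(other.data_) { other.data_ = nullptr; }
  String& operator=(String other) noexcept { swap(other); return *this; }
  ~String() { delete[] data_; }
  void swap(String& rhs) noexcept { std::swap(data_, rhs.data_); }
  const char* data() const { return data_; }
};

// Hypothetical helper (not in the question): the generic swap written out by hand.
void expanded_swap(String& a, String& b) noexcept {
  String tmp(std::move(a));        // tmp takes a's buffer; a.data_ becomes nullptr

  {                                // a = std::move(b);
    String param(std::move(b));    // by-value parameter of operator=; b.data_ becomes nullptr
    a.swap(param);                 // a takes b's old buffer; param now holds nullptr
  }                                // ~String() on param: delete[] nullptr

  {                                // b = std::move(tmp);
    String param(std::move(tmp));  // tmp.data_ becomes nullptr
    b.swap(param);                 // b takes a's old buffer; param now holds nullptr
  }                                // ~String() on param: delete[] nullptr
}                                  // ~String() on tmp: delete[] nullptr
```

Written out this way, every destructor runs on an object whose pointer is already null, so none of the delete[] calls has any work to do. The compiler would have to prove that for each of them in order to reach the four-instruction version.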
Bottom line: I would expect a compiler to generate the same machine code for std::swap as for the custom swap, which does not happen. This suggests that writing custom swap functions makes sense even for classes that support move semantics.
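For completeness, a custom swap only helps if callers actually reach it. A minimal sketch of the usual way to do that in generic code is below; the names demo, Handle, custom_swap_calls and generic_exchange are all hypothetical, and the counter exists only to show which overload was chosen.

```cpp
#include <utility>

namespace demo {

struct Handle {
  int* p = nullptr;
};

int custom_swap_calls = 0;  // illustration only: counts calls to the custom swap

// Custom non-member swap, found through argument-dependent lookup.
void swap(Handle& a, Handle& b) noexcept {
  ++custom_swap_calls;
  std::swap(a.p, b.p);      // plain pointer swap
}

}  // namespace demo

// Generic code: make std::swap visible as a fallback, then call swap unqualified
// so that a type's own swap is preferred when one exists.
template <class T>
void generic_exchange(T& a, T& b) {
  using std::swap;
  swap(a, b);
}
```

Calling std::swap(x, y) with an explicit qualification bypasses the custom overload and goes through the move constructor and move assignment instead.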