Reimplementing std::swap() with static tmp variable for simple types C++

Question

I decided to benchmark the realization of a swap function for simple types (like int, or struct, or a class that uses only simple types in its fields) with a static tmp variable in it to prevent memory allocation in each swap call. So I wrote this simple test program:

#include <iostream>
#include <chrono>
#include <utility>
#include <vector>


template<typename T>
void mySwap(T& a, T& b)     //Like std::swap - just for tests
{
    T tmp = std::move(a);
    a = std::move(b);
    b = std::move(tmp);
}

template<typename T>
void mySwapStatic(T& a, T& b)   //Here with static tmp
{
    static T tmp;
    tmp = std::move(a);
    a = std::move(b);
    b = std::move(tmp);
}

class Test1 {       //Simple class with some simple types
    int foo;
    float bar;
    char bazz;
};

class Test2 {       //Class with std::vector in it
    int foo;
    float bar;
    char bazz;
    std::vector<int> bizz;
public:
    Test2()
    {
        bizz = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9};
    }
};

#define Test Test1      //choosing class

const static unsigned int NUM_TESTS = 100000000;
static Test a, b;   //making it static to prevent throwing out from code by compiler optimizations

template<typename T, typename F>
auto test(unsigned int numTests, T& a, T& b, const F swapFunction )     //test function
{
    std::chrono::system_clock::time_point t1, t2;
    t1 = std::chrono::system_clock::now();
    for(unsigned int i = 0; i < NUM_TESTS; ++i)    {
        swapFunction(a, b);
    }
    t2 = std::chrono::system_clock::now();
    return std::chrono::duration_cast<std::chrono::nanoseconds>(t2 - t1).count();
}

int main()
{
    std::chrono::system_clock::time_point t1, t2;
    std::cout << "Test 1. MySwap Result:\t\t" << test(NUM_TESTS, a, b, mySwap<Test>) << " nanoseconds\n";   //caling test function
    t1 = std::chrono::system_clock::now();
    for(unsigned int i = 0; i < NUM_TESTS; ++i)    {
        mySwap<Test>(a, b);
    }
    t2 = std::chrono::system_clock::now();
    std::cout << "Test 2. MySwap2 Result:\t\t" << std::chrono::duration_cast<std::chrono::nanoseconds>(t2 - t1).count() << " nanoseconds\n"; //This result slightly better then 1. why?!
    std::cout << "Test 3. MySwapStatic Result:\t" << test(NUM_TESTS, a, b, mySwapStatic<Test>) << " nanoseconds\n"; //test function with mySwapStatic
    t1 = std::chrono::system_clock::now();
    for(unsigned int i = 0; i < NUM_TESTS; ++i)    {
        mySwapStatic<Test>(a, b);
    }
    t2 = std::chrono::system_clock::now();
    std::cout << "Test 4. MySwapStatic2 Result:\t" << std::chrono::duration_cast<std::chrono::nanoseconds>(t2 - t1).count() << " nanoseconds\n"; //And again - it's better then 3...
    std::cout << "Test 5. std::swap Result:\t" << test(NUM_TESTS, a, b, std::swap<Test>) << " nanoseconds\n";   //calling test function with std::swap for comparsion. Mostly similar to 1...
    return 0;
}

Some results with Test defined as Test1 (g++ (Ubuntu 4.8.2-19ubuntu1) 4.8.2 called as g++ main.cpp -O3 -std=c++11):

Test 1. MySwap Result: 625,105,480 nanoseconds

Test 2. MySwap2 Result: 528,701,547 nanoseconds

Test 3. MySwapStatic Result: 338,484,180 nanoseconds

Test 4. MySwapStatic2 Result: 228,228,156 nanoseconds

Test 5. std::swap Result: 564,863,184 nanoseconds

My main question: is it good to use this implementation for swapping of simple types? I know that if you use it for swapping types with vectors, for example, then std::swap is better, and you can see it just by changing the Test define to Test2.

Second question: why are the results in test 1, 2, 3, and 4 so different? What am I doing wrong with the test function implementation?

A problem with the static variable is that it it not thread safe. — SebastianK, Mar 31 '15 at 09:10
Interesting. How does a swap without temporal variable compare with those? — YuppieNetworking, Mar 31 '15 at 09:25
swap with xor's? xor swap [slower](http://stackoverflow.com/questions/18287433/how-fast-is-stdswap-for-integer-types) — dFionov, Mar 31 '15 at 09:31
@SebastianK What if the static variables are also `thread_local` ? — Photon, Mar 31 '15 at 09:50
A problem is this will increase overall memory overhead, as for each type you swap, an instance of it will exist somewhere. Why not overload swap for your classes? — Neil Kirk, Mar 31 '15 at 10:29
`Is it good to use this implementation for simple types swapping?` No. **Always no** when performance appears **before** correctness and simplicity. For mere nanoseconds gain. — UmNyobe, Mar 31 '15 at 10:31
@NeilKirk - I think its add just one static variable for every instance. Not so much if it uses for primitive types... — dFionov, Mar 31 '15 at 11:01
Depends how many. Can become significant if lots of templates are used. — Neil Kirk, Mar 31 '15 at 11:03
@UmNyobe but it's not nanoseconds - it's about two times faster... — dFionov, Mar 31 '15 at 11:04
@NeilKirk and if I'll overload it for all needed types I'll also get all static tmp variables as using it with templates. — dFionov, Mar 31 '15 at 11:13
I was thinking of the vector case, swapping it the normal way in the overload. — Neil Kirk, Mar 31 '15 at 11:27
@NeilKirk vector case here just for example of minuses of this implementation. But i looking for good swap for simple types like points with three fields X, Y, Z. For vectors it's good to use std::swap from beginning — dFionov, Mar 31 '15 at 11:45
The allocation is dirt cheap *unless* it's a very huge object, or copy construction is significantly more expensive than assignment. As @ElderBug shows, your speed difference comes from a separate issue. — peterchen, Mar 31 '15 at 15:18
@Photon: Even with `thread_local`s, you'll have a problem with reentrancy (in admittedly unusual cases); if a `swap` has to recursively swap the same type (there are possible cases where a tree structure might require the memory locations of a node to remain unchanged, so you'd swap the node's contents, not the pointers to them), then oops, one `swap` begins, triggers another `swap` internally, and now the `thread_local` is replaced before the outer `swap` is finished with it. — ShadowRanger, Jun 23 '22 at 20:54

ElderBug · Accepted Answer · 2022-06-23T20:43:36.110

An answer to your second question first : in your test 2 and 4, the compiler is inlining the functions, thus it gives better performances (there is even more on test 4, but I will cover this later).

Overall, it is probably a bad idea to use a static temp variable.

Why ? First, it should be noted that in x86 assembly, there is no instruction to copy from memory to memory. This means that when you swap, there is, not one, but two temporary variables in CPU registers. And these temporary variables MUST be in CPU registers, you can't copy mem to mem, thus a static variable will add a third memory location to transfer to and from.

One problem with your static temp is that it will hinder inlining. Imagine if the variables you swap are already in CPU registers. In this case, the compiler can inline the swapping, and never copy anything to memory, which is much faster. Now, if you forced a static temp, either the compiler removes it (useless), or is forced to add a memory copy. That's what happen in test 4, in which GCC removed all the reads to the static variable. It just pointlessly write updated values to it because you told it do to so. The read removal explains the good performance gain, but it could be even faster.

Your test cases are flawed because they don't show this point.

Now you may ask : Then why my static functions perform better ? ~~I have no idea.~~ (Answer at the end)

I was curious, so I compiled your code with MSVC, and it turns out MSVC is doing it right, and GCC is doing it weird. At O2 optimization levels, MSVC detects that two swaps is a no-op and shortcut them, but even at O1, the non-inlined generated code is faster than all the test case with GCC at O3. (EDIT: Actually, MSVC is not doing it right either, see explanation at the end.)

The assembly generated by MSVC looks indeed better, but when comparing the static and non-static assembly generated by GCC, I don't know why the static performs better.

Anyway, I think even if GCC is generating weird code, the inlining problem should be worth using the std::swap, because with bigger types the additional memory copy could be costly, and smaller types give better inlining.

Here are the assembly produced by all the test cases, if someone have an idea of why the GCC static performs better than the non-static, despite being longer and using more memory moves. EDIT: Answer at the end

GCC non-static (perf 570ms):

    00402F90 44 8B 01             mov         r8d,dword ptr [rcx]
    00402F93 F3 0F 10 41 04       movss       xmm0,dword ptr [rcx+4]
    00402F98 0F B6 41 08          movzx       eax,byte ptr [rcx+8] 
    00402F9C 4C 8B 0A             mov         r9,qword ptr [rdx]
    00402F9F 4C 89 09             mov         qword ptr [rcx],r9
    00402FA2 44 0F B6 4A 08       movzx       r9d,byte ptr [rdx+8]
    00402FA7 44 88 49 08          mov         byte ptr [rcx+8],r9b
    00402FAB 44 89 02             mov         dword ptr [rdx],r8d 
    00402FAE F3 0F 11 42 04       movss       dword ptr [rdx+4],xmm0
    00402FB3 88 42 08             mov         byte ptr [rdx+8],al

GCC static and MSVC static (perf 275ms):

    00402F10 48 8B 01             mov         rax,qword ptr [rcx]  
    00402F13 48 89 05 66 11 00 00 mov         qword ptr [404080h],rax  
    00402F1A 0F B6 41 08          movzx       eax,byte ptr [rcx+8]  
    00402F1E 88 05 64 11 00 00    mov         byte ptr [404088h],al  
    00402F24 48 8B 02             mov         rax,qword ptr [rdx]  
    00402F27 48 89 01             mov         qword ptr [rcx],rax  
    00402F2A 0F B6 42 08          movzx       eax,byte ptr [rdx+8]  
    00402F2E 88 41 08             mov         byte ptr [rcx+8],al  
    00402F31 48 8B 05 48 11 00 00 mov         rax,qword ptr [404080h]  
    00402F38 48 89 02             mov         qword ptr [rdx],rax  
    00402F3B 0F B6 05 46 11 00 00 movzx       eax,byte ptr [404088h]  
    00402F42 88 42 08             mov         byte ptr [rdx+8],al

MSVC non-static (perf 215ms):

    00000   f2 0f 10 02  movsdx  xmm0, QWORD PTR [rdx]
    00004   f2 0f 10 09  movsdx  xmm1, QWORD PTR [rcx]
    00008   44 8b 41 08  mov     r8d, DWORD PTR [rcx+8]
    0000c   f2 0f 11 01  movsdx  QWORD PTR [rcx], xmm0
    00010   8b 42 08     mov     eax, DWORD PTR [rdx+8]
    00013   89 41 08     mov     DWORD PTR [rcx+8], eax
    00016   f2 0f 11 0a  movsdx  QWORD PTR [rdx], xmm1
    0001a   44 89 42 08  mov     DWORD PTR [rdx+8], r8d

std::swap versions are all identical to the non-static versions.

After having some fun investigating, I found the likely reason of the bad performance of the GCC non-static version. Modern processors have a feature called store-to-load forwarding. This feature kicks in when a memory load match a previous memory store, and shortcut the memory operation to use the value already known. In this case GCC somehow use an asymmetric load/store for parameter A and B. A is copied using 4+4+1 bytes, and B is copied using 8+1 bytes. This means the 8 first bytes of the class won't be matched by the store-to-load forwarding, losing a precious CPU optimization. To check this, I manually replaced the 8+1 copy by a 4+4+1 copy, and the performance went up as expected (code below). In the end, GCC is at fault for not considering this.

GCC patched code, longer but taking advantage of store forwarding (perf 220ms) :

00402F90 44 8B 01             mov         r8d,dword ptr [rcx]  
00402F93 F3 0F 10 41 04       movss       xmm0,dword ptr [rcx+4]  
00402F98 0F B6 41 08          movzx       eax,byte ptr [rcx+8]

00402F9C 4C 8B 0A             mov         r9,qword ptr [rdx]
00402F9F 4C 89 09             mov         qword ptr [rcx],r9
00402F9C 44 8B 0A             mov         r9d,dword ptr [rdx]
00402F9F 44 89 09             mov         dword ptr [rcx],r9d
00402FA2 44 8B 4A 04          mov         r9d,dword ptr [rdx+4]
00402FA6 44 89 49 04          mov         dword ptr [rcx+4],r9d

00402FAA 44 0F B6 4A 08       movzx       r9d,byte ptr [rdx+8]  
00402FAF 44 88 49 08          mov         byte ptr [rcx+8],r9b  
00402FB3 44 89 02             mov         dword ptr [rdx],r8d  
00402FB6 F3 0F 11 42 04       movss       dword ptr [rdx+4],xmm0  
00402FBB 88 42 08             mov         byte ptr [rdx+8],al

Actually, this copy instructions (symmetric 4+4+1) is the right way to do it. In these test we are only doing copies, in which case MSVC version is the best without doubt. The problem is that in a real case, the class member will be accessed individually, thus generating 4 bytes read/write. MSVC 8 bytes batch copy (also generated by GCC for one argument) will prevent future store forwarding for individual members. A new test I did with member operation beside copies show that the patched 4+4+1 versions indeed outperform all the others. And by a factor of nearly x2. Sadly, no modern compiler is generating this code.

@ArneVogel I already checked that. I don't know how you saw that, but agner fog tables indicate identical throughput for movss and mov, for all Intel Core archs. — ElderBug, Apr 01 '15 at 10:15
@ArneVogel Yes, and the throughput for mov r,m page 159 is also 0.5. And as a side note, 0.5 is 4 times faster than 2. About MOVSS/D, it means MOVSS + MOVSD, single precision and double precision float moves. — ElderBug, Apr 01 '15 at 10:38
Sorry, I must have looked at the latency column, and 0.5 is faster, so it would not made have sense this way around anyway. — Arne Vogel, Apr 01 '15 at 11:44
clang 3.4 works bad too... non-static ~565 ms, static - ~600. (gcc version on my machine non-static ~550 ms, static - ~230 ms)... all with -O3 — dFionov, Apr 01 '15 at 13:36
@dFionov I don't have clang here, but with your numbers, clang is probably producing the same bad code as GCC, but also for the static version (GCC at least produce good code for the static version). Honestly, I don't get how GCC ends up emitting 4+4/8 asymmetric copies. If you didn't see, I added the bad performance explanation at the end of my answer. — ElderBug, Apr 01 '15 at 13:53
@ElderBug I see it, great explanation! But it make me fell bad... A lot of code use same swap functions and compiles with gcc or clang... — dFionov, Apr 01 '15 at 14:57
@dFionov I just added another paragraph at the end. Even MSVC is not doing it right either (it is better only in some cases). That makes me feel sad too, I didn't know about that before investigating, and I guess compiler writters don't know either. Oh, and the fast static version is wrong too. You better stick with std::swap, probably the best among the bad :) — ElderBug, Apr 01 '15 at 15:07
@ElderBug It's true - std::swap better way. I was starting this benchmark after my friend question(he just learn static keyword :) ) and trying to show him that it wrong way... But this results... I waste my monday to trying find some useful information about. — dFionov, Apr 01 '15 at 17:16

Reimplementing std::swap() with static tmp variable for simple types C++

1 Answers1