14

What is the fastest way to swap two non-overlapping memory areas of equal size? Say I need to swap (t_Some *a) with (t_Some *b). Considering the space-time trade-off, will increased temporary space improve the speed? For example, (char *tmp) vs (int *tmp)? I am looking for a portable solution.

Prototype:

void swap_elements_of_array(void* base, size_t size_of_element, int a, int b);
psihodelia

9 Answers

7

The fastest way to move a block of memory is going to be memcpy() from <string.h>. If you memcpy() from a to temp, memcpy() from b to a (the areas don't overlap, so memmove() isn't required), then memcpy() from temp to b, you'll have a swap that uses the optimized library routines, which the compiler probably inlines. You wouldn't want to copy the entire block at once, but rather work in vector-sized chunks, so the temporary buffer stays small.
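
For instance, here is a minimal sketch of that chunked three-memcpy() swap (the 64-byte chunk size and the function name are illustrative choices, not tuned values):

#include <stddef.h>
#include <string.h>

/* Sketch: swap two non-overlapping blocks with three memcpy() calls per
 * small chunk, so the temporary buffer stays tiny. The 64-byte chunk size
 * is an illustrative choice. */
#define SWAP_CHUNK 64

void swap_blocks(void *a, void *b, size_t n)
{
    unsigned char *p = a, *q = b;
    unsigned char tmp[SWAP_CHUNK];

    while (n >= SWAP_CHUNK) {
        memcpy(tmp, p, SWAP_CHUNK);   /* a    -> temp */
        memcpy(p, q, SWAP_CHUNK);     /* b    -> a    */
        memcpy(q, tmp, SWAP_CHUNK);   /* temp -> b    */
        p += SWAP_CHUNK;
        q += SWAP_CHUNK;
        n -= SWAP_CHUNK;
    }
    if (n > 0) {                      /* leftover tail */
        memcpy(tmp, p, n);
        memcpy(p, q, n);
        memcpy(q, tmp, n);
    }
}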

In practice, if you write a tight loop, the compiler can probably tell that you’re swapping every element of the arrays and optimize accordingly. On most modern CPUs, you want to generate vector instructions. It might be able to generate faster code if you make sure all three buffers are aligned.

However, what you really want to do is make things easier for the optimizer. Take this program:

#include <stddef.h>

void swap_blocks_with_loop( void* const a, void* const b, const size_t n )
{
  unsigned char* p;
  unsigned char* q;
  unsigned char* const sentry = (unsigned char*)a + n;

  for ( p = a, q = b; p < sentry; ++p, ++q ) {
     const unsigned char t = *p;
     *p = *q;
     *q = t;
  }
}

If you translate that into machine code as literally written, it’s a terrible algorithm, copying one byte at a time, doing two increments per iteration, and so on. In practice, though, the compiler sees what you’re really trying to do.

In clang 5.0.1 with -std=c11 -O3, it produces (in part) the following inner loop on x86_64:

.LBB0_7:                                # =>This Inner Loop Header: Depth=1
        movups  (%rcx,%rax), %xmm0
        movups  16(%rcx,%rax), %xmm1
        movups  (%rdx,%rax), %xmm2
        movups  16(%rdx,%rax), %xmm3
        movups  %xmm2, (%rcx,%rax)
        movups  %xmm3, 16(%rcx,%rax)
        movups  %xmm0, (%rdx,%rax)
        movups  %xmm1, 16(%rdx,%rax)
        movups  32(%rcx,%rax), %xmm0
        movups  48(%rcx,%rax), %xmm1
        movups  32(%rdx,%rax), %xmm2
        movups  48(%rdx,%rax), %xmm3
        movups  %xmm2, 32(%rcx,%rax)
        movups  %xmm3, 48(%rcx,%rax)
        movups  %xmm0, 32(%rdx,%rax)
        movups  %xmm1, 48(%rdx,%rax)
        addq    $64, %rax
        addq    $2, %rsi
        jne     .LBB0_7

Whereas gcc 7.2.0 with the same flags also vectorizes, unrolling the loop less:

.L7:
        movdqa  (%rcx,%rax), %xmm0
        addq    $1, %r9
        movdqu  (%rdx,%rax), %xmm1
        movaps  %xmm1, (%rcx,%rax)
        movups  %xmm0, (%rdx,%rax)
        addq    $16, %rax
        cmpq    %r9, %rbx
        ja      .L7

Convincing the compiler to produce instructions that work on a single word at a time, instead of vectorizing the loop, is the opposite of what you want!

Davislor
5

Your best bet is to maximize register usage, so that when you read a temporary you don't end up with extra (likely cached) memory accesses. The number of registers depends on the system, and register allocation (the logic that maps your variables onto actual registers) depends on the compiler. So the safest bet is to assume only one available register and to assume its size is the same as a pointer's, which boils down to a simple for-loop dealing with the blocks interpreted as arrays of size_t.
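
A minimal sketch of such a loop, assuming both blocks are suitably aligned for size_t and the size is an exact multiple of sizeof(size_t) (the function name is illustrative; see the alignment caveat in the comment below):

#include <stddef.h>

/* Sketch only: assumes n is a multiple of sizeof(size_t) and that both
 * blocks are suitably aligned; reading arbitrary memory as size_t is not
 * strictly portable. */
static void swap_blocks_wordwise(void *a, void *b, size_t n)
{
    size_t *p = a;
    size_t *q = b;
    size_t count = n / sizeof(size_t);

    while (count--) {
        size_t t = *p;   /* the single temporary, expected to stay in a register */
        *p++ = *q;
        *q++ = t;
    }
}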

sharptooth
  • Unless the two blocks have different alignments, in which case the loop is not so simple, since you can't portably interpret as a `size_t[]`. – Steve Jessop Nov 17 '11 at 12:08
2

Word-sized writes will be the fastest. However, both block size and alignment need to be considered. In practice things are usually aligned sensibly, but you shouldn't count on it. memcpy() handles everything safely and may be specialized (built in) for constant sizes within reason.

Here is a portable solution that works reasonably well in most cases.

#include <stddef.h>   /* size_t */
#include <string.h>   /* memcpy */

static void swap_byte(void* a, void* b, size_t count)
{
    char* x = (char*) a;
    char* y = (char*) b;

    while (count--) {
        char t = *x; *x = *y; *y = t;
        x += 1;
        y += 1;
    }
}

static void swap_word(void* a, void* b, size_t count)
{
    char* x = (char*) a;
    char* y = (char*) b;
    long t[1];

    while (count--) {
        memcpy(t, x, sizeof(long));
        memcpy(x, y, sizeof(long));
        memcpy(y, t, sizeof(long));
        x += sizeof(long);
        y += sizeof(long);
    }
}

void memswap(void* a, void* b, size_t size)
{
    size_t words = size / sizeof(long);
    size_t bytes = size % sizeof(long);
    swap_word(a, b, words);
    a = (char*) a + words * sizeof(long);
    b = (char*) b + words * sizeof(long);
    swap_byte(a, b, bytes);
}
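
For illustration, a hypothetical usage example (assuming the three functions above, plus the two includes, are in the same file; the struct is just an example type):

#include <stdio.h>

struct point { double x, y, z; };

int main(void)
{
    struct point p = {1.0, 2.0, 3.0};
    struct point q = {4.0, 5.0, 6.0};

    memswap(&p, &q, sizeof p);        /* p is now {4,5,6}, q is {1,2,3} */
    printf("%.0f %.0f\n", p.x, q.x);  /* prints: 4 1 */
    return 0;
}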
denizen666
2

If the two memory areas are large and occupy an integer number of memory pages, then you can swap their page table entries in order to swap their contents without using memcpy() or XORs.

In theory, with two large 2 MiB pages, you need to write only 16 bytes of paging structures to swap their mappings in the virtual address space, and hence their contents, too.

1 GiB pages are also possible on x86-64 CPUs in 64-bit mode, and the contents of two such 1 GiB memory blocks can likewise be swapped by writing only a few bytes of paging structures.

The caveat of this method is that access to the paging structures requires Kernel Mode privileges, or the use of shared memory mapping functions from User Mode.

With recent Meltdown patches (KPTI), transitioning from User Mode to Kernel Mode has become much more expensive, probably too expensive for 4 KiB page swaps to stay competitive with memcpy(), but if you have 2 MiB or larger memory blocks to swap, then swapping their paging structures is faster.
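
As a rough, non-portable illustration of the User Mode variant on Linux (memfd_create() and MAP_FIXED remapping are Linux/glibc-specific, and all names and sizes below are illustrative), two blocks backed by the same shared memory object can be "swapped" by remapping rather than copying:

/* Linux-only sketch: "swap" two page-aligned 2 MiB blocks by remapping them,
 * so no data bytes are copied. memfd_create() needs glibc 2.27+. */
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    const size_t len = 2u * 1024 * 1024;            /* size of each block */
    int fd = memfd_create("swap_demo", 0);
    if (fd < 0 || ftruncate(fd, 2 * (off_t)len) != 0)
        return 1;

    /* a is backed by file offset 0, b by file offset len */
    char *a = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    char *b = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, len);
    if (a == MAP_FAILED || b == MAP_FAILED)
        return 1;

    memset(a, 'A', len);
    memset(b, 'B', len);

    /* Swap by replacing each mapping with one that points at the other
     * file offset; only page-table state changes, not the data itself. */
    mmap(a, len, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_FIXED, fd, len);
    mmap(b, len, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_FIXED, fd, 0);

    printf("%c %c\n", a[0], b[0]);                  /* prints: B A */
    return 0;
}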

George Robinson
  • This solution seems like the opposite of portable; the OP does not tag an OS, only C – Evan Benn Feb 16 '18 at 03:11
  • Yes, this solution is not very portable, but it is not completely non-portable either, because it will work on any CPU that has a memory paging unit. That means any large Intel or AMD CPU and some ARM CPUs, which includes most server, desktop and mobile CPUs. Not microcontrollers, though... – George Robinson Feb 16 '18 at 18:18
  • You have convinced me the concept is at least reasonably portable. But would the implementation across those different OS, compiler, and chip combinations really be the same C code? – Evan Benn Feb 18 '18 at 23:55
  • From Kernel Mode, the code to swap two pages of memory would be identical on any x86 Intel/AMD CPU, regardless of the OS, but the code to obtain the addresses of these two pages might be OS specific. From User Mode, all of the code would be OS specific, because you have to rely on memory mapping functions exposed by the kernel. The same goes for ARM CPUs and OSes running on them. – George Robinson Feb 20 '18 at 16:27
0

Thought I'd share my simple solution I've been using for ages on microcontrollers without drama.

#define swap(type, x, y) do { type _tmp = (x); (x) = (y); (y) = _tmp; } while (0)

OK... it creates a stack variable but it's usually for uint8_t, uint32_t, float, double, etc. However it should work on structures just as well.

The compiler should be smart enough to see the stack variable can be swapped for a register when the size of the type permits.

Really only meant for small types... which will probably suit 99% of cases.

Could also use "auto" instead of passing the type... but I like to be more flexible and I suppose "auto" could be passed as the type.

examples...

swap(uint8_t, var1, var2);
swap(float, fv1, fv2);
swap(uint32_t, *p1, *p2); // will swap the contents as p1 and p2 are pointers
swap(auto, var1, var2);   // should work fine as long as var1 and var2 are the same type
TheWhitde
0

The speed for this will be partly platform dependent and only really borne out by testing.

Personally, I'd favour creating a memory block of equal size to one of the arrays and using memcpy to swap the contents around, with the newly created memory block as swap space.

Now the size of the memory block will have an impact on the speed of operation (again platform dependent) and so you may find that for very large arrays swapping smaller amounts of data back and forth is faster than swapping a large chunk each time.

edit

In light of the comment, let me explain my last point about swapping smaller amounts of data.

Your aim is to transfer a's data to b and b's data to a using a temporary swap space tmp.

The size of tmp is equal to or less than the size of a or b, and the number of swap iterations increases as the size of tmp is reduced; e.g. if tmp is a tenth of the size of a, then 10 iterations are needed.

Now, in order to aid the speed of memcpy, it is best to ensure that the arrays (a, b and tmp) are allocated in aligned memory.
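
A rough sketch of that idea (the function name, the use of C11 aligned_alloc(), and the 64-byte alignment are assumptions of mine; tmp_size is assumed to be a multiple of 64):

#include <stdlib.h>
#include <string.h>

/* Sketch: swap two 'size'-byte arrays through a temporary block of only
 * tmp_size bytes, so large arrays are swapped in several passes. tmp_size
 * is assumed to be a multiple of 64, since C11 aligned_alloc() requires
 * the size to be a multiple of the alignment. */
int swap_with_temp_block(void *a, void *b, size_t size, size_t tmp_size)
{
    unsigned char *p = a, *q = b;
    unsigned char *tmp = aligned_alloc(64, tmp_size);
    if (tmp == NULL)
        return -1;

    while (size > 0) {
        size_t n = (size < tmp_size) ? size : tmp_size;
        memcpy(tmp, p, n);   /* a   -> tmp */
        memcpy(p, q, n);     /* b   -> a   */
        memcpy(q, tmp, n);   /* tmp -> b   */
        p += n;
        q += n;
        size -= n;
    }
    free(tmp);
    return 0;
}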

ChrisBD
0

You could use the logic described here. This way, you avoid needing a third buffer.

#include <stddef.h>
#include <stdint.h>
void swap(uint8_t *a, uint8_t *b, size_t length) {
    size_t i;
    for (i=0; i<length; i++) {
        uint8_t aa = a[i];
        aa^=b[i];
        b[i]^=aa;
        aa^=b[i];
        a[i] = aa;
    }
}

Even this single temporary variable is enough to let the compiler optimize this.


But if you use such a temporary variable anyway, you might as well do

#include <stddef.h>
#include <stdint.h>
void swap(uint8_t *a, uint8_t *b, size_t length) {
    size_t i;
    for (i=0; i<length; i++) {
        uint8_t aa = a[i];
        a[i] = b[i];
        b[i] = aa;
    }
}

At first glance, both of them look expensive due to the many array accesses (in the first case) and the processing of only one byte per loop iteration, but if you let your compiler optimize this, it should be fine, as gcc (at least) is smart enough to bundle 4 steps (on x64: even 16 steps) into one loop iteration.

Note that your compiler might not optimize so aggressively, so you may have to do that splitting yourself. In that case, take care with alignment.

glglgl
  • -1: This invokes undefined behaviour. And the XOR-swap trick will probably preclude compiler optimisations. – Oliver Charlesworth Nov 17 '11 at 12:18
  • 1. As I said, at least gcc even recognizes what one is trying to do and optimizes it, and 2. could you be more specific about the UB? – glglgl Nov 17 '11 at 12:22
  • Lack of sequence points. Take a look at the [Wikipedia article](http://en.wikipedia.org/wiki/XOR_swap_algorithm). – Oliver Charlesworth Nov 17 '11 at 12:25
  • You are completely right. Changed example to a) the correct XOR algorithm and b) to the "normal" swap algorithm, as saving a variable is not achieved with the first one. – glglgl Nov 17 '11 at 12:42
0
#include <string.h>
#include <stdio.h>

static void swap_elements_of_array(void* base, size_t size_of_element, int a, int b);

static void swap_elements_of_array(void* base, size_t size_of_element, int a, int b)
{
    union {
        int i;                      /* force alignment */
        char zzz[size_of_element];  /* VLA */
    } swap;

    memcpy(swap.zzz, (char*)base + a * size_of_element, size_of_element);
    memcpy((char*)base + a * size_of_element, (char*)base + b * size_of_element, size_of_element);
    memcpy((char*)base + b * size_of_element, swap.zzz, size_of_element);
}

int main(void)
{
    unsigned idx, array[] = {0,1,2,3,4,5,6,7,8,9};

    swap_elements_of_array(array, sizeof array[0], 2, 5);

    for (idx = 0; idx < 10; idx++) {
        printf("%u%c", array[idx], (idx == 9) ? '\n' : ' ');
    }
    return 0;
}

The intention of the above fragment is to allow the highly optimised libc versions of memcpy (or inlining by the compiler) to take all the freedom they need. The alignment is crucial. If VLAs are not available (before C99), a macro can be composed, using a funky do-while.
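
One possible shape of such a macro (a sketch; the fixed maximum element size is an assumption on my part, not something the original fragment specifies):

/* Pre-C99 sketch: no VLA, so the union buffer has a fixed upper bound on
 * the element size (an assumed limit; size_of_element must not exceed it).
 * The do-while lets the macro be used like a single statement.
 * Requires <string.h> for memcpy, as above. */
#define MAX_ELEMENT_SIZE 64

#define SWAP_ELEMENTS_OF_ARRAY(base, size_of_element, a, b)              \
    do {                                                                 \
        union { int i; char zzz[MAX_ELEMENT_SIZE]; } swap_;              \
        memcpy(swap_.zzz, (char*)(base) + (a) * (size_of_element),       \
               (size_of_element));                                       \
        memcpy((char*)(base) + (a) * (size_of_element),                  \
               (char*)(base) + (b) * (size_of_element),                  \
               (size_of_element));                                       \
        memcpy((char*)(base) + (b) * (size_of_element), swap_.zzz,       \
               (size_of_element));                                       \
    } while (0)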

wildplasser
  • If `size_of_element` is large, this doesn't look very efficient from a cache perspective, unless the compiler is smart enough to interleave the `memcpy`s. – Oliver Charlesworth Nov 17 '11 at 12:20
  • C99 style is not highly portable. Are you sure that memcpy is faster than loop: swp(i32,i32) because of temporary memory (which is not registers)? – psihodelia Nov 17 '11 at 12:23
  • Good enough for government work. It is hard to outsmart libc, without being an expert in assembly. I agree that for larger sizes an "outside" inner loop (sizeof cache, aligned on cache boundary) would probably be better. – wildplasser Nov 17 '11 at 12:26
  • @psilodelia: that's what I said: macroizing the above code is trivial. (and left as an exercise) – wildplasser Nov 17 '11 at 12:28
0

Obviously, you have to copy A to Temp, copy B to A, then copy Temp to B. You can do this all at once for a small area, or in sections for a larger area where you don't want to allocate such a large Temp buffer. The choice of section size is up to you, though for large, frequent moves it is important to consider the alignment and cache behaviour appropriate to the hardware.

(Well, actually there is another way, which doesn't require any temp space: XOR A with B, then XOR B with A, then XOR A with B. An old assembly programmer's trick.)

Hot Licks