C - fastest method to swap two memory blocks of equal size? (Solution feasibility)

Question

This question is an extension of this one. Here I present two possible solutions and I want to know their feasibility. I am using a Haswell microarchitecture with GCC/ICC compilers. I also assume that the memory is aligned.

OPTION 1 - I have a memory position already allocated and do 3 memory moves. ~~(I use memmove instead of memcpy to avoid the copy constructor)~~.

void swap_memory(void *A, void* B, size_t TO_MOVE){

    /* aux: the already-allocated scratch buffer, at least TO_MOVE bytes */
    memmove(aux, B, TO_MOVE);
    memmove(B, A, TO_MOVE);
    memmove(A, aux, TO_MOVE);
}

OPTION 2 - Use AVX or AVX2 loads and stores, taking advantage of the aligned memory. For this solution I assume that I am swapping int data.

void swap_memory(int *A, int* B, int NUM_ELEMS){

    int i, STOP_VEC = NUM_ELEMS - NUM_ELEMS%8;
    __m256i data_A, data_B;

    for (i=0; i<STOP_VEC; i+=8) {
        data_A = _mm256_load_si256((__m256i*)&A[i]);
        data_B = _mm256_load_si256((__m256i*)&B[i]);

        _mm256_store_si256((__m256i*)&A[i], data_B);
        _mm256_store_si256((__m256i*)&B[i], data_A);
    }

    for (; i<NUM_ELEMS; i++) {
        std::swap(A[i], B[i]);
    }
}

Is option 2 the fastest? Is there another, faster implementation that I didn't mention?

I would have guessed that (with optimizations turned on) gcc/icc would vectorize the loops for you, rather than requiring you to do it manually. — , May 19 '16 at 16:46
OP: "I use memmove instead of memcpy to avoid the copy constructor" → what? Both those functions work only with raw bytes, neither uses copy (or move) constructors or assignment operators. The first works correctly with overlapping ranges, though. — Javier Martín, May 19 '16 at 16:47
It's probably more of a design issue if you need to do all this copying - you should just be able to swap two pointers, surely ? — Paul R, May 19 '16 at 16:52
... scratch that -- if the pointers were marked `__restrict__`, I would expect gcc/icc to vectorize the loops for you. Without `__restrict__`, I'm not sure how many compilers these days will add tests for non-overlapping ranges to check whether it's safe to reorder the operations or not. — , May 19 '16 at 16:56
Why not measure and see for yourself? If option 1 won't turn out to be slow as molasses, colour me surprised. — n. m. could be an AI, May 19 '16 at 17:04
SteveLorimer, I just measured the time, and OPT2 is faster. Paul R, I can't just swap the pointers; I have to swap all the memory. I just want to know if there is another way to do this that is even faster. — Hélder Gonçalves, May 19 '16 at 17:13
If you want even faster, maybe _mm512_load_si512? Might reach a point of diminishing returns, though. Measure the speed of a single memory copy too - you won't be able to get faster than probably half that. At best, you might be able to hint a prefetch part way into each for a small gain, if you can do a little bit of something else ahead of time. — Todd Christensen, May 20 '16 at 04:40

Todd Christensen · Accepted Answer · 2016-05-19T18:04:38.723

2

If you know for sure that the memory is aligned, using AVX may be best. Note that doing it explicitly may not be portable - it might be better to decorate the pointers such that they're known to be aligned (e.g. using an aligned attribute or similar.)

Option 2 (or something semantically equivalent) is most likely faster, since the pointers in option 1 aren't restrict-qualified or anything: the compiler may not know that it's safe to reorder the memory accesses or to leave "aux" untouched.

Further, option 2 may be more threadsafe depending on how aux is set up.

It might be fine to use a local temporary and memcpy to/from that temporary in blocks or even all at once, as gcc might be able to vectorize that. Avoid using external temporaries, and make sure all of your structures are decorated as aligned.
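The "local temporary, in blocks" idea from the last paragraph could look roughly like the sketch below. The function name `swap_blocks` and the chunk size `SWAP_CHUNK` are assumptions made for illustration, not part of the question. Whether the compiler inlines and vectorizes the three `memcpy` calls or emits real calls with a small count is something to check in the generated assembly and to measure.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical chunk size: small enough to stay in L1, tune by measurement. */
enum { SWAP_CHUNK = 256 };

/* Swap n bytes between two non-overlapping blocks through a local buffer. */
void swap_blocks(void *restrict a, void *restrict b, size_t n)
{
    unsigned char tmp[SWAP_CHUNK];   /* local temporary, no external "aux" */
    unsigned char *pa = a;
    unsigned char *pb = b;

    assert(pa != NULL && pb != NULL);

    while (n > 0) {
        /* Full chunks first; the final iteration handles the remainder. */
        size_t len = n < SWAP_CHUNK ? n : SWAP_CHUNK;
        memcpy(tmp, pa, len);
        memcpy(pa, pb, len);
        memcpy(pb, tmp, len);
        pa += len;
        pb += len;
        n -= len;
    }
}
```

Because the temporary lives on the stack, each thread gets its own copy, which avoids the thread-safety concern with a shared "aux".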

edited May 19 '16 at 18:04

answered May 19 '16 at 16:54

Todd Christensen


I don't think you can use `alignas` to tell gcc that a pointer *points to* aligned memory. `alignas` only seems to work for aligning data itself (e.g. `alignas(32) foo[32]`). Option 2 is better because the compiler almost certainly won't optimize the memcpy version into an interleaved loop that doesn't touch `aux`. Your last paragraph will probably end up making multiple calls to memcpy with a small count. – Peter Cordes May 19 '16 at 16:59
Perhaps a pointer to aligned_storage then? I usually just use gcc attributes or defines, etc. and forgot that alignas isn't a thing for pointed to memory. – Todd Christensen May 19 '16 at 18:09
The gcc-specific way is `A = __builtin_assume_aligned(A, 32)`. You can also `typedef __attribute__((aligned(32))) int align32_int`, and then `void foo(align32_int *A) {}`. IIRC, that does work, while alignas in that same typedef doesn't – Peter Cordes May 19 '16 at 18:43
1

Right, the typedef method is the way I typically would do it. Shame alignas can't be used that way. – Todd Christensen May 19 '16 at 18:52
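As a minimal sketch of what this comment thread describes, the loop below combines `restrict` with the GCC/Clang builtin `__builtin_assume_aligned` so the compiler is told both that the blocks do not overlap and that they are 32-byte aligned. The function name `swap_ints_aligned` is hypothetical, and whether the loop is actually vectorized depends on the compiler and flags (for example `-O3 -march=haswell`).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Swap n ints. The caller promises that both blocks are 32-byte aligned
   and do not overlap. GCC/Clang specific because of the builtin. */
void swap_ints_aligned(int *restrict A, int *restrict B, size_t n)
{
    /* Debug-build check of the alignment promise. */
    assert((uintptr_t)A % 32 == 0 && (uintptr_t)B % 32 == 0);

    int *a = __builtin_assume_aligned(A, 32);
    int *b = __builtin_assume_aligned(B, 32);

    for (size_t i = 0; i < n; i++) {
        int t = a[i];
        a[i] = b[i];
        b[i] = t;
    }
}
```

Passing a pointer that is not really 32-byte aligned makes the alignment promise false, so the check is worth keeping in debug builds.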

score 0 · Answer 2 · answered May 19 '16 at 16:59

0

Option 2 does fewer reads, so I would expect it to be faster. Of course, it all depends on the size of the data: the performance advantage will be much smaller if everything fits in the cache.

You can also use the AVX intrinsic _mm256_stream_si256 instead of the stores (then you'll need a fence before reading the memory again).

answered May 19 '16 at 16:59

GdR


2

NT stores are worse if the buffer is small and you're going to read it soon. They evict the destination even if it was hot in cache. – Peter Cordes May 19 '16 at 17:00

score 0 · Answer 3 · answered May 19 '16 at 17:00

0

I would just do the following:

unsigned char t; 
unsigned char *da = A, *db = B; 
while(TO_MOVE--) { 
   t = *da; 
   *da++ = *db; 
   *db++ = t; 
}

On the basis that it's super clear and the optimiser is going to have a good chance of doing a good job.

answered May 19 '16 at 17:00

Wayne Booth


1

This is good with auto-vectorization (`-O3`), but [absolute garbage without (`-O2`)](https://godbolt.org/g/cWhBJZ). You can use `-fopenmp` and `#pragma omp simd` even at -O2, though, [to get good results with gcc](https://godbolt.org/g/EQ3C7I). Since you have `int`s, and an element count of ints, it's silly to cast it to char. That makes the scalar cleanup loop worse. – Peter Cordes May 19 '16 at 17:08
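A sketch of the variant this comment suggests: keep the simple loop, but work on `int` elements and mark it with `#pragma omp simd`. The function name `swap_ints_simd` is made up for illustration. With gcc, the pragma takes effect when building with `-fopenmp` (or `-fopenmp-simd`); without those flags it is ignored and the loop is left to the normal optimizer.

```c
#include <assert.h>
#include <stddef.h>

/* Swap n ints element by element. The pragma asserts that the iterations
   are independent, so the two blocks must not overlap. */
void swap_ints_simd(int *A, int *B, size_t n)
{
    assert(A != NULL && B != NULL);

    #pragma omp simd
    for (size_t i = 0; i < n; i++) {
        int t = A[i];
        A[i] = B[i];
        B[i] = t;
    }
}
```

Working on `int` directly keeps the scalar remainder at a few iterations at most, instead of the byte-by-byte cleanup the `char` version needs.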
