
I am observing some weird behavior that I am unable to explain. Here are the details:

#include <sched.h>
#include <sys/resource.h>
#include <time.h>
#include <iostream>

void memcpy_test() {
    int size = 32*4;
    char* src = new char[size];
    char* dest = new char[size];
    unsigned int num_cpy = 1024*1024*16; 
    struct timespec start_time__, end_time__;
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &start_time__);
    for(unsigned int i=0; i < num_cpy; ++i) {
        __builtin_memcpy(dest, src, size);
    }
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &end_time__);
    std::cout << "time = " << (double)(end_time__.tv_nsec - start_time__.tv_nsec)/num_cpy << std::endl;
    delete [] src;
    delete [] dest;
}

When I specify -march=native in the compiler options, the generated binary runs 2.7 times slower. Why is that? If anything, I would expect -march=native to produce optimized code. Are there other functions that could show this type of behavior?

EDIT 1: Another interesting point is that if size > 32*4, there is no difference between the run times of the binaries generated with and without the flag.

EDIT 2: Here are the detailed measurements (__builtin_memcpy()):

size = 32 * 4: without -march=native - 7.5 ns, with -march=native - 19.3 ns

size = 32 * 8: without -march=native - 26.3 ns, with -march=native - 26.5 ns

EDIT 3 :

This observation does not change even if I allocate int64_t/int32_t.

EDIT 4 :

size = 8192: without -march=native ~ 2750 ns, with -march=native ~ 2750 ns (earlier there was an error in reporting this number: it was wrongly written as 26.5, now it is correct)

I have run these tests many times and the numbers are consistent across runs.

Faraz
  • and what are the results if the size is (really) large? maybe it does optimize it, but for larger blocks. – Karoly Horvath Jul 23 '11 at 16:33
  • also, what do you measure with plain memcpy? – Karoly Horvath Jul 23 '11 at 16:36
  • And what is really large size ? – Faraz Jul 23 '11 at 16:53
    @Faraz, you're using arrays of `char`. [AFAICT](http://stackoverflow.com/questions/4009463/alignment-of-char-arrays) those are byte-aligned. `__builtin_memcpy()` might be optimized for higher alignment boundaries under `-march=native` on your platform. Can you try with arrays of `int` or `long`? – Frédéric Hamidi Jul 23 '11 at 16:55
  • *I* don't know.. just try to find a size where the -march=native compiled code beats the normal one. – Karoly Horvath Jul 23 '11 at 16:57
  • @Frédéric: I upvoted it but now I realized that it's not going to be true since the memory allocator will align that to 4 or 8 bytes. – Karoly Horvath Jul 23 '11 at 17:03
  • @yi_H, which memory allocator? `::operator new[]`? Why would it do that? :) – Frédéric Hamidi Jul 23 '11 at 17:05
  • there is no difference at all when I allocate int64_t. – Faraz Jul 23 '11 at 17:07
  • @Faraz, do you mean `no difference with or without -march=native` (problem solved) or `no difference compared to the behavior with char` (back to square one)? – Frédéric Hamidi Jul 23 '11 at 17:10
  • @Frédéric: for simplicity, for performance reasons (you have to access that allocated memory later) and allocating it at an arbitrary position is probably also bad for fragmentation. in fact it will probably add padding bytes after it so it ends on a bigger boundary. (eg: it has separate pages for 32, 64, 128, 256 byte long chunks and it will return a free chunk from one that can hold both the requested data and the maintenance extra info) – Karoly Horvath Jul 23 '11 at 17:15
  • @yi_H, For large blocks I am not getting consistent results, basically results are varying for different run, so it looks to me that there is some measurement glitch for larger blocks. Any ideas ? – Faraz Jul 23 '11 at 17:19
  • Oh, got it, let me correct that. Not taking seconds into account for programs that will run more than 1 sec – Faraz Jul 23 '11 at 17:22

2 Answers


I have replicated your findings with g++ (Ubuntu/Linaro 4.5.2-8ubuntu4) 4.5.2, Linux 2.6.38-10-generic #46-Ubuntu x86_64 on my Core 2 Duo. Results will probably vary depending on your compiler version and CPU. I get ~26 ns and ~9 ns.

When I specify -march=native in the compiler options, the generated binary runs 2.7 times slower. Why is that?

Because the -march=native version gets compiled into the following (found using objdump -D; you could also use gcc -S -fverbose-asm):

    rep movsq %ds:(%rsi),%es:(%rdi) ; where rcx = 128 / 8

And the version without it gets compiled into 16 load/store pairs like:

    mov    0x20(%rbp),%rdx
    mov    %rdx,0x20(%rbx)

Which is apparently faster on our machines.

If anything, I would expect -march=native to produce optimized code.

In this case it turned out to be a pessimization to favor rep movsq over a series of moves, but that might not always be the case. The rep movsq version is shorter, which might be better in some (most?) cases. Or it could be a bug in the optimizer.

Are there other functions that could show this type of behavior?

Any function for which the generated code differs when you specify -march=native. Suspects include functions implemented as macros or as static functions in headers, and anything whose name begins with __builtin. Possibly also (floating-point) math functions.

Another interesting point is that if size > 32*4 then there is no difference between the run time of the binaries thus generated

This is because both then compile to rep movsq; 128 is probably the largest size for which GCC will generate a series of load/stores (it would be interesting to see whether this also holds on other platforms). BTW, when the compiler doesn't know the size at compile time (e.g. int size = atoi(argv[1]);), it simply emits a call to memcpy, with or without the switch.

user786653

It's a quite well-known issue (and a really old one).

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43052

Look at one of the bottom comments in the bug report:

"Just FYI: mesa is now defaulting to -fno-builtin-memcmp to workaround this problem"

It looks like glibc's memcpy is far better than the builtin...
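Besides compiling with -fno-builtin-memcpy, one way to force the real glibc memcpy in a benchmark is to call it through a volatile function pointer, which GCC cannot recognize as the builtin and so cannot expand inline. A sketch (the helper name is mine, not from the bug report):

```cpp
#include <cstring>

// A volatile function pointer hides the callee from the compiler,
// so the call always goes to the library memcpy rather than being
// expanded as __builtin_memcpy.
void* (*volatile raw_memcpy)(void*, const void*, std::size_t) = std::memcpy;

void copy_via_libc(void* dst, const void* src, std::size_t n) {
    raw_memcpy(dst, src, n);
}
```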