I am observing some weird behavior which I am not being able to explain. Following are the details :-
#include <sched.h>
#include <sys/resource.h>
#include <time.h>
#include <iostream>
void memcpy_test() {
int size = 32*4;
char* src = new char[size];
char* dest = new char[size];
general_utility::ProcessTimer tmr;
unsigned int num_cpy = 1024*1024*16;
struct timespec start_time__, end_time__;
clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &start_time__);
for(unsigned int i=0; i < num_cpy; ++i) {
__builtin_memcpy(dest, src, size);
}
clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &start_time__);
std::cout << "time = " << (double)(end_time__.tv_nsec - start_time__.tv_nsec)/num_cpy << std::endl;
delete [] src;
delete [] dest;
}
When I specify -march=native in compiler options, generated binary runs 2.7 times slower. Why is that ? If anything, I would expect -march=native to produce optimized code. Is there other functions which could show this type of behavior ?
EDIT 1: Another interesting point is that if size > 32*4 then there is no difference between the run time of the binaries thus generated
EDIT 2: Following are the detailed performance analysis (__builtin_memcpy()) :-
size = 32 * 4, without -march=native - 7.5 ns, with -march=native - 19.3
size = 32 * 8, without -march=native - 26.3 ns, with -march=native - 26.5
EDIT 3 :
This observation does not change even if I allocate int64_t/int32_t.
EDIT 4 :
size = 8192, without -march=native ~ 2750 ns, with -march=native ~ 2750 (Earlier, there was an error in reporting this number, it was wrongly written as 26.5, now it is correct )
I have run these many times and numbers are consistent across each run.