I'm studying the impact of intrinsic functions on performance, and I'm a bit confused: they seem to have no impact at all! I fill an array of doubles with two different functions and see no difference in execution time. The array is allocated with a call to _aligned_malloc with the alignment parameter set to 8. I use Visual Studio 2008 and compiled in Release mode with all four combinations of optimization (/O2 or /Od) and intrinsics (with or without /Oi). The two versions follow:
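#include <stddef.h>    // uintptr_t (as provided by MSVC)
#include <emmintrin.h> // SSE2: __m128d, _mm_set1_pd
#ifdef _NO_INTRIN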
void my_fill(double* vett, double value, int N)
{
    double* last = vett + N;
    while( vett != last )
    {
        *vett++ = value;
    }
}
#else
void my_fill(double* vett, double value, int N)
{
    double* last = vett + N;
    // store scalars one at a time until vett reaches a 16-byte boundary (or the end)
    while( (0xF & (uintptr_t)vett) && vett != last )
        *vett++ = value;
    __m128d* vett_ = (__m128d*)vett;
    // mask that clears the 4 least significant bits of an address
    uintptr_t fff0 = ~(uintptr_t)0xF;
    // round the end address down to the previous 16-byte boundary
    __m128d* last_ = (__m128d*)( fff0 & (uintptr_t)last );
    // stop one vector early; the scalar loop below fills whatever is left,
    // including the odd element when N is odd
    for( ; vett_ < last_-1; vett_++ )
    {
        *vett_ = _mm_set1_pd(value);
    }
    vett = (double*)vett_;
    // fill the remaining elements one at a time
    while( vett != last )
        *vett++ = value;
}
#endif
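For anyone who wants to reproduce the comparison, a minimal timing sketch could look like the following. It is an assumption about how such a measurement might be set up, not the exact code behind the numbers in this question: fill_fn, scalar_fill and time_fill are hypothetical names, and scalar_fill is only a stand-in so the sketch compiles on its own (either version of my_fill can be passed instead).

```cpp
#include <cassert>
#include <cstddef>
#include <ctime>
#include <vector>

// Hypothetical function-pointer type matching the signature of my_fill.
typedef void (*fill_fn)(double* vett, double value, int N);

// Plain scalar fill, present only so this sketch is self-contained.
void scalar_fill(double* vett, double value, int N)
{
    double* last = vett + N;
    while (vett != last)
        *vett++ = value;
}

// Calls `fill` `repeats` times on an n-element buffer and returns the
// average CPU time per call in seconds, as reported by std::clock().
double time_fill(fill_fn fill, int n, int repeats)
{
    assert(n >= 0 && repeats > 0);
    std::vector<double> buf(static_cast<std::size_t>(n), 0.0);
    double* p = buf.empty() ? 0 : &buf[0];

    std::clock_t start = std::clock();
    for (int r = 0; r < repeats; ++r)
        fill(p, static_cast<double>(r), n);
    std::clock_t stop = std::clock();

    // Read one element back through a volatile so the stores are used.
    volatile double sink = n > 0 ? buf[static_cast<std::size_t>(n) - 1] : 0.0;
    (void)sink;

    return static_cast<double>(stop - start) / CLOCKS_PER_SEC / repeats;
}
```

Calling time_fill(my_fill, 1000000, 100) once per build configuration would give one comparable number per variant; repeating the fill averages out the coarse granularity of std::clock().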
One last detail: I aligned my data to 8 bytes rather than 16 because I plan to run this function from multiple threads on different portions of the array. Even with the array aligned to 16 bytes, I couldn't be sure that every portion would be aligned (e.g. 303 elements, 3 threads, 101 elements per thread: the 1st portion is 16-byte aligned, but the 2nd starts at byte offset 101*8 = 808, which is not a multiple of 16). That's why I tried to implement an alignment-agnostic function. I filled an array of 1M elements on my Intel Atom N570 CPU @ 1.66 GHz and always got the same execution time. So... what's wrong with my approach? Why do I see no difference? Thank you all in advance.
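To make the alignment arithmetic in the 303-element example concrete, here is a small sketch. The helper names chunk_byte_offset and chunk_is_16_aligned are hypothetical, and the code assumes 8-byte doubles and a base pointer that is itself 16-byte aligned.

```cpp
#include <cassert>
#include <cstddef>

// Byte offset (from the start of the array) of the first element of chunk
// `chunk_index`, when the array is split into chunks of `chunk_len` doubles.
std::size_t chunk_byte_offset(std::size_t chunk_index, std::size_t chunk_len)
{
    assert(sizeof(double) == 8); // the arithmetic in the question assumes this
    return chunk_index * chunk_len * sizeof(double);
}

// True if that chunk starts on a 16-byte boundary, assuming the array base
// is 16-byte aligned.
bool chunk_is_16_aligned(std::size_t chunk_index, std::size_t chunk_len)
{
    return chunk_byte_offset(chunk_index, chunk_len) % 16 == 0;
}
```

With 101 elements per thread, chunk 0 starts at offset 0 (aligned), chunk 1 at offset 808 (808 mod 16 = 8, so unaligned) and chunk 2 at offset 1616 (aligned again), which is why each thread's portion has to handle a possibly unaligned first element.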