
I'm studying the impact of intrinsic functions on performance, and I'm a little confused: they seem to have no impact at all! I'm trying to fill an array of doubles with two different functions, and I see no difference. I allocated the array with a call to _aligned_malloc with the alignment parameter set to 8. I use Visual Studio 2008 and I compiled in Release mode, both with and without optimizations (/O2 - /Od) and both with and without intrinsics (/Oi) - all four combinations. The two versions follow:

#include <emmintrin.h>  // SSE2 intrinsics: __m128d, _mm_set1_pd
#include <stdint.h>     // uintptr_t

#ifdef _NO_INTRIN

void my_fill(double* vett, double value, int N)
{
    double* last = vett + N;
    while( vett != last)
    {
        *vett++ = value;
    }
}

#else

void my_fill(double* vett, double value, int N)
{
    double* last = vett + N;

    // set "classically" unaligned data, if any
    while( (0xF & (uintptr_t)vett) && vett != last )
        *vett++ = value;

    __m128d* vett_ = (__m128d*)vett;
    // mask that clears the 4 least significant bits; built from an unsigned
    // type because ~0 << 4 shifts a signed int, which is not portable
    uintptr_t fff0 = ~(uintptr_t)0xF;
    // round the end address down to the nearest 16B-aligned address
    __m128d* last_ = (__m128d*)( fff0 & (uintptr_t)last);
    // stop one 16B chunk early as extra caution (last_ is already rounded
    // down); the scalar loop below finishes the tail for odd values of N
    for( ; vett_ < last_-1; vett_++ )
    {
        *vett_ = _mm_set1_pd(value);
    }

    vett = (double*)vett_;
    while(vett != last)
        *vett++ = value;
}

#endif

One last detail: I aligned my data to 8B rather than 16B because I plan to run this function multi-threaded over different portions of the array. Even if I aligned the data to 16B, I couldn't be sure that every portion would be aligned (e.g. 303 elements, 3 threads, 101 elements per thread: the 1st portion is 16B-aligned, but the 2nd portion starts at vett+101*8 bytes ==> unaligned). That's why I tried to implement an alignment-agnostic function. I filled an array of 1M elements on my Intel Atom CPU N570 @ 1.66 GHz and always got the same execution time. So... what's wrong with my approach? Why do I see no difference? Thank you all in advance.

biagiop1986
    How exactly are you measuring the execution time? – WildCrustacean Dec 30 '12 at 16:21
  • Post the benchmarking code as well. – dan3 Dec 30 '12 at 16:22
    If you are not seeing any performance difference at all between the four samples, neither positive nor negative, then you are not measuring correctly. – Bart van Ingen Schenau Dec 30 '12 at 17:12
  • I don't want to read this because you starting with intrinsics of a very particular compiler and platform and you don't even tag your question with it. – Jens Gustedt Dec 30 '12 at 17:40
  • One guess is that the compiler writer already knows all of this, and takes care of any problems whichever way you write your code. – Bo Persson Dec 30 '12 at 19:16
  • Yes, I think Bo is right. Have a look at the code generated by the compiler. I have certainly seen gcc do similar things when I tried to optimize some code, and then realized when I looked at the assembler code that it was actually pretty much the same [once I'd enabled the SSE2 option, which the compiler didn't do by default and thus moaned at my inline assembler] – Mats Petersson Dec 31 '12 at 17:15
  • @BoPersson Thanks a lot! Maybe that's the case – biagiop1986 Dec 31 '12 at 19:33
  • @JensGustedt Thank you, thank you and again thank you. Without your comment, I wouldn't have understood anything. – biagiop1986 Dec 31 '12 at 19:37

1 Answer


If you are not doing any sophisticated computation and are purely writing fixed values into memory, your program's speed will likely be limited by memory bandwidth. The CPU can produce the values internally at a faster rate, but it is bound by the rate at which it can transfer them to RAM (especially with large memory areas that don't fit into the CPU's cache).

rvjr