
I am trying to compare two rows of pixels.

A pixel is defined as a struct containing 4 float values (RGBA).
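Roughly like this (a sketch only; the field names here are illustrative):

struct Pixel
{
    float r, g, b, a; // RGBA, 16 bytes per pixel
};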

The reason I am not using memcmp is that I need to return the position of the first differing pixel, which memcmp does not do.
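In other words, I want something equivalent to this scalar reference version, just faster (sketch only):

#include <cstring>

// Scalar reference: return the index of the first differing pixel, or -1 if the rows are equal.
inline int PixelCmpScalar(const Pixel* a, const Pixel* b, int count)
{
    for (int i = 0; i < count; i++)
    {
        if (memcmp(a + i, b + i, sizeof(Pixel)) != 0) return i;
    }
    return -1;
}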

My first implementation uses SSE intrinsics, and is ~30% slower than memcmp:

inline int PixelMemCmp(const Pixel* a, const Pixel* b, int count)
{
    for (int i = 0; i < count; i++)
    {
        __m128 x = _mm_load_ps((float*)(a + i));
        __m128 y = _mm_load_ps((float*)(b + i));
        __m128 cmp = _mm_cmpeq_ps(x, y);
        if (_mm_movemask_ps(cmp) != 15) return i;
    }
    return -1;
}

I then found that treating the values as integers instead of floats sped things up a bit; the comparison is now only ~20% slower than memcmp.

inline int PixelMemCmp(const Pixel* a, const Pixel* b, int count)
{
    for (int i = 0; i < count; i++)
    {
        __m128i x = _mm_load_si128((__m128i*)(a + i));
        __m128i y = _mm_load_si128((__m128i*)(b + i));
        __m128i cmp = _mm_cmpeq_epi32(x, y);
        if (_mm_movemask_epi8(cmp) != 0xffff) return i; 
    }
    return -1;
}

From what I've read on other questions, the MS implementation of memcmp is also implemented using SSE. My question is: what other tricks does the MS implementation have up its sleeve that I don't? How is it still faster even though it does a byte-by-byte comparison?

Is alignment an issue? If the pixel contains 4 floats, won't an array of pixels already be allocated on a 16-byte boundary?
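(For reference, here is how I could force 16-byte alignment myself if it turns out not to be guaranteed. This is only a sketch: alignas is C++11 and _aligned_malloc is the MSVC CRT allocator, neither of which is shown in the code above.)

// Option 1: make every Pixel (and any array of Pixel) 16-byte aligned.
struct alignas(16) Pixel   // or __declspec(align(16)) on older MSVC
{
    float r, g, b, a;
};

// Option 2: align each row at allocation time (requires <malloc.h> on MSVC).
// Pixel* row = static_cast<Pixel*>(_aligned_malloc(count * sizeof(Pixel), 16));
// ...
// _aligned_free(row);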

I am compiling with /O2 and all the optimization flags.

Rotem
  • No, there's no guarantee that it will be aligned on 16 bytes unless you take care of it yourself. – interjay Feb 10 '13 at 10:14
  • Yes, alignment is an issue. The compilation options you use are also relevant. You should also show the generated asm. Maybe the compiler is missing loop unrolling or some other optimization on your code. – Marc Glisse Feb 10 '13 at 10:18

3 Answers


You might want to check this memcmp SSE implementation, specifically the __sse_memcmp function. It starts with some sanity checks and then checks whether the pointers are aligned:

aligned_a = ( (unsigned long)a & (sizeof(__m128i)-1) );
aligned_b = ( (unsigned long)b & (sizeof(__m128i)-1) );

If they are not aligned, it compares byte by byte until it reaches an aligned address:

while( len && ( (unsigned long) a & ( sizeof(__m128i)-1) ) )
{
    if(*a++ != *b++) return -1;
    --len;
}

It then compares the remaining memory with SSE instructions, similar to your code:

if(!len) return 0;

while( len && !(len & 7 ) )
{
    __m128i x = _mm_load_si128( (__m128i*)&a[i]);
    __m128i y = _mm_load_si128( (__m128i*)&b[i]);
    ....
iabdalkader
  • Thank you, that code was useful in a way, though I'm dealing with data that can be pre-aligned so all the sanity checks and tail logic is not relevant to my case. – Rotem Feb 10 '13 at 12:34
  • @Rotem if you can align the data, then you don't really need that. Have you tested the code with alignment? – iabdalkader Feb 10 '13 at 12:43
  • Yes, results were identical (aligned via `_aligned_malloc`). I am not sure how to explain that fact. – Rotem Feb 10 '13 at 12:45

I have written strcmp/memcmp optimizations with SSE (and MMX/3DNow!), and the first step is to ensure that the arrays are as aligned as possible - you may find that you have to do the first and/or last bytes "one at a time".

If you can align the data before it gets to the loop [if your code does the allocation], then that's ideal.

The second part is to unroll the loop, so you don't pay for so many "if we're not at the end of the loop, jump back to the beginning" checks - assuming the loop is quite long.

You may also find that preloading the next input data before evaluating the "do we leave now" condition helps.

Edit: The last paragraph may need an example. This code assumes the loop is unrolled by a factor of two, with count even and at least two:

__m128i x = _mm_load_si128((__m128i*)(a));
__m128i y = _mm_load_si128((__m128i*)(b));

for (int i = 0; i < count; i += 2)
{
    // Compare the values loaded before the loop (or at the end of the previous iteration).
    __m128i cmp = _mm_cmpeq_epi32(x, y);

    // Preload the next element before testing, so the loads overlap the branch.
    __m128i x1 = _mm_load_si128((__m128i*)(a + i + 1));
    __m128i y1 = _mm_load_si128((__m128i*)(b + i + 1));

    if (_mm_movemask_epi8(cmp) != 0xffff) return i;
    cmp = _mm_cmpeq_epi32(x1, y1);

    // Preload for the next iteration (assign to the outer x and y; note this
    // reads one element past the end of the arrays on the final iteration).
    x = _mm_load_si128((__m128i*)(a + i + 2));
    y = _mm_load_si128((__m128i*)(b + i + 2));
    if (_mm_movemask_epi8(cmp) != 0xffff) return i + 1;
}

Roughly something like that.

Mats Petersson
  • Unrolling made all the difference! I unrolled by a factor of 4 and went from 20% slower than `memcmp` to 20% faster than `memcmp`. For some reason, aligning seemed to make no difference at all (`malloc` vs `_aligned_malloc(16)`). Can you please explain your last paragraph? I did not understand what you meant. – Rotem Feb 10 '13 at 12:28
  • If your input arrays weren't already 16-byte aligned, then you would get a crash, because you are using the aligned versions of the load functions (e.g. `_mm_load_si128()` versus `_mm_loadu_si128()`). If you want to be robust against potentially unaligned inputs, then you can use the unaligned load functions, but there will be a slight performance hit even if the arrays are aligned. – Jason R Feb 10 '13 at 15:04
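(To illustrate Jason R's point, here is a sketch of the unaligned-load variant: it is the same loop as in the question, only the load intrinsics change.)

#include <emmintrin.h>

// Safe even when the rows are not 16-byte aligned, at a small cost on aligned data.
inline int PixelMemCmpUnaligned(const Pixel* a, const Pixel* b, int count)
{
    for (int i = 0; i < count; i++)
    {
        __m128i x = _mm_loadu_si128((const __m128i*)(a + i));
        __m128i y = _mm_loadu_si128((const __m128i*)(b + i));
        __m128i cmp = _mm_cmpeq_epi32(x, y);
        if (_mm_movemask_epi8(cmp) != 0xffff) return i;
    }
    return -1;
}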

I cannot help you directly because I'm on a Mac, but there's an easy way to figure out what is happening:

Just step into memcmp in debug mode and switch to the Disassembly view. Since memcmp is a small, simple function, you will easily figure out all the implementation tricks.

Jurlie