I am trying to compare two rows of pixels. A pixel is defined as a struct containing 4 float values (RGBA).
The reason I am not using memcmp is that I need to return the position of the first differing pixel, which memcmp does not do.
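To make the requirement concrete, here is a minimal scalar sketch of the behaviour I want. The Pixel layout and the name PixelMemCmpRef are assumptions made for illustration; it simply calls memcmp once per pixel and is not meant to be fast.

```cpp
#include <cstring>

// Assumed layout: the post only says "a struct containing 4 float values (RGBA)".
struct Pixel {
    float r, g, b, a;
};

// Hypothetical reference version: returns the index of the first pixel whose
// bytes differ between the two rows, or -1 if all `count` pixels match.
inline int PixelMemCmpRef(const Pixel* a, const Pixel* b, int count)
{
    for (int i = 0; i < count; i++)
    {
        // Bytewise comparison of one 16-byte pixel.
        if (std::memcmp(a + i, b + i, sizeof(Pixel)) != 0) return i;
    }
    return -1;
}
```

Note that a bytewise comparison is not identical to a float comparison: a float equality test treats NaN as unequal to itself and +0.0 as equal to -0.0, while memcmp only looks at the bit patterns.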
My first implementation uses SSE intrinsics and is ~30% slower than memcmp:
    inline int PixelMemCmp(const Pixel* a, const Pixel* b, int count)
    {
        for (int i = 0; i < count; i++)
        {
            // Aligned loads: both rows must start on a 16-byte boundary.
            __m128 x = _mm_load_ps((const float*)(a + i));
            __m128 y = _mm_load_ps((const float*)(b + i));
            __m128 cmp = _mm_cmpeq_ps(x, y);
            // All four float lanes equal -> mask is 0b1111 (15).
            if (_mm_movemask_ps(cmp) != 15) return i;
        }
        return -1;
    }
I then found that treating the values as integers instead of floats sped things up a bit; this version is only ~20% slower than memcmp:
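    inline int PixelMemCmp(const Pixel* a, const Pixel* b, int count)
    {
        for (int i = 0; i < count; i++)
        {
            // Same aligned loads, but the 16 bytes are treated as four 32-bit integers.
            __m128i x = _mm_load_si128((const __m128i*)(a + i));
            __m128i y = _mm_load_si128((const __m128i*)(b + i));
            __m128i cmp = _mm_cmpeq_epi32(x, y);
            // One mask bit per byte: all 16 bytes equal -> mask is 0xffff.
            if (_mm_movemask_epi8(cmp) != 0xffff) return i;
        }
        return -1;
    }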
From what I've read on other questions, the MS implementation of memcmp also uses SSE. My question is: what other tricks does the MS implementation have up its sleeve that I don't? How is it still faster even though it does a byte-by-byte comparison?
Is alignment an issue? If the pixel contains 4 floats, won't an array of pixels already be allocated on a 16-byte boundary?
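One way to check the alignment assumption is sketched below. The struct definitions and the IsAligned16 helper are hypothetical, not my actual code. As far as the language is concerned, a plain struct of four floats only has the alignment of float, so a 16-byte boundary would have to come from the allocator or from an explicit alignas(16).

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical layouts; the real Pixel definition is not shown in this post.
struct PixelPlain { float r, g, b, a; };                // alignment follows float
struct alignas(16) PixelAligned { float r, g, b, a; };  // explicitly 16-byte aligned

// Assumes a 4-byte float, which holds on mainstream platforms.
static_assert(sizeof(PixelPlain) == 16, "four packed 4-byte floats");
static_assert(alignof(PixelPlain) == alignof(float), "no stricter than its members");
static_assert(alignof(PixelAligned) == 16, "alignas(16) raises the requirement");

// Returns true if p sits on a 16-byte boundary, as required by aligned
// loads such as _mm_load_ps and _mm_load_si128.
inline bool IsAligned16(const void* p)
{
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}
```

Calling IsAligned16 on the start of each row before the loop (for example in a debug assertion) would show whether the rows really begin on a 16-byte boundary.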
I am compiling with /O2 and all the optimization flags.