Processing byte pixels with SSE/SSE2 intrinsics in C

Question

I am writing a cross-platform C library to do various things to webcam images. All operations are per-pixel and highly parallelizable, for example applying bit masks, multiplying color values by constants, etc. Therefore I think I can gain performance by using SSE/SSE2 intrinsics.

However, I am having a data format problem. My webcam library gives me webcam frames as a pointer (void*) to a buffer containing 24- or 32-bit byte pixels in ABGR or BGR format. I have been casting these to char* so that ptr++ etc behaves correctly. However, all the SSE/SSE2 operations expect either four integers or four floats, in the __m128 or __m64 data types. If I do this (assuming I have read the color values from the buffer into chars r, g, and b):

float pixel[] = {(float)r, (float)g, (float)b, 0.0f};

then load another float array full of constants

float constants[] = {0.299f, 0.587f, 0.114f, 0.0f};

cast both float pointers to __m128*, and use the _mm_mul_ps intrinsic to do r * 0.299, g * 0.587, etc., there is no overall performance gain, because shuffling the data around takes up so much time!

Does anyone have any suggestions for how I can load these byte pixel values quickly and efficiently into the SSE registers so that I actually get a performance gain from operating on them as such?

Do you need to perform floating-point operations? There is also MMX, which works on integer types. — Drew Dormann, Dec 22 '09 at 00:38
Indeed. If you're working on integer types, you should use integral SIMD instructions, rather than floating-point ones. — Anon., Dec 22 '09 at 00:48
I do not need to do anything floating point, so you're right, MMX integer instructions are perfectly adequate. — Ben Englert, Dec 22 '09 at 01:08
However, the question of how to efficiently turn a buffer of raw byte pixels into integers so I can potentially SIMD four at once remains. — Ben Englert, Dec 22 '09 at 01:39
Remember to use aligned buffers for your __m128 and __m64 data types. At least on some platforms, they have stricter alignment requirements than your C compiler will guarantee. — Adrian McCarthy, Dec 22 '09 at 15:40

score 1 · Accepted Answer · answered Dec 22 '09 at 05:53

If you are willing to use MMX...

MMX gives you a bunch of 64 bit registers that can treat each register as 8, 8-bit values.

Like the 8-bit values you're working with.

There's a good primer here.
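As a rough illustration of treating a register as a row of 8-bit values, here is a minimal sketch. It swaps in the 128-bit SSE2 integer registers (sixteen bytes at a time) in place of the 64-bit MMX ones. The function name mask_and_brighten and its parameters are hypothetical, and the buffer is assumed to be a plain array of unsigned bytes with no alignment guarantee, which is why the unaligned load and store intrinsics are used.

```c
#include <emmintrin.h>  /* SSE2 integer intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: AND every byte with `mask`, then add `add`
 * with unsigned saturation (results clamp at 255 instead of wrapping). */
void mask_and_brighten(uint8_t *buf, size_t n, uint8_t mask, uint8_t add)
{
    const __m128i vmask = _mm_set1_epi8((char)mask);
    const __m128i vadd  = _mm_set1_epi8((char)add);
    size_t i = 0;

    /* Main loop: 16 bytes per iteration, unaligned access allowed. */
    for (; i + 16 <= n; i += 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)(buf + i));
        v = _mm_and_si128(v, vmask);
        v = _mm_adds_epu8(v, vadd);
        _mm_storeu_si128((__m128i *)(buf + i), v);
    }

    /* Scalar tail for the last n % 16 bytes. */
    for (; i < n; ++i) {
        unsigned t = (unsigned)(buf[i] & mask) + add;
        buf[i] = (uint8_t)(t > 255 ? 255 : t);
    }
}
```

Because the operation is byte-wise, it does not matter whether the pixels are 24-bit or 32-bit: the buffer can be processed as one long run of bytes.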

Drew Dormann

That's not actually the best suggestion: MMX is becoming obsolete, and SSE2 performs almost twice as fast as MMX. – May 10 '10 at 17:18

score 1 · Answer 2 · answered Dec 22 '09 at 15:52

I think your performance bottleneck could come from the cast to float, which is a rather expensive operation.

If I remember correctly, that cast costs about 50 clock cycles on most architectures. Even in the worst case, where the four FP multiplications take about 4 clocks each with no overlap in the pipeline (16 cycles in total), doing all of them in parallel in 1 cycle would save you 15 cycles at most, so there is still no gain.

I'd definitely stick with a single number format throughout (integer in this case); if the data is streamed with MMX as Shmoopty said, even better.
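One way to stay in integers for the weighted sum from the question is 8.8 fixed point: scale 0.299, 0.587 and 0.114 by 256 to get 77, 150 and 29, multiply, and shift right by 8. The sketch below uses SSE2 rather than MMX and handles four pixels per iteration. It assumes 32-bit pixels stored in memory as B, G, R, A bytes (an assumption about the webcam format, adjust the weight order if yours differs), and the function name bgra_to_gray is hypothetical.

```c
#include <emmintrin.h>  /* SSE2 integer intrinsics */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: gray = (29*B + 150*G + 77*R + 128) >> 8.
 * ASSUMPTION: pixels are 4 bytes each, in memory order B, G, R, A. */
void bgra_to_gray(const uint8_t *bgra, uint8_t *gray, size_t npixels)
{
    const __m128i zero = _mm_setzero_si128();
    /* 16-bit weights per pixel: B, G, R, A (alpha weight is 0). */
    const __m128i w    = _mm_setr_epi16(29, 150, 77, 0, 29, 150, 77, 0);
    const __m128i half = _mm_set1_epi32(128);  /* rounding term */
    size_t i = 0;

    for (; i + 4 <= npixels; i += 4) {
        __m128i px = _mm_loadu_si128((const __m128i *)(bgra + 4 * i));
        /* Widen bytes to 16 bits: pixels 0,1 in lo and pixels 2,3 in hi. */
        __m128i lo = _mm_unpacklo_epi8(px, zero);
        __m128i hi = _mm_unpackhi_epi8(px, zero);
        /* madd yields two 32-bit partial sums per pixel: 29B+150G and 77R. */
        __m128 mlo = _mm_castsi128_ps(_mm_madd_epi16(lo, w));
        __m128 mhi = _mm_castsi128_ps(_mm_madd_epi16(hi, w));
        /* Gather the first and second partial sums of all four pixels. */
        __m128i bg = _mm_castps_si128(_mm_shuffle_ps(mlo, mhi, _MM_SHUFFLE(2, 0, 2, 0)));
        __m128i r  = _mm_castps_si128(_mm_shuffle_ps(mlo, mhi, _MM_SHUFFLE(3, 1, 3, 1)));
        __m128i y  = _mm_srli_epi32(_mm_add_epi32(_mm_add_epi32(bg, r), half), 8);
        /* Narrow 32 -> 16 -> 8 bits; the four results end up in the low 4 bytes. */
        y = _mm_packs_epi32(y, y);
        y = _mm_packus_epi16(y, y);
        int32_t four = _mm_cvtsi128_si32(y);
        memcpy(gray + i, &four, 4);
    }

    /* Scalar tail for the last npixels % 4 pixels. */
    for (; i < npixels; ++i) {
        const uint8_t *p = bgra + 4 * i;
        gray[i] = (uint8_t)((29 * p[0] + 150 * p[1] + 77 * p[2] + 128) >> 8);
    }
}
```

The weights sum to 256, so a pure white pixel still maps to 255 and nothing overflows the 16-bit lanes used by the multiply.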

score 0 · Answer 3 · edited May 23 '17 at 11:48

First, the data you're copying from (I'm guessing it's pointed to by that void* pointer) should be memory-aligned for optimal performance; if it is not, copy it to a memory-aligned buffer.

Second, you can still use SSE2 once you've moved your data into a memory aligned buffer, it's quite easy - I used the code here without any issues with the intrinsics (but had problems with the assembly as detailed here).
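This is not the linked code, just a minimal sketch of the copy-then-process idea with intrinsics. It assumes _mm_malloc and _mm_free are available (GCC and Clang pull them in through the SSE headers; MSVC declares them in malloc.h). The helper names padded_size, copy_to_aligned and invert_aligned are hypothetical.

```c
#include <emmintrin.h>  /* SSE2 integer intrinsics */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Round n up to a non-zero multiple of 16 bytes. */
size_t padded_size(size_t n)
{
    size_t p = ((n + 15) / 16) * 16;
    return p ? p : 16;
}

/* Copy n bytes into a new 16-byte-aligned buffer, zero-padded to
 * padded_size(n). Returns NULL on failure; release with _mm_free. */
uint8_t *copy_to_aligned(const void *src, size_t n)
{
    size_t p = padded_size(n);
    uint8_t *dst = (uint8_t *)_mm_malloc(p, 16);
    if (dst) {
        memcpy(dst, src, n);
        memset(dst + n, 0, p - n);
    }
    return dst;
}

/* Example operation: invert every byte. `buf` must be 16-byte aligned
 * and `padded` a multiple of 16, so aligned loads and stores are safe. */
void invert_aligned(uint8_t *buf, size_t padded)
{
    const __m128i ones = _mm_set1_epi8((char)0xFF);
    for (size_t i = 0; i < padded; i += 16) {
        __m128i v = _mm_load_si128((const __m128i *)(buf + i));
        _mm_store_si128((__m128i *)(buf + i), _mm_xor_si128(v, ones));
    }
}
```

If the extra copy costs more than it saves, _mm_loadu_si128 and _mm_storeu_si128 can read and write the original buffer directly without any alignment requirement.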

Hope this is useful - I too worked with images and stored them as unsigned char in the main memory and copied them to the SSE2 registers (made sense since R,G, or B varied from 0-255) - but I used the assembly code since I felt it was easier.

But if you want to make it cross-platform, I suppose using the intrinsics would be cleaner.

Good luck!
