SSE2 code optimization to compress an image

Question

I want to optimize the following for loop with SSE/SSE2 instructions to speed up image compression.

size_t height = get_height();
size_t width = get_width();
size_t total_size = height * width * 3;
uint8_t *src = get_pixels();
uint8_t *dst = new uint8_t[total_size / 6];
uint8_t *tmp = dst;
rgb_t block[16];

if (height % 4 != 0 || width % 4 != 0) {
    cerr << "Texture compression only supported for images if width and height are multiples of 4" << endl;
    return;
}

// Split image in 4x4 pixels zones
for (unsigned y = 0; y < height; y += 4, src += width * 3 * 4) {
    for (unsigned x = 0; x < width; x += 4, dst += 8) {
        const rgb_t *row0 = reinterpret_cast<const rgb_t*>(src + x * 3);
        const rgb_t *row1 = row0 + width;
        const rgb_t *row2 = row1 + width;
        const rgb_t *row3 = row2 + width;

        // Extract 4x4 matrix of pixels from a linearized matrix(linear memory).
        memcpy(block, row0, 12);
        memcpy(block + 4, row1, 12);
        memcpy(block + 8, row2, 12);
        memcpy(block + 12, row3, 12);

        // Compress block and write result in dst.
        compress_block(block, dst);
    }
}

How can I read an entire row of the matrix from memory into SSE/SSE2 registers when a row has 4 elements of 3 bytes each? The rgb_t structure has 3 uint8_t members.

I would suggest **either** writing general-purpose C and letting the compiler optimize it for the target (-O3 or more with specific switches), as it will usually optimize better than human-written code, **or** explicitly coding in assembler for your target (not something I'd normally recommend). Use data structures and e.g. vector types to assist the compiler (by giving it information) but let the compiler do the dirty work of generating optimized code. Try to avoid making data copies (e.g. use pointers). And **reinterpret_cast** is not C. — StephenG - Help Ukraine, Dec 27 '16 at 17:52
@StephenG: neither is `cerr << ..`. As to "premature optimization", knowing what language you are actually programming in is kind of vital. — Jongware, Dec 27 '16 at 19:39
`row0` can be a regular `uint8_t *`, the same type as `src`. You are not using it as its own type, apart from the additions. You can move `row0` above both loops and only increment it inside. Also, as a smart compiler will notice, the temporary variables `row1` etc. are only used once. That said, you may want to inspect how this gets compiled before attempting further manual optimization. — Jongware, Dec 27 '16 at 19:45

score 1 · Answer 1 · answered Dec 27 '16 at 21:51

Why do you think the compiler doesn't already make good code for those 12-byte copies?

If it doesn't, copying 16 bytes instead of 12 for each of the first three rows (so that each copy overlaps the start of the next row's slot) will probably let it use SSE vectors. Padding your array lets you do the last copy with a 16-byte memcpy as well, which should also compile to a 16-byte vector load/store:

alignas(16) rgb_t buf[16 + 4];

Aligning probably doesn't matter much, since only the first store will be aligned anyway. But it might help the function you're passing the buffer to as well.
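
The same overlapping-copy idea can be written directly with SSE2 intrinsics. This is only a minimal sketch, not code from the question: `load_block_sse2` and its `stride` parameter are hypothetical names, and `rgb_t` is assumed to be three packed `uint8_t` members. Each 16-byte load reads 4 bytes more than the 12 that are needed, so this assumes at least 4 readable bytes after the last pixel of the image.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed layout: three packed bytes per pixel.
struct rgb_t { uint8_t r, g, b; };
static_assert(sizeof(rgb_t) == 3, "rgb_t must be exactly 3 bytes");

// Hypothetical helper: copies one 4x4 pixel block into `block`.
// `src` points at the top-left pixel of the block and `stride` is the
// number of bytes per image row (width * 3).
// `block` must have room for 52 bytes, e.g. rgb_t[16 + 4], because the
// last 16-byte store starts at byte offset 36.
inline void load_block_sse2(const uint8_t *src, size_t stride, rgb_t *block) {
    assert(src != nullptr && block != nullptr);
    uint8_t *out = reinterpret_cast<uint8_t *>(block);
    for (int row = 0; row < 4; ++row) {
        // Unaligned 16-byte load: 12 wanted bytes plus 4 extra bytes.
        __m128i v = _mm_loadu_si128(
            reinterpret_cast<const __m128i *>(src + row * stride));
        // Unaligned 16-byte store: the 4 extra bytes land in the next
        // row's slot and are overwritten by the next iteration.
        _mm_storeu_si128(reinterpret_cast<__m128i *>(out + row * 12), v);
    }
}
```

In the loop from the question, the four `memcpy` calls would become something like `load_block_sse2(src + x * 3, width * 3, buf);`. Whether this beats what the compiler already generates for the 12-byte copies is something to measure rather than assume.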
