
No matter what I do with {0,8,16,0} (a 16-bit shift-count vector, in the layout needed for copying into a big-endian 64-bit value), I am unable to properly bit-shift a test value of { 0x00, 0x01, (...) 0x07 }.
The result I get in the debugger is always 0x0.

I tried to convert the value in a couple of different ways, but I am unable to get this right.

Executed on a little-endian machine:

#include <mmintrin.h>
#include <stdint.h>

int main(int argc, char** argv) {
    __m64 input;
    __m64 vectors;
    __m64 output;

    _Alignas(8) uint16_t bit16Vectors[1*4] = {
        0x0000,0x0008,0x0010,0x0000
        // Intent: {0,8,16,0} 16 bit array
        // Convert for copy: {0,16,8,0} 64bit one item
        // 8bit data, Bytes need to rotate: {0,8,16,0}
    };
    _Alignas(8) uint8_t in[8] = {
        0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07
    };

    input = _m_from_int64(*((long long*)in) );
    vectors = _m_from_int64 (*((long long*)bit16Vectors));
    output = _mm_sll_pi16(input, vectors);
    __asm__("int3");
}

I wrote a simple MMX-only RGB24 plane-separation routine in pseudo-assembly (it processes 8x1 values), but I am unable to convert all the 16- and 32-bit shift vectors to "real world" code, or I am doing something wrong with the intrinsics.

I am unable to pin it down exactly; I just know it fails at the very first bit shift and returns a value of 0x0.

  • `_mm_sll_pi16` takes a single shift-count in the low (or only) 64 bits of the 2nd arg (so it's a huge number and shifts out all the bits, as documented: https://www.felixcloutier.com/x86/psllw:pslld:psllq). I think you want AVX512BW `_mm_sllv_epi16` to shift each 16-bit element by the corresponding 16-bit element. https://www.felixcloutier.com/x86/vpsllvw:vpsllvd:vpsllvq . Before AVX-512, there's AVX2 `_mm_sllv_epi32`, but before that all x86 SIMD shifts use the same count for all elements. You could use `pmullw` with power-of-2 multipliers, but that's lower throughput and higher latency. – Peter Cordes May 04 '22 at 09:49
  • Also, `*((long long*)in` is strict-aliasing UB. Use any load intrinsic that takes a pointer. BTW, what are you hoping to gain from MMX instead of SSE2? Is this for some retrocomputing thing where you care about Pentium 3 and Athlon XP? Or bare metal without enabling SSE, only the x87 unit? – Peter Cordes May 04 '22 at 09:50
  • Your shift counts(?) all seem to be multiples of 8, which makes sense for RGB24, so you likely just want to use byte shuffles like `punpcklbw` or, in general, SSSE3 `pshufb`. Or fixed shifts + and/andn/or blends. – Peter Cordes May 04 '22 at 09:59
  • What I wrote IIRC gets down to 45 shifts and 6 additions; I would probably use `pshufb` for simplicity's sake, but yeah, I do plan on running this on a P3; initially it was meant for a P2, hence the movq loads. – Nieważne Nieważne May 04 '22 at 10:22
  • They are all multiples of 8 because I am trying to shift bytes around. And it was generally way easier for me to imagine it as bit shifts than as punpckXXXX, because that adds to the writing+debugging complexity; with punpckXXXX I need to think about how to set up data in 2 variables at once. – Nieważne Nieważne May 04 '22 at 10:22
  • I don't have an AVX-512 enabled computer; probably the cheapest thing I could find would be an LGA 1700 i3/Pentium/Celeron. Not sure if Pentium/Celeron has AVX-512; ark.intel only lists AVX2 (256-bit) for all LGA 1700 CPUs. – Nieważne Nieważne May 04 '22 at 10:22
  • I'm not suggesting you buy a CPU with AVX-512, just telling you the instruction you thought you could use doesn't exist in MMX or SSE*. (And no, there aren't any Pentium/Celeron CPUs with AVX-512; they only support half the vector width of the corresponding generation i3/5/7/9. So for example, Skylake Pentium was crippled with only SSE4.2, no AVX2 or FMA. Ice Lake Pentium finally has AVX2.) – Peter Cordes May 04 '22 at 10:30
  • SSSE3 `pshufb` can emulate any per-element shift as long as the counts are multiples of 8, and it's a compile-time constant so you can hand-code which bytes need to get zeroed and which destination bytes come from which source bytes. But that's not available until Core 2 (Conroe/Merom from 2006), and not fast until 2nd-gen Core 2 (45nm Penryn / Wolfdale). – Peter Cordes May 04 '22 at 10:32
  • 45 shifts and 6 additions to do what? That sounds like a lot. If you mean to sort out one qword of two RGB24 pixels into separate planes? That should take at worst two `movd` 32-bit loads (or 1 qword and `psrldq` byte-shift or `psrlq` bit-shift), 1 `punpcklbw`, and 3 word-extracts like `pextrw`. But that's pretty inefficient, and another unpack step could get 4 consecutive red and 4 green bytes into dwords of an XMM or MMX reg. (But getting more work done by working with more pixels at once gets hairier when they're 3 bytes long, not a nice power of 2.) – Peter Cordes May 04 '22 at 10:37
  • Sometimes it helps to do an unaligned load that cuts the data you want in half, so one pixel is in the top 3 bytes of the low dword, another pixel is in the low 3 bytes of the high dword. Or the equivalent for a pair of pixels in the low qword of an XMM, with the low 2 bytes being leftover garbage. – Peter Cordes May 04 '22 at 10:39
  • Anyway, since the building-block you wanted isn't available (until AVX-512), you're going to want to think again. – Peter Cordes May 04 '22 at 10:39
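
A minimal sketch of the per-element 16-bit shift suggested in the first comment, assuming an AVX-512BW + AVX-512VL capable CPU and matching compiler flags (e.g. -mavx512bw -mavx512vl); the data and counts mirror the question's intent of shifting the four 16-bit words by 0, 8, 16 and 0 bits (a word shifted by 16 or more comes out as zero):

#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>

int main(void) {
    /* The question's eight input bytes 0x00..0x07, viewed as four
       little-endian 16-bit words in the low half of an XMM register. */
    __m128i data   = _mm_set_epi16(0, 0, 0, 0, 0x0706, 0x0504, 0x0302, 0x0100);
    /* Per-element shift counts: word 0 by 0, word 1 by 8, word 2 by 16, word 3 by 0. */
    __m128i counts = _mm_set_epi16(0, 0, 0, 0, 0, 16, 8, 0);

    /* AVX-512BW + AVX-512VL: each 16-bit element is shifted by its own count. */
    __m128i result = _mm_sllv_epi16(data, counts);

    _Alignas(16) uint16_t out[8];
    _mm_store_si128((__m128i *)out, result);
    printf("%04x %04x %04x %04x\n", out[0], out[1], out[2], out[3]);
    return 0;
}

For contrast, the question's `_mm_sll_pi16(input, vectors)` treats the whole second operand as a single 64-bit count; loaded from {0x0000, 0x0008, 0x0010, 0x0000} on a little-endian machine, that count is 0x0000001000080000, far greater than 15, so every 16-bit element is shifted out and the result is 0x0, which matches what the debugger shows.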

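A minimal sketch of a strict-aliasing-safe load, per the second comment; `load_m64` is a hypothetical helper name, and like the question's `_m_from_int64` it assumes a 64-bit build:

#include <mmintrin.h>
#include <stdint.h>
#include <string.h>

/* Load 8 bytes into an __m64 without strict-aliasing UB: memcpy into a
   64-bit integer first, then convert.  A fixed-size memcpy like this
   compiles down to a single 64-bit load. */
static __m64 load_m64(const void *p) {
    int64_t tmp;
    memcpy(&tmp, p, sizeof tmp);
    return _mm_cvtsi64_m64(tmp);   /* same conversion as _m_from_int64 */
}

With this, the question's loads become input = load_m64(in); and vectors = load_m64(bit16Vectors); with no pointer-cast dereference.
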
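A minimal sketch of the SSSE3 `pshufb` idea from the later comments, since all the shift counts are multiples of 8: a byte shuffle with a compile-time control vector moves each byte directly, and an index with the high bit set zeroes the destination byte. The function name is hypothetical, and the control bytes below reproduce only the {0, 8, 16, 0} word shifts from the question, not the actual RGB24 routine; note SSSE3 is not available on the P2/P3 targets mentioned above:

#include <tmmintrin.h>   /* SSSE3: _mm_shuffle_epi8 (pshufb) */

/* Emulate per-word left shifts of {0, 8, 16, 0} as a byte shuffle.
   Each control byte selects a source byte (0-15); a control byte with
   the high bit set (0x80) zeroes that destination byte. */
static __m128i shift_words_0_8_16_0(__m128i v) {
    const __m128i ctrl = _mm_set_epi8(
        (char)0x80, (char)0x80, (char)0x80, (char)0x80,  /* bytes 8-15: unused here, zeroed */
        (char)0x80, (char)0x80, (char)0x80, (char)0x80,
        7, 6,                    /* word 3: unchanged (shift by 0) */
        (char)0x80, (char)0x80,  /* word 2: shifted out entirely (shift by 16) */
        2, (char)0x80,           /* word 1: high byte = old low byte, low byte zeroed (shift by 8) */
        1, 0);                   /* word 0: unchanged (shift by 0) */
    return _mm_shuffle_epi8(v, ctrl);
}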