
I would like to know if the following is possible in any of the SIMD families of instructions.

I have a qword input with 63 significant bits (never negative). Each sequential group of 7 bits, starting from the LSB, is shuffle-aligned to its own byte, with a left-padding bit of 1 (except for the most significant non-zero byte). To illustrate, I'll use letters for clarity's sake.

The result is only the significant bytes, thus 0–9 in size, converted to a byte array.

In:         0|kjihgfe|dcbaZYX|WVUTSRQ|PONMLKJ|IHGFEDC|BAzyxwv|utsrqpo|nmlkjih|gfedcba
Out: 0kjihgfe|1dcbaZYX|1WVUTSRQ|1PONMLKJ|1IHGFEDC|1BAzyxwv|1utsrqpo|1nmlkjih|1gfedcba

Size = 9

In:  00|nmlkjih|gfedcba
Out: 0nmlkjih|1gfedcba

Size = 2

I do understand the padding is separate. The shuffle-aligning is my question. Is this possible?
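To pin the spec down, here is a minimal scalar C sketch of the transform being asked about (`expand7_ref` is a name made up here; this is just the specification, not an optimized implementation):

```c
#include <stdint.h>
#include <stddef.h>

/* Reference semantics: peel 7 bits at a time off the LSB end into
 * successive output bytes, setting the top (padding) bit on every
 * byte except the most significant one. */
static size_t expand7_ref(uint64_t in, uint8_t out[9])
{
    size_t n = 0;
    do {
        out[n++] = (uint8_t)((in & 0x7f) | 0x80);
        in >>= 7;
    } while (in != 0);
    out[n - 1] &= 0x7f;   /* most significant byte gets no padding bit */
    return n;             /* 1..9 bytes; a full 63-bit input yields 9 */
}
```

(A zero input yields one zero byte here; handling of zero as a special case is a separate question.)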

EDIT 2

Here is my updated code. It gets a sustained 46 M decodes/sec for random-length input on a single thread of a 2 GHz Core 2 Duo, 64-bit.

private static int DecodeIS8(long j, ref byte[] result)
{
    if (j <= 0)
    {
        return 0;
    }

    int size;

    // neater code: gives something to break out of
    while (true)
    {
        result[0] = (byte)((j & 0x7F) | 0x80);
        size = 0;
        j >>= 7;

        if (j == 0) break;

        result[1] = (byte)((j & 0x7F) | 0x80);
        size++;
        j >>= 7;

        if (j == 0) break;

        result[2] = (byte)((j & 0x7F) | 0x80);
        size++;
        j >>= 7;

        if (j == 0) break;

        result[3] = (byte)((j & 0x7F) | 0x80);
        size++;
        j >>= 7;

        if (j == 0) break;

        result[4] = (byte)((j & 0x7F) | 0x80);
        size++;
        j >>= 7;

        if (j == 0) break;

        result[5] = (byte)((j & 0x7F) | 0x80);
        size++;
        j >>= 7;

        if (j == 0) break;

        result[6] = (byte)((j & 0x7F) | 0x80);
        size++;
        j >>= 7;

        if (j == 0) break;

        result[7] = (byte)((j & 0x7F) | 0x80);
        size++;
        j >>= 7;

        if (j == 0) break;

        result[8] = (byte)j;

        return 9;
    }

    result[size] ^= 0x80;

    return size + 1;
}
Peter Cordes
IamIC
  • It's certainly doable, but it's going to be ugly. You'll need a bunch of shift and mask operations. Are you 100% certain that this is a performance bottleneck ? – Paul R Jun 01 '12 at 12:22
  • It's a research question. It is code that needs to be as fast as possible because a lot of other code calls it. In other words, it is very, very core. The question is, will it be faster? – IamIC Jun 01 '12 at 12:27
  • The only way to be sure is to code up a scalar version and a SIMD version and benchmark them. If you're not doing other SIMD operations in conjunction with this unpacking though then I suspect you won't gain much. – Paul R Jun 01 '12 at 12:31
  • What CPU and compiler are you using ? You might find that a decent compiler (e.g. Intel's ICC or even just gcc) will give you more improvement than going to SIMD for this. – Paul R Jun 01 '12 at 12:33
  • I am going to have to test it. SIMD is very new to me, which is why I haven't already tried. This task *looks* like something one would use SIMD for, so I posted the question here to get the opinion of those who are experienced in this area. – IamIC Jun 01 '12 at 12:40
  • CPU = i7, GCC (for now). I'm not sure the compiler is going to do much with the code I posted, though. – IamIC Jun 01 '12 at 12:43
  • This page (http://software.intel.com/en-us/blogs/2011/06/13/haswell-new-instruction-descriptions-now-available/) indicates "many-to-many permutes", but doesn't say much about it. – IamIC Jun 01 '12 at 12:46
  • You didn't mention whether you're doing other SIMD operations on this data. If you're just loading the packed data from memory and storing the unpacked data back to memory then it's unlikely that SIMD will help. I would start by optimising the existing scalar code as much as possible though, before considering SIMD. – Paul R Jun 01 '12 at 13:28
  • @PaulR, there are no other SIMD operations on the data. This is purely a scalar function. I think I've got it as optimized as I can. I'm getting about 41 M decodes / sec of purely random length inputs. When the code was inline, it was 55 M. – IamIC Jun 01 '12 at 13:47
  • That seems a little slow - around 50 clocks per decode - I would expect that the scalar code could be optimised to run considerably faster than that. – Paul R Jun 01 '12 at 13:57
  • I spent 1.5 h writing it out as a hyper-complex binary operation. It's dead slow by comparison. 50 clocks isn't that bad if you consider how many actual operations are taking place. – IamIC Jun 01 '12 at 15:40
  • Consider too that it is having to shuffle the result into the byte array. That's probably taking up half the time. – IamIC Jun 01 '12 at 15:44
  • Suggestion: I would take the code in your question above as a starting point, get rid of all the early return stuff and conditionals (i.e. make it branchless) - process 9 output bytes every time without any branches and then just write out the correct number of bytes at the end. – Paul R Jun 01 '12 at 16:14
  • Paul, thanks for the suggestion, which I will try, but I can guarantee that will be slower for the same reason that the massive bit-shuffler is slower: the branching allows the CPU to only process the needed number of bytes. I'll let you know how it turns out. Thanks for your help :) – IamIC Jun 01 '12 at 17:38
  • For purely random length inputs (1 - 8 significant bytes), your suggestion is only slightly slower than the original. For large numbers, it would be faster, and for small numbers, it would be slower. In C, it would speed up because of the bit-scan-left command to get the MSB. In this case, it could be the winner. – IamIC Jun 01 '12 at 18:00
  • OK - I was assuming that the cost of branch mispredictions when the input length is random would outweigh the additional processing, but evidently not. – Paul R Jun 01 '12 at 18:16
  • I took your suggestion and implemented it in a way that did speed things up. I posted the update. – IamIC Jun 01 '12 at 18:33
  • Looks like around 50 arithmetic/logical instructions so that's consistent with the throughput you're getting. – Paul R Jun 01 '12 at 19:11
  • Considering the overhead of the method call, I'd say yes. I don't think I'm going to squeeze much more out of this... unless AVX2 has a solution. Apparently it just might. Thanks Paul :) – IamIC Jun 01 '12 at 19:15
  • Pity you're not doing this on POWER/PowerPC - AltiVec has 128 bit shifts. But I think it might still be doable with SSE in fewer instructions than with your scalar code - good luck, anyway! – Paul R Jun 01 '12 at 22:20

1 Answer


Yes, it's possible to use MMX/SSE's pmullw instruction (intrinsic function: _mm_mullo_pi16) to do per-element shifts.

The basic idea is to extract alternating 7-bit elements with an AND instruction and perform the pmullw to shift the elements into place. This will accomplish the task for half of the elements, so the process will need to be repeated with a couple of extra shifts.

#include <stdio.h>
#include <stdint.h>
#include <mmintrin.h>

__m64 f(__m64 input) {
    static const __m64 mask = (__m64) 0x00fe03f80fe03f80UL;
    static const __m64 multiplier = (__m64) 0x0080002000080002UL;

    __m64 t0 = _mm_and_si64 (input, mask);
    __m64 t1 = _mm_and_si64 (_mm_srli_si64 (input, 7), mask);

    t0 = _mm_mullo_pi16 (t0, multiplier);
    t1 = _mm_mullo_pi16 (t1, multiplier);

    __m64 res = _mm_or_si64 (t0, _mm_slli_si64 (t1, 8));
    /* pass element 0 (input bits 0-6) through to byte 0; neither t0
       nor t1 carries it */
    res = _mm_or_si64 (res, _mm_and_si64 (input, (__m64) 0x7fUL));
    /* set most significant bits, except for in most significant byte */
    return _mm_or_si64 (res, (__m64) 0x0080808080808080UL);
}

int main(int argc, char *argv[])
{
    int i;
    typedef union {
            __m64 m64;
            unsigned char _8x8[8];
    } type_t;

    /* 0x7f7e7c7870608080 packs the 7-bit elements {127, 63, 31, 15, 7, 3, 2, 1, 0}, most significant first */
    type_t res0 = { .m64 = f((__m64) 0x7f7e7c7870608080UL) };

    for (i = 0; i < 8; i++) {
            printf("%3u ", res0._8x8[i]);
    }
    puts("");

    return 0;
}

The mask extracts alternating 7-bit elements. The multiplier is a constant which allows us to specify per-element shifts. It's derived from looking at the masked input:

00000000|dcbaZYX0|000000PO|NMLKJ000|0000BAzy|xwv00000|00nmlkji|h0000000

and realizing that

00000000|dcbaZYX0 needs to be shifted by 7 (or multiplied by 2^7, 128, 0x0080)
000000PO|NMLKJ000 needs to be shifted by 5 (or multiplied by 2^5,  32, 0x0020)
0000BAzy|xwv00000 needs to be shifted by 3 (or multiplied by 2^3,   8, 0x0008)
00nmlkji|h0000000 needs to be shifted by 1 (or multiplied by 2^1,   2, 0x0002)

This function writes 8 bytes at a time (instead of the 9 bytes your nine 7-bit elements would unpack to), so you'll have to advance the source pointer by only 7 bytes after each iteration. Because of this, a conversion to SSE2 is a bit more complicated.
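To make the 7-byte advance concrete, here is a hedged sketch of the surrounding copy loop. `unpack8` is a scalar stand-in for the SIMD unpacking step (it expands all eight low 7-bit elements and uses the same padding constant), so the addressing works portably; the names and the one-byte slack requirement are assumptions, and a little-endian host is assumed:

```c
#include <stdint.h>
#include <string.h>
#include <stddef.h>

/* Scalar stand-in for the SIMD unpack: expand the low eight 7-bit
 * elements of x into 8 bytes, then set the padding bit on bytes 0-6
 * (byte 7 is left clean, matching the 0x0080808080808080UL constant). */
static uint64_t unpack8(uint64_t x)
{
    uint64_t r = 0;
    for (int i = 0; i < 8; i++)
        r |= ((x >> (7 * i)) & UINT64_C(0x7f)) << (8 * i);
    return r | UINT64_C(0x0080808080808080);
}

/* Each iteration loads 8 source bytes but only consumes 7 (56 bits =
 * eight 7-bit elements), so the source buffer needs one byte of
 * slack past the last block. */
static void unpack_stream(const uint8_t *src, uint8_t *dst, size_t n_blocks)
{
    for (size_t i = 0; i < n_blocks; i++) {
        uint64_t x, y;
        memcpy(&x, src, sizeof x);   /* unaligned load (little-endian) */
        y = unpack8(x);
        memcpy(dst, &y, sizeof y);
        src += 7;                    /* 7 packed bytes in ... */
        dst += 8;                    /* ... 8 unpacked bytes out */
    }
}
```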

I don't think it's possible to use a different mask and multiplier for t1 in order to avoid the shifts, since t1's elements would cross 16-bit boundaries, which would prevent pmullw from working. But it still might be possible to optimize somehow.

I haven't benchmarked this, but I suspect it's significantly faster than your scalar version. If you benchmark it, please post the results. I'd be very interested to see them.

All in all, the algorithm comes out to 2 shifts, 2 ORs, 2 ANDs, and 2 multiplies (plus a few moves) to generate 8 bytes.
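As a later footnote (on hardware that didn't exist when this was written): BMI2's pdep does this exact bit expansion in a single instruction. A hedged sketch, guarded so it falls back to a portable scalar loop when BMI2 isn't available at compile time (`expand7` is a name made up here):

```c
#include <stdint.h>

/* Portable model of the 7-bit -> byte expansion (no padding bits). */
static uint64_t expand7_scalar(uint64_t x)
{
    uint64_t r = 0;
    for (int i = 0; i < 8; i++)
        r |= ((x >> (7 * i)) & UINT64_C(0x7f)) << (8 * i);
    return r;
}

#if defined(__BMI2__)
#include <immintrin.h>
/* _pdep_u64 deposits the low bits of x into the set-bit positions of
 * the mask, so each 7-bit group lands in the low 7 bits of its own
 * byte -- exactly the shuffle-alignment asked about. */
static uint64_t expand7(uint64_t x)
{
    return _pdep_u64(x, UINT64_C(0x7f7f7f7f7f7f7f7f));
}
#else
#define expand7 expand7_scalar
#endif
```

The padding bits would still be OR'd in afterwards, as in the MMX version. Note that pdep is fast on Intel from Haswell onward but was microcoded (very slow) on AMD before Zen 3.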

mattst88
  • Very nice! I knew it was possible. Thank you. I believe a bsl is needed in order to detect the MSB, so as to correctly set the high bits for each byte. – IamIC Jun 12 '12 at 10:19
  • Can you save an insn by using a `PANDN` to invert the mask, instead of shifting the data to line up with the mask, and then back again (by 1 bit more)? Or maybe just by having two different masks and two different multipliers. (Maybe only worth it if the loop is very hot, and / or runs on large buffers.) Very nice idea to comb the input into two halves, allowing `mul` to shift things without stepping on neighbouring elements. – Peter Cordes Jul 04 '15 at 14:23