
I have an array of shorts where I want to grab half of the values and put them in a new array that is half the size. I want to grab particular values in this sort of pattern, where each block is 128 bits (8 shorts). This is the only pattern I will use; it doesn't need to handle "any generic pattern"!

The values in white are discarded. My array sizes will always be a power of 2. Here's the vague idea of it, unvectorized:

unsigned short size = 1 << 8;
unsigned short* data = new unsigned short[size];

...

unsigned short* newdata = new unsigned short[size >>= 1]; // note: this also halves size, which the loop below relies on

unsigned int* uintdata = (unsigned int*) data;
unsigned int* uintnewdata = (unsigned int*) newdata;

// each output uint keeps the high short of uintdata[i * 2] and the low short of uintdata[(i * 2) + 1]
for (unsigned short uintsize = size >> 1, i = 0; i < uintsize; ++i)
{
 uintnewdata[i] = (uintdata[i * 2] & 0xFFFF0000) | (uintdata[(i * 2) + 1] & 0x0000FFFF);
}

I started out with something like this:

#include <emmintrin.h> // SSE2 intrinsics

static const __m128i startmask128 = _mm_setr_epi32(0xFFFF0000, 0x00000000, 0xFFFF0000, 0x00000000);
static const __m128i endmask128 = _mm_setr_epi32(0x00000000, 0x0000FFFF, 0x00000000, 0x0000FFFF);

__m128i* data128 = (__m128i*) data;
__m128i* newdata128 = (__m128i*) newdata;

With these, I can iteratively perform _mm_and_si128 with the masks to get the values I'm looking for, combine the results with _mm_or_si128, and put them in newdata128[i]. However, I don't know how to "compress" things together and remove the values in white. And it seems that if I could do that, I wouldn't need the masks at all.
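
Concretely, that stage would look roughly like this for a single 128-bit block (unaligned load, since new makes no 16-byte alignment promise):

__m128i block = _mm_loadu_si128(&data128[i]);
__m128i starts = _mm_and_si128(block, startmask128); // high short of 32-bit lanes 0 and 2
__m128i ends = _mm_and_si128(block, endmask128);     // low short of 32-bit lanes 1 and 3
__m128i kept = _mm_or_si128(starts, ends);           // the shorts I want, but still sitting in their original slots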

How can that compression be done?

Eventually I will also want to do the opposite of this operation: create a new array of twice the size and spread the current values out within it.

I will also have new values to insert in the white blocks, which I would have to compute iteratively from each pair of shorts in the original data. That computation won't be vectorizable, but the insertion of the resulting values should be. How could I "spread out" my current values into the new array, and what would be the best way to insert the computed values? Should I compute them all for each 128-bit iteration, put them into their own temporary block (64-bit? 128-bit?), and then insert them in bulk? Or should they be placed directly into my target __m128i, since the cost seems like it should be equivalent to writing to a temporary? If so, how could that be done without messing up my other values?

I would prefer to use SSE2 operations at most for this.

user173342
    Pretty graphics. So are you trying to figure out how to do this, or how to optimize it? – Beta Jan 07 '13 at 16:34
  • @Beta Well, I can do it without vectorization easily enough. So I want to find out how to do it with a minimal set of vectorized commands, which I guess means I want to optimize it. – user173342 Jan 07 '13 at 16:36
  • Doing this without vectorization will likely be faster (needless to say, it's a million times easier to code). You would need to do considerable shuffling, and shuffle patterns are hardcoded (immediate operands) in SSE2. So unless you always have only the same patterns, you need to write self-modifying code or branch (both are detrimental to performance). Or you need to use SSE3, but even then it's probably faster to just write plain normal C++ code. – Damon Jan 07 '13 at 16:56
  • @Damon It's always the exact same pattern. I have no problem with creating a vectorized implementation for this, and would like to do it just to see how it's done. – user173342 Jan 07 '13 at 16:59
  • Well, you'd be using a combination of `_mm_shuffle_epi32`, `_mm_shufflelo_epi16`, and `_mm_shufflehi_epi16` until you either have the full pattern, or until you have some partial patterns that you mix together with `_mm_or_si128` at the end. There is no full `epi16` shuffle operation in SSE2 if I remember correctly (only `hi`/`lo` versions, and the MMX `pi16` version). Really, just write C code. – Damon Jan 07 '13 at 17:08
  • Twiddling with the data layout is a very poor use of SSE. Depending on the size of the actual data, you'll probably hit memory bandwidth limitations before the data twiddling becomes a bottleneck. – doug65536 Jan 07 '13 at 19:05
  • @doug65536: It really depends. With SSE you want to keep processing in the SSE registers to get good throughput and often it means some twiddling. The cost of going out of SSE, twiddling, going back to SSE, is huge and many calculations (e.g. wavelets) require twiddling. – Guy Sirton Jan 07 '13 at 19:25
  • The profiler shows the data-twiddle code path being the bottleneck? If not you're wasting your time trying to optimize this. – doug65536 Jan 07 '13 at 20:03

1 Answer


Here's an outline you can try:

  • Use the interleave instruction (_mm_unpackhi/lo_epi16) with a register containing zero to "spread out" your 16-bit values. Now you'll have two registers looking like B_R_B_R_.
  • Shift right creating _B_R_B_R
  • AND the R's out of the first version B___B___
  • AND the B's out of the second version ___R___R
  • OR together B__RB__R (a rough sketch of these steps follows below)
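
A minimal sketch of those steps, taking the B/R placement literally as written above; the exact masks and shift amount depend on which slots are white in your diagram, and expand_block is just an illustrative name:

#include <emmintrin.h>

// spreads 8 packed shorts (one __m128i) into two __m128i, leaving zeroed
// slots where the separately computed values can be OR'd in later
static inline void expand_block(__m128i in, __m128i* out_lo, __m128i* out_hi)
{
    const __m128i zero  = _mm_setzero_si128();
    const __m128i bmask = _mm_setr_epi16(-1, 0, 0, 0, -1, 0, 0, 0); // keep the "B" slots
    const __m128i rmask = _mm_setr_epi16(0, 0, 0, -1, 0, 0, 0, -1); // keep the "R" slots

    __m128i lo = _mm_unpacklo_epi16(in, zero); // B_R_B_R_
    __m128i hi = _mm_unpackhi_epi16(in, zero); // B_R_B_R_ (upper half of the input)

    __m128i lo_s = _mm_slli_si128(lo, 2); // _B_R_B_R (copy shifted by one short)
    __m128i hi_s = _mm_slli_si128(hi, 2);

    *out_lo = _mm_or_si128(_mm_and_si128(lo, bmask),    // B___B___
                           _mm_and_si128(lo_s, rmask)); // ___R___R  =>  B__RB__R
    *out_hi = _mm_or_si128(_mm_and_si128(hi, bmask),
                           _mm_and_si128(hi_s, rmask));
}

The zeroed slots are where your separately computed values can go: one option is to build each group of them in a temporary register (for example with _mm_set_epi16) and OR it in before storing.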

In the other direction (compressing back down), use _mm_packs_epi32 at the end, after setting it up with shift/and/or.
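
For example, here is a sketch of that direction which keeps the same elements as your scalar loop (per group of four shorts it emits the third, then the second); compress_pair and the lane masks are illustrative names, and a tighter instruction sequence is probably possible:

#include <emmintrin.h>

// 16 input shorts (two registers) -> the 8 shorts that are kept (one register)
static inline __m128i compress_pair(__m128i in0, __m128i in1)
{
    const __m128i lanes02 = _mm_setr_epi32(-1, 0, -1, 0); // keep 32-bit lanes 0 and 2
    const __m128i lanes13 = _mm_setr_epi32(0, -1, 0, -1); // keep 32-bit lanes 1 and 3

    __m128i in[2] = { in0, in1 };
    __m128i keep[2];
    for (int k = 0; k < 2; ++k)
    {
        // sign-extend the odd shorts (s1 s3 s5 s7) and the even shorts (s0 s2 s4 s6) into 32-bit lanes
        __m128i odd  = _mm_srai_epi32(in[k], 16);
        __m128i even = _mm_srai_epi32(_mm_slli_epi32(in[k], 16), 16);
        // move the shorts we keep into output order: s2, s1, s6, s5
        __m128i a = _mm_shuffle_epi32(even, _MM_SHUFFLE(3, 3, 1, 1)); // s2 s2 s6 s6
        __m128i b = _mm_shuffle_epi32(odd,  _MM_SHUFFLE(2, 2, 0, 0)); // s1 s1 s5 s5
        keep[k] = _mm_or_si128(_mm_and_si128(a, lanes02),
                               _mm_and_si128(b, lanes13)); // s2 s1 s6 s5
    }
    // every lane now holds a sign-extended 16-bit value, so the saturating
    // pack just narrows each one back to its original bit pattern
    return _mm_packs_epi32(keep[0], keep[1]);
}

Over the whole array (assuming 16-byte aligned buffers) that would be newdata128[i] = compress_pair(data128[2 * i], data128[2 * i + 1]); for each output register i.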

Each direction should be about 10 SSE instructions, not counting the constant setup (the zero register and the AND masks) and the loads/stores.

Guy Sirton