On PowerPC, is there any equivalent of Intel's movemask intrinsics?

Question

I'd like to merge all elements of a __vector bool long long into a single int, in which each bit is set to the most significant bit of the corresponding element of the input vector.

Example:

__vector bool long long vcmp = vec_cmplt(a, b);
int packedmask = /*SOME FUNCTION GOES HERE*/ (vcmp);

with

packedmask = x|y|0000000000000000....

where x equals 1 if vcmp[0] = 0xFFFFF... or 0 if vcmp[0] = 0; likewise y for vcmp[1].

On Intel, we can achieve this with the _mm_movemask_* intrinsics (for example _mm_movemask_pd).
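For reference, the behaviour being asked for can be sketched in portable scalar C. This is only a model, not PowerPC-specific code: the function name movemask2x64 is made up for illustration, and it follows the bit order of _mm_movemask_pd (element i goes to bit i), which may differ from the x|y|000... layout written above.

```c
#include <stdint.h>

/* Scalar model of the requested operation for two 64-bit elements:
 * bit i of the result is the most significant bit of element i
 * (the bit order used by Intel's _mm_movemask_pd). */
int movemask2x64(const uint64_t lanes[2])
{
    int mask = 0;
    for (int i = 0; i < 2; ++i)
        mask |= (int)(lanes[i] >> 63) << i;   /* MSB of element i -> bit i */
    return mask;
}
```

A comparison result of {all ones, zero} would give 1 under this convention, and {zero, all ones} would give 2.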

Is there any way to do the same on PowerPC?

Thank you for any help

ErmIg · Answer 1 · 2015-11-26T13:24:07.420

You can try something like this:

#include <altivec.h>
#include <stdint.h>

typedef __vector uint8_t v128_u8;
typedef __vector uint32_t v128_u32;

const v128_u8 KS = {1, 2, 4, 8, 16, 32, 64, 128, 1, 2, 4, 8, 16, 32, 64, 128}; // per-byte bit weights
const v128_u32 K0 = {0, 0, 0, 0}; // zero accumulator for vec_msum (must be a 32-bit vector)
const v128_u8 K1 = {1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1};
//const v128_u8 KP = {0, 8, 4, 12, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0};//little endian
const v128_u8 KP = {3, 11, 7, 15, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0};//big-endian

union Tmp
{
     uint32_t u32;
     uint16_t u16[2];
};

uint16_t vec_movemask(v128_u8 value)
{
    union Tmp tmp;
    // keep one weight per set byte, then sum the weights of each 4-byte group
    v128_u32 sums = vec_msum(vec_and(value, KS), K1, K0);
    // gather the low byte of each sum into word 0 (vec_perm takes two source vectors)
    tmp.u32 = vec_extract(vec_perm(sums, sums, KP), 0);
    return tmp.u16[0] + tmp.u16[1];
}

Detailed:

value:
{0x00, 0xff, 0x00, 0x00, 0xff, 0xff, 0x00, 0xff, 0x00, 0x00, 0xff, 0xff , 0x00, 0xff, 0x00, 0xff};
vec_and(value, KS):
{0x00, 0x02, 0x00, 0x00, 0x10, 0x20, 0x00, 0x80, 0x00, 0x00, 0x04, 0x08 , 0x00, 0x20, 0x00, 0x80};
vec_msum(vec_and(value, KS), K1, K0):
{0x00, 0x00, 0x00, 0x02, 0x00, 0x00, 0x00, 0xB0, 0x00, 0x00, 0x00, 0x0C , 0x00, 0x00, 0x00, 0xA0};
vec_perm(..., KP):
{0x02, 0x0C, 0xB0, 0xA0, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 , 0x00, 0x00, 0x00, 0x00};
vec_extract(vec_perm(..., KP), 0):
{0x02, 0x0C, 0xB0, 0xA0}
tmp.u16[0] + tmp.u16[1]:
{0xB2, 0xAC}
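The trace above can be checked off-target with a plain scalar model of the same steps. This is a sketch only: the name movemask_model is hypothetical, and it assumes the big-endian KP constant, so the four partial sums are picked in the order 0, 2, 1, 3.

```c
#include <stdint.h>

/* Scalar model of the and / msum / perm / add sequence (big-endian layout). */
uint16_t movemask_model(const uint8_t value[16])
{
    const uint8_t KS[16] = {1, 2, 4, 8, 16, 32, 64, 128,
                            1, 2, 4, 8, 16, 32, 64, 128};
    uint32_t sum[4] = {0, 0, 0, 0};

    /* vec_and + vec_msum: keep one weight per set byte, sum each 4-byte group */
    for (int i = 0; i < 16; ++i)
        sum[i / 4] += (uint32_t)(value[i] & KS[i]);

    /* vec_perm with KP = {3, 11, 7, 15}: the low bytes of sum[0], sum[2],
     * sum[1], sum[3] become the four bytes of one 32-bit word */
    uint16_t first  = (uint16_t)((sum[0] << 8) | sum[2]);  /* tmp.u16[0] */
    uint16_t second = (uint16_t)((sum[1] << 8) | sum[3]);  /* tmp.u16[1] */

    /* the two halves never share a set bit, so the addition acts as an OR */
    return (uint16_t)(first + second);
}
```

Fed the sample value from the trace, this model returns 0xB2AC, matching the last line above.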

Whoa, thanks. But I guess I'll stay with the scalar code if there isn't a way involving fewer instructions/constants. — Regis Portalez, Nov 26 '15 at 13:25
Power7/8 has 64 vector registers. Constant vectors will stay in registers if they are used often. — ErmIg, Nov 26 '15 at 13:32

Jeremy Kerr · Accepted Answer · 2015-11-27T04:26:24.550

Sounds like the vbpermq instruction (and the vec_vbpermq() intrinsic) would be appropriate here. Given a vector of unsigned char "indices" (i.e., 0 - 127), it uses each index to select one bit of the source vector for the output. If an index is 128 or greater, a zero bit is used instead.

The 16 resulting bits are zero-extended to form a 64-bit value in the first doubleword of the result vector.
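To make the bit selection concrete, here is a scalar model of that behaviour in portable C. It is a sketch under assumptions, not the intrinsic itself: the name vbpermq_model is hypothetical, and it uses the Power ISA's big-endian numbering (bit 0 is the most significant bit of byte 0, and the bit selected by index byte 0 ends up as the most significant of the 16 result bits).

```c
#include <stdint.h>

/* Scalar model of vbpermq: returns the 16 selected bits, zero-extended,
 * i.e. the contents of the first doubleword of the result vector. */
uint64_t vbpermq_model(const uint8_t src[16], const uint8_t idx[16])
{
    uint64_t out = 0;
    for (int i = 0; i < 16; ++i) {
        unsigned bit = 0;                        /* index >= 128 selects a zero */
        if (idx[i] < 128) {
            uint8_t byte = src[idx[i] / 8];      /* byte that holds the bit */
            bit = (byte >> (7 - idx[i] % 8)) & 1u;
        }
        out = (out << 1) | bit;                  /* result bit i -> position 15 - i */
    }
    return out;
}
```

With indices {0, 64, 128, 128, ...} and a source whose first 8 bytes are 0xFF and last 8 bytes are 0, this model yields 0x8000; with all 16 bytes set to 0xFF it yields 0xC000.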

Something like this could work:

/*
 * our permutation indices: the MSbit from the first bool long long,
 * then the MSbit from the second bool long long, then the rest as
 * >=128 (which gives a zero bit in the result vector)
 */
vector unsigned char perm = { 0, 64, 128, 128, 128, /*...*/};

/* compare the two-item vector into two bools */
vcmp = (vector unsigned char)vec_cmplt(a, b);

/* select a bit from each of the result bools */
result = vec_vbpermq(vcmp, perm);

Getting the int out of the result vector will depend on what you want to do with it. If you need that as is, a vec_extract(result, 0) might work, but since you're only interested in the top two bits of the result, you may be able to simplify the perm constant, and/or shift the result as appropriate.
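As an illustration of the shift mentioned above: assuming the extracted doubleword holds the 16 selected bits zero-extended, with the first element's bit at position 15 and the second's at position 14 (the Power ISA's big-endian ordering), a shift and mask reduces it to a 2-bit value. The helper name two_lane_mask is hypothetical.

```c
#include <stdint.h>

/* Hypothetical helper: dword0 is the doubleword holding the 16 selected bits.
 * Assumes element 0's bit is at position 15 and element 1's at position 14. */
unsigned two_lane_mask(uint64_t dword0)
{
    return (unsigned)(dword0 >> 14) & 3u;   /* (x << 1) | y */
}
```

Under that assumption, 0x8000 maps to 2, 0x4000 to 1 and 0xC000 to 3; if the opposite bit order is wanted, the two index bytes in the perm constant could be swapped instead.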

Also, be aware of endian considerations of your result.

vbpermq is described in section 5.15 of the PowerISA.
