I'd like to merge all elements of a __vector bool long long into a single int, in which each bit is set to the most significant bit of the corresponding element of the input vector.

example:

__vector bool long long vcmp = vec_cmplt(a, b);
int packedmask = /*SOME FUNCTION GOES HERE*/ (vcmp);

with

packedmask = x|y|0000000000000000....

where x equals 1 if vcmp[0] = 0xFFFFF... or 0 if vcmp[0] = 0; same for y with vcmp[1].

On Intel, we can achieve this with the _mm_movemask family of intrinsics.
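For reference, the behaviour being asked for can be written as a scalar sketch. The function name `movemask_2x64` is hypothetical, and the bit order (element 0 in bit 0, element 1 in bit 1) is an assumption borrowed from the movemask convention; the `x|y|000...` layout above may well intend the opposite order.

```c
#include <stdint.h>

/* Hypothetical scalar reference: pack the most significant bit of each
 * 64-bit element into the low bits of an int.
 * Assumed bit order: element 0 -> bit 0, element 1 -> bit 1. */
static int movemask_2x64(uint64_t lane0, uint64_t lane1)
{
    int mask = 0;
    mask |= (int)(lane0 >> 63);        /* MSB of element 0 */
    mask |= (int)(lane1 >> 63) << 1;   /* MSB of element 1 */
    return mask;
}
```

With comparison results that are either all ones or all zeros, `movemask_2x64(UINT64_MAX, 0)` yields 1 and `movemask_2x64(UINT64_MAX, UINT64_MAX)` yields 3.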

Is there any way to do the same on PowerPC?

Thank you for any help

Regis Portalez

2 Answers

You can try something like this:

#include <altivec.h>
#include <stdint.h>

typedef __vector uint8_t v128_u8;
typedef __vector uint32_t v128_u32;

const v128_u8 KS = {1, 2, 4, 8, 16, 32, 64, 128, 1, 2, 4, 8, 16, 32, 64, 128};
const v128_u32 K0 = {0, 0, 0, 0};//vec_msum accumulates into 32-bit elements
const v128_u8 K1 = {1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1};
//const v128_u8 KP = {0, 8, 4, 12, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0};//little endian
const v128_u8 KP = {3, 11, 7, 15, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0};//big-endian

union Tmp
{
     uint32_t u32;
     uint16_t u16[2];
};

uint16_t vec_movemask(v128_u8 value)
{
    union Tmp tmp;
    //keep one weighted bit per byte, then sum each group of 4 bytes into a 32-bit element
    const v128_u32 sum = vec_msum(vec_and(value, KS), K1, K0);
    //gather the low byte of each 32-bit element into element 0 (vec_perm takes two sources)
    tmp.u32 = vec_extract(vec_perm(sum, sum, KP), 0);
    return tmp.u16[0] + tmp.u16[1];
}

Detailed:

value:
{0x00, 0xff, 0x00, 0x00, 0xff, 0xff, 0x00, 0xff, 0x00, 0x00, 0xff, 0xff, 0x00, 0xff, 0x00, 0xff};
vec_and(value, KS):
{0x00, 0x02, 0x00, 0x00, 0x10, 0x20, 0x00, 0x80, 0x00, 0x00, 0x04, 0x08, 0x00, 0x20, 0x00, 0x80};
vec_msum(vec_and(value, KS), K1, K0):
{0x00, 0x00, 0x00, 0x02, 0x00, 0x00, 0x00, 0xB0, 0x00, 0x00, 0x00, 0x0C, 0x00, 0x00, 0x00, 0xA0};
vec_perm of the vec_msum result with KP:
{0x02, 0x0C, 0xB0, 0xA0, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00};
vec_extract(..., 0) of the vec_perm result:
{0x02, 0x0C, 0xB0, 0xA0}
tmp.u16[0] + tmp.u16[1]:
{0xB2, 0xAC}
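The same walkthrough can be checked on any machine with a plain C model of the steps. This is only a sketch of the big-endian variant, not AltiVec code: `movemask_bytes_model` and `EXAMPLE` are hypothetical names, and the loop stands in for `vec_and` plus `vec_msum`.

```c
#include <stdint.h>

/* Input vector from the walkthrough above. */
static const uint8_t EXAMPLE[16] = {
    0x00, 0xff, 0x00, 0x00, 0xff, 0xff, 0x00, 0xff,
    0x00, 0x00, 0xff, 0xff, 0x00, 0xff, 0x00, 0xff
};

/* Hypothetical scalar model of the and / msum / perm / add sequence. */
static uint16_t movemask_bytes_model(const uint8_t value[16])
{
    static const uint8_t KS[8] = {1, 2, 4, 8, 16, 32, 64, 128};
    uint32_t word[4] = {0, 0, 0, 0};

    /* vec_and with KS, then vec_msum with K1: each group of four
     * masked bytes is summed into one 32-bit element. */
    for (int i = 0; i < 16; ++i)
        word[i / 4] += (uint32_t)(value[i] & KS[i % 8]);

    /* vec_perm with the big-endian KP picks the low bytes of
     * elements 0, 2, 1, 3; they form two big-endian 16-bit halves. */
    uint16_t u16_0 = (uint16_t)((word[0] << 8) | word[2]);
    uint16_t u16_1 = (uint16_t)((word[1] << 8) | word[3]);

    return (uint16_t)(u16_0 + u16_1);
}
```

For `EXAMPLE` the two halves are 0x020C and 0xB0A0, so the model returns 0xB2AC, matching the last line of the walkthrough.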
ErmIg
  • Whoa, thanks. But I guess I'll stay with the scalar code if there isn't a solution involving fewer instructions/constants. – Regis Portalez Nov 26 '15 at 13:25
  • Power7/8 has 64 vector registers. Constant vectors will stay in registers if they are used often. – ErmIg Nov 26 '15 at 13:32

Sounds like the vbpermq instruction (and vec_vbpermq() intrinsic) would be appropriate here. Given a vector of unsigned char "indices" (i.e., 0 - 127), it uses each index to select a bit from the source vector into an output vector. If the index is 128 or greater, a zero bit is used instead.

The 16 resulting bits are zero-extended to form a 64-bit value in the first doubleword of the result vector.

Something like this could work:

/*
 * our permutation indices: the MSbit from the first bool long long,
 * then the MSbit from the second bool long long, then the rest as
 * >=128 (which gives a zero bit in the result vector)
 */
vector unsigned char perm = { 0, 64, 128, 128, 128, /*...*/};

/* compare the two-item vector into two bools */
vcmp = (vector unsigned char)vec_cmplt(a, b);

/* select a bit from each of the result bools */
result = vec_vbpermq(vcmp, perm);

Getting the int out of the result vector will depend on what you want to do with it. If you need that as is, a vec_extract(result, 0) might work, but since you're only interested in the top two bits of the result, you may be able to simplify the perm constant, and/or shift the result as appropriate.

Also, be aware of endian considerations of your result.

vbpermq is described in section 5.15 of the PowerISA.
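To make the bit selection concrete, here is a scalar C model of the behaviour described above. It is a sketch under assumptions, not the intrinsic itself: `bpermq_model` and `two_lane_mask` are hypothetical names, bits are numbered big-endian (bit 0 is the most significant bit of byte 0), and the first index is assumed to land in the most significant of the 16 result bits.

```c
#include <stdint.h>

/* Hypothetical scalar model of the bit permute: for each of the 16
 * index bytes, pick the addressed bit of src, or 0 if the index is
 * 128 or greater. Assumption: index 0 ends up in bit 15 of the result. */
static uint16_t bpermq_model(const uint8_t src[16], const uint8_t idx[16])
{
    uint16_t out = 0;
    for (int i = 0; i < 16; ++i) {
        unsigned bit = 0;
        if (idx[i] < 128)
            bit = (src[idx[i] / 8] >> (7 - idx[i] % 8)) & 1u;
        out = (uint16_t)((out << 1) | bit);
    }
    return out;
}

/* Hypothetical wrapper for the two-element case: returns (x << 1) | y,
 * where x and y are the MSBs of lane0 and lane1. */
static int two_lane_mask(uint64_t lane0, uint64_t lane1)
{
    uint8_t src[16], idx[16];

    /* Lay the two lanes out as big-endian bytes. */
    for (int i = 0; i < 8; ++i) {
        src[i]     = (uint8_t)(lane0 >> (56 - 8 * i));
        src[8 + i] = (uint8_t)(lane1 >> (56 - 8 * i));
    }

    /* Indices 0 and 64 address the MSB of each lane; the rest give 0. */
    idx[0] = 0;
    idx[1] = 64;
    for (int i = 2; i < 16; ++i)
        idx[i] = 128;

    /* Only the top two of the 16 result bits can be set. */
    return bpermq_model(src, idx) >> 14;
}
```

In this model `two_lane_mask(UINT64_MAX, 0)` gives 2 and `two_lane_mask(0, UINT64_MAX)` gives 1. The shift by 14 is one way to bring the two interesting bits down to the bottom of the int; on real hardware the position of the result within the vector still depends on endianness.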

Jeremy Kerr