
The following intrinsic does not seem to be available in AVX-512:

__m512i _mm512_sign_epi16 (__m512i a, __m512i b)

Will it be available soon, or is there an alternative?

  • Interesting, you're right, the `psignw` / `vpsignw` asm instructions (https://www.felixcloutier.com/x86/psignb:psignw:psignd) only go up to AVX2. Looking for an equivalent... – Peter Cordes Apr 18 '19 at 09:21
  • I cannot find an equivalent. It seems we need several intrinsics to emulate `psignw` / `vpsignw`. Could you help me? – yueluojieying Apr 18 '19 at 09:33
  • If a single-instruction replacement exists, it doesn't have `sign` in its mnemonic or one-line description in the asm instruction-set reference extracted from Intel's vol. 2 PDF (https://www.felixcloutier.com/x86/index.html); I looked through the matches. (And found some neat stuff I'd forgotten existed, like `vrangeps`, which can clamp float magnitude without affecting sign, vs. needing 2 separate min/max instructions.) – Peter Cordes Apr 18 '19 at 09:33
  • The most obvious workaround would be to split `a` and `b` into two halves, apply `vpsignw`, and join the results again (a sketch of this appears after these comments). Not sure if there are more efficient alternatives. – chtz Apr 18 '19 at 09:34
  • @chtz: If you don't need the zeroing part, you can compare (or `test`) into a mask and do a merge-masked subtract from zero to negate a vector based on the signs of another. – Peter Cordes Apr 18 '19 at 09:37
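
For reference, a minimal sketch of chtz's split-and-rejoin idea might look like the following, assuming an AVX-512BW target (which also has AVX2); the helper name is made up for illustration:

#include <immintrin.h>

#ifdef __AVX512BW__
// Sketch of the split idea: run AVX2 vpsignw on each 256-bit half of the
// inputs, then reassemble the 512-bit result. Function name is hypothetical.
static inline __m512i mm512_sign_epi16_split(__m512i a, __m512i b) {
    __m256i lo = _mm256_sign_epi16(_mm512_castsi512_si256(a),
                                   _mm512_castsi512_si256(b));       // low halves
    __m256i hi = _mm256_sign_epi16(_mm512_extracti64x4_epi64(a, 1),
                                   _mm512_extracti64x4_epi64(b, 1)); // high halves
    // insert the high half over the (undefined) upper bits of the cast
    return _mm512_inserti64x4(_mm512_castsi256_si512(lo), hi, 1);
}
#endif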

2 Answers


If you don't need the zeroing part, you only need 2 instructions (and a zeroed register):

You can `_mm512_movepi16_mask()` the sign bits into a mask (the AVX-512 version of `pmovmskb`), and do a merge-masked subtract from zero to negate a vector based on the signs of another.

#ifdef __AVX512BW__
// does *not* do anything special for signs[i] == 0, just negative / non-negative
__m512i  conditional_negate(__m512i target, __m512i signs) {
    __mmask32 negmask = _mm512_movepi16_mask(signs);
      // vpsubw target{k1}, 0, target
    __m512i neg = _mm512_mask_sub_epi16(target, negmask, _mm512_setzero_si512(), target);
    return neg;
}
#endif

Vector -> mask has 3-cycle latency on Skylake-X (with `vpmovw2m`, `vptestmw`, or `vpcmpw`), but using the mask only adds another 1 cycle of latency. So the latencies from inputs to outputs are:

  • 4 cycles from signs -> result on SKX
  • 1 cycle from target -> result on SKX (just the masked `vpsubw` from zero).

To also apply the is-zero condition: you may be able to zero-mask or merge-mask the next operation you do with the vector, so the elements that should have been zero are unused.

You need an extra compare to create another mask, but you probably don't need to waste a 2nd extra instruction to apply it right away.

If you really want to build a self-contained `vpsignw` this way, you can do that final zero-masking, but it takes 4 intrinsics that compile to 4 instructions, and it's probably worse for throughput than @wim's min/max/multiply. It does have good critical-path latency, though: about 5 cycles total on SKX (or 4 if you can fold the final masking into something else). The critical path is signs -> mask, then the masked subtract; the signs -> nonzeromask compare can run in parallel with either of those.

__m512i  mm512_psignw(__m512i target, __m512i signs) {
    __mmask32 negmask = _mm512_movepi16_mask(signs);
      // vpsubw target{negmask}, 0, target  merge masking to only modify elements that need negating
    __m512i neg = _mm512_mask_sub_epi16(target, negmask, _mm512_setzero_si512(), target);

    __mmask32 nonzeromask = _mm512_test_epi16_mask(signs,signs);  // per-element non-zero?
    return  _mm512_maskz_mov_epi16(nonzeromask, neg);        // zero elements where signs was zero
}

Possibly the compiler can fold this zero-masking `vmovdqu16` intrinsic into merge-masking for a later add/or/xor, or zero-masking for a multiply/and, but it's probably a good idea to do that yourself. A sketch of that folding follows.
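
As a hedged illustration of that folding (the function name and the follow-on add are hypothetical, just to show the shape), the is-zero mask can be applied as merge-masking on the next operation instead of as a separate `vmovdqu16`:

#include <immintrin.h>

#ifdef __AVX512BW__
// Hypothetical example: compute addend + psignw(target, signs) without a
// separate zero-masking move. Where signs[i] == 0, the merge-masked add
// keeps addend[i] unchanged, which matches adding a zeroed element.
__m512i add_psignw(__m512i addend, __m512i target, __m512i signs) {
    __mmask32 negmask = _mm512_movepi16_mask(signs);
    __m512i neg = _mm512_mask_sub_epi16(target, negmask, _mm512_setzero_si512(), target);
    __mmask32 nonzeromask = _mm512_test_epi16_mask(signs, signs);    // per-element signs != 0
    return _mm512_mask_add_epi16(addend, nonzeromask, addend, neg);  // merge-masked vpaddw
}
#endif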

Peter Cordes
  • Good alternatives. Did you notice that the `vpmovw2m` and `vptestmw` instructions have a 3-cycle latency on Skylake-X (see [uops.info](http://www.uops.info/index.html))? That would make the critical-path latency about 5 cycles total (or 4, if it is possible to fold the final masking into something else), which is still 2 (or 3) cycles shorter than my solution. – wim Apr 18 '19 at 14:35
  • @wim: oops, no, I foolishly assumed single-cycle latency, like for AVX2 compare into a vector. It seems `vpcmpw` is also 3 cycles. Arithmetic right shift can broadcast the sign bit, but I don't think that helps us except as a setup to XOR/ADD for absolute value, and `vpabsw zmm` does still exist. – Peter Cordes Apr 18 '19 at 14:46

A possible solution is:

__m512i mm512_sign_epi16(__m512i a, __m512i b){
    /* Emulate _mm512_sign_epi16() with instructions  */
    /* that exist in the AVX-512 instruction set      */
    b = _mm512_min_epi16(b, _mm512_set1_epi16(1));     /* clamp b between -1 and 1 */
    b = _mm512_max_epi16(b, _mm512_set1_epi16(-1));    /* now b = -1, 0 or 1       */
    a = _mm512_mullo_epi16(a, b);                      /* apply the sign of b to a */
    return a;
}

This solution should have reasonable throughput, but the latency might not be optimal because of the integer multiply. A good alternative is Peter Cordes' solution, which has better latency. In practice, though, high throughput usually matters more than low latency.

Anyway, the actual performance of the different alternatives (the solution here, Peter Cordes' answer, and the splitting idea in chtz' comment) depends on the surrounding code and the type of CPU that executes the instructions. You'll have to benchmark the alternatives to see which one is fastest in your particular case.
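
Before timing anything, a tiny correctness check (not a benchmark) along these lines could verify whichever emulation you pick against scalar `psignw` semantics. This sketch assumes an AVX-512BW machine and the `mm512_sign_epi16()` defined above:

#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>

// scalar reference for psignw semantics: negate, zero, or keep a[i] based on b[i]
static int16_t scalar_sign_epi16(int16_t a, int16_t b) {
    return (b < 0) ? (int16_t)(-a) : (b == 0 ? 0 : a);
}

int main(void) {
    int16_t a[32], b[32], r[32];
    for (int i = 0; i < 32; i++) {
        a[i] = (int16_t)(i * 1000 - 16000);  // mix of negative and positive values
        b[i] = (int16_t)(i - 16);            // negative, zero, and positive signs
    }
    __m512i va = _mm512_loadu_si512(a);
    __m512i vb = _mm512_loadu_si512(b);
    _mm512_storeu_si512(r, mm512_sign_epi16(va, vb));
    for (int i = 0; i < 32; i++) {
        if (r[i] != scalar_sign_epi16(a[i], b[i])) {
            printf("mismatch at element %d\n", i);
            return 1;
        }
    }
    printf("ok\n");
    return 0;
}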

wim
  • I doubt that shuffling wins; you'd need to extract *both* inputs (2x `VEXTRACTI32x8` + 2x AVX2 `vpsignw`), then shuffle the output back together (1x `VINSERTI32x8`). See my answer for a 4-instruction low-latency solution with masking. Also, on Skylake-AVX512, only 2 vector ALU ports are active for 512-bit vectors, and they both support multiply uops. (Or one of them does, on Xeon Bronze or w/e with only one FMA unit.) So uop count is more important, relative to latency, than it used to be. – Peter Cordes Apr 18 '19 at 12:15
  • @PeterCordes: You are right. I forgot that two `VEXTRACTI32x8`s are needed in chtz' case. – wim Apr 18 '19 at 14:34