
I want to see if it's possible to write some generic SIMD code that compiles efficiently, mostly for SSE, AVX, and NEON. A simplified version of the problem is: find the maximum absolute value of an array of floating point numbers and return both the value and the index. It is the last part, the index of the maximum, that causes the problem: there doesn't seem to be a very good generic way to write the code that needs a branch.
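
For reference, the scalar version of the operation looks like this (illustrative only; maxabs_scalar is just a name for this sketch):

#include <cmath>

// Scalar reference: largest absolute value in data[0..n), and its index.
static float maxabs_scalar(const float* data, unsigned n, unsigned* index) {
    float max = 0.0f;
    unsigned imax = 0;
    for (unsigned i = 0; i < n; ++i) {
        float t = std::fabs(data[i]);
        if (t > max) { max = t; imax = i; }
    }
    *index = imax;
    return max;
}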

See update at end for finished code using some of the suggested answers.

Here's a sample implementation (more complete version on godbolt):

#define VLEN 8
typedef float vNs __attribute__((vector_size(VLEN*sizeof(float))));
typedef int vNb __attribute__((vector_size(VLEN*sizeof(int))));
#define SWAP128 4,5,6,7, 0,1,2,3
#define SWAP64 2,3, 0,1,  6,7, 4,5
#define SWAP32 1, 0,  3, 2,  5, 4,  7, 6

static bool any(vNb x) {
    x = x | __builtin_shufflevector(x,x, SWAP128);
    x = x | __builtin_shufflevector(x,x, SWAP64);
    x = x | __builtin_shufflevector(x,x, SWAP32);
    return x[0];
}

float maxabs(float* __attribute__((aligned(32))) data, unsigned n, unsigned *index) {
    vNs max = {0,0,0,0,0,0,0,0};
    vNs tmax;
    unsigned imax = 0;
    for (unsigned i = 0 ; i < n; i += VLEN) {
        vNs t = *(vNs*)(data + i);
        t = -t < t ? t : -t;  // Absolute value
        vNb cmp = t > max;
        if (any(cmp)) {
            tmax = t; imax = i;
            // broadcast horizontal max of t into every element of max
            vNs tswap128 = __builtin_shufflevector(t,t, SWAP128);
            t = t < tswap128 ? tswap128 : t;
            vNs tswap64 = __builtin_shufflevector(t,t, SWAP64);
            t = t < tswap64 ? tswap64 : t;
            vNs tswap32 = __builtin_shufflevector(t,t, SWAP32);
            max = t < tswap32 ? tswap32 : t;
        }
    }
    // To simplify example, ignore finding index of true value in tmax==max
    *index = imax; // + which(tmax == max);
    return max[0];
}

Code on godbolt allows changing VLEN to 8 or 4.

This mostly works very well. For AVX/SSE the absolute value becomes t & 0x7fffffff using a (v)andps, i.e. clear the sign bit. For NEON it's done with vneg + fmaxnm, though gcc is able to use the NEON fabs instruction instead. The block to find and broadcast the horizontal max becomes an efficient sequence of permute and max instructions.

The 8-element vector on the 4-element SSE/NEON targets works well with clang. It uses a pair of instructions on two sets of registers, and for the SWAP128 horizontal op it will max or OR the two registers together without any unnecessary permute. gcc, on the other hand, really can't handle this and produces mostly non-SIMD code. If we reduce the vector length to 4, gcc works fine for SSE and NEON.

But there's a problem with if (any(cmp)). For clang + SSE/AVX it works well: vcmpltps + vptest, with an orps to go from 8 to 4 elements on SSE.

But gcc, and both compilers on NEON, do all the permutes and ORs, then move the result to a general-purpose register to test.

Is there some bit of code, other than architecture-specific intrinsics, to get ptest with gcc, and vmaxvq with gcc/clang on NEON?

I tried some other methods, like if (x[0] || x[1] || ... x[7]) but they were worse.
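
For reference, the kind of NEON-specific code I'd like the compiler to produce for any() looks something like this (a sketch with intrinsics, assuming a 4-element int compare result; any_neon is just a name for this sketch):

#include <arm_neon.h>

// Sketch: reduce the 0 / -1 compare result with a single umaxv and test it.
static bool any_neon(int32x4_t cmp) {
    return vmaxvq_u32(vreinterpretq_u32_s32(cmp)) != 0;
}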

Update

I've created an updated example that shows two different implementations, both the original and "indices in a vector" method as suggested by chtz and shown in Aki Suihkonen's answer. One can see the resulting SSE and NEON output.

While some might be skeptical, the compiler does produce very good code from the generic SIMD (not auto-vectorization!) C++ code. On SSE/AVX, I see very little room to improve the code in the loop. The NEON version is still troubled by a sub-optimal implementation of "any()".

Unless the data is usually in ascending order, or nearly so, my original version is still fastest on SSE/AVX. I haven't tested on NEON. This is because most loop iterations do not find a new max value and it's best to optimize for that case. The "indices in a vector" method produces a tighter loop and the compiler does a better job too, but the common case is just a bit slower on SSE/AVX. The common case might be equal or faster on NEON.

Some notes on writing generic SIMD code.

The absolute value of a vector of floats can be found with the following. It produces optimal code on SSE/AVX (an and with a mask that clears the sign bit) and on NEON (the fabs instruction).

static vNs vabs(vNs x) {
    return -x < x ? x : -x;
}

This will do a vertical max efficiently on SSE/AVX/NEON. It doesn't do a compare; it produces the architecture's "max" instruction. On NEON, changing it to use > instead of < causes the compiler to produce very bad scalar code. Something to do with denormals or exceptions, I guess.

template <typename v>  // Deduce vector type (float, unsigned, etc.)
static v vmax(v a, v b) {
    return a < b ? b : a; // compiles best with "<" as compare op
}

This code will broadcast the horizontal max across a register. It compiles very well on SSE/AVX. On NEON, it would probably be better if the compiler could use a horizontal max instruction and then broadcast the result. I was impressed to see that if one uses 8-element vectors on SSE/NEON, which have only 4-element registers, the compiler is smart enough to use just one register for the broadcast result, since the top 4 and bottom 4 elements are the same.

template <typename v>
static v hmax(v x) {
    if (VLEN >= 8)
        x = vmax(x, __builtin_shufflevector(x,x, SWAP128));
    x = vmax(x, __builtin_shufflevector(x,x, SWAP64));
    return vmax(x, __builtin_shufflevector(x,x, SWAP32));
}

This is the best "any()" I found. It is optimal on SSE/AVX, using a single ptest instruction. On NEON it does the permutes and ORs, instead of a horizontal max instruction, but I haven't found a way to get anything better on NEON.

static bool any(vNb x) {
    if (VLEN >= 8)
        x |= __builtin_shufflevector(x,x, SWAP128);
    x |= __builtin_shufflevector(x,x, SWAP64);
    x |= __builtin_shufflevector(x,x, SWAP32);
    return x[0];
}

Also interesting: on AVX, the vector increment i = i + 1 will be compiled to vpsubd ymmI, ymmI, ymmNegativeOne, i.e. subtract -1. Why? Because a vector of all -1s is produced with vpcmpeqd ymm0, ymm0, ymm0, and that's faster than broadcasting a vector of 1s.

Here is the best which() I've come up with. It gives you the index of the first true value in a vector of booleans (0 = false, -1 = true). One can do somewhat better on AVX with movemask (see the sketch after the code below); I don't know the best approach on NEON.

#include <cstdint>
#include <type_traits>

// vector of signed ints
typedef int vNi __attribute__((vector_size(VLEN*sizeof(int))));
// vector of bytes, same number of elements, 1/4 the size
typedef unsigned char vNb __attribute__((vector_size(VLEN*sizeof(unsigned char))));
// scalar type the same size as the byte vector
using sNb = std::conditional_t<VLEN == 4, uint32_t, uint64_t>;
static int which(vNi x) {
    // Narrow each 0 / -1 int lane to a byte, reinterpret the byte vector as one
    // integer, then count trailing zero bits and convert bit index to byte index.
    vNb cidx = __builtin_convertvector(x, vNb);
    return __builtin_ctzll((sNb)cidx) / 8u;
}
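
For comparison, a movemask-based AVX variant might look like the following (a sketch with architecture-specific intrinsics; which_avx is a hypothetical name, and it assumes VLEN == 8 so that vNi fills one 256-bit register):

#include <immintrin.h>
#include <cstring>

// Sketch: extract one bit per 32-bit lane with movemask, then ctz of the mask
// gives the index of the first true lane. Undefined if no lane is true.
static int which_avx(vNi x) {
    __m256i v;
    std::memcpy(&v, &x, sizeof(v));   // reinterpret the generic vector as __m256i
    unsigned mask = (unsigned)_mm256_movemask_ps(_mm256_castsi256_ps(v));
    return __builtin_ctz(mask);
}
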
TrentP
  • Why do you search for the horizontal max of the vector inside the loop and not at the very end? I.e., also keep an 8-sized index register which contains the indexes of the maximum of all `k + 8*i` elements (for `k=0..8`). – chtz Jan 08 '22 at 00:16
  • You can alternatively play with movemask+popcount instructions for the any so to find the location faster (and do the check). I guess this could be faster, but this likely requires VLEN to be bigger. – Jérôme Richard Jan 08 '22 at 01:03
  • I thought of horizontal max at the end, but didn't see a way to keep track of which element was at which `i`. But vector index register, that works, and can avoid the `if(any(cmp))` (which I still would like to optimize). It becomes `max = cmp ? t : max;` and `imax = cmp ? i : imax;` Then at end find the horizontal max of `max` and the corresponding element of `imax`. – TrentP Jan 08 '22 at 01:25
  • I haven't found a way to get a movemask without using architecture specific intrinsics. movemask + ctz gives the best `which()` I've found and it would be efficient for `any()`. – TrentP Jan 08 '22 at 02:44
  • Your edit looks like something you should post as an *answer*, not part of the question. – Peter Cordes Jan 14 '22 at 22:44
  • I thought about it, but didn't think it added that much beyond Aki's answer, other than actually fully implementing and testing it. And what I'm really after, a good implementation of "any()", is no better off than before. – TrentP Jan 15 '22 at 18:05
  • *didn't think it added that much beyond Aki's answer, other than actually fully implementing and testing it.* - That's still what answers are for. Although I see now that most of the stuff below the "update" header looks like attempts, which can be part of a question. But they could be moved to an answer to unclutter the question some, along with the part that does fit better as an answer. Your question could mention your answer having some attempts, so readers don't have to scroll through so much question before getting to the answers. – Peter Cordes Jan 24 '22 at 02:39

3 Answers


As commented by chtz, the most generic and typical method is to have another mask to gather indices:

Vec8s indices = { 0,1,2,3,4,5,6,7 };
Vec8s max_idx = indices;
Vec8f max_abs = abs(load8(ptr));

for (auto i = 8; i + 8 <= vec_length; i += 8) {
    indices = indices + 8;                    // indices of the block loaded below
    Vec8f data = abs(load8(ptr + i));
    auto mask = is_greater(data, max_abs);
    max_idx = bitselect(mask, indices, max_idx);
    max_abs = max(max_abs, data);
}
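
Rendered in the question's generic vector syntax, the same loop might look roughly like this (a sketch reusing the vNs/vNi types and the vabs/vmax helpers from the question, assuming VLEN == 8 and n a multiple of VLEN; the final reduction is left scalar for brevity):

static float maxabs_idx(float* data, unsigned n, unsigned* index) {
    vNi idx = {0,1,2,3,4,5,6,7};        // per-lane element indices
    vNi max_idx = idx;
    vNs max = vabs(*(vNs*)data);
    for (unsigned i = VLEN; i < n; i += VLEN) {
        idx += VLEN;                     // indices of the block loaded below
        vNs t = vabs(*(vNs*)(data + i));
        vNi cmp = t > max;               // per-lane 0 / -1 mask
        max_idx = cmp ? idx : max_idx;   // blend in new indices where t wins
        max = vmax(max, t);
    }
    // Scalar horizontal reduction over the VLEN lanes
    float best = max[0];
    unsigned besti = (unsigned)max_idx[0];
    for (int k = 1; k < VLEN; ++k)
        if (max[k] > best) { best = max[k]; besti = (unsigned)max_idx[k]; }
    *index = besti;
    return best;
}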

Another option is to interleave the values and indices:

auto data = load8s(ptr) & 0x7fffffff; // can load data as int32_t
auto idx = vec8s{0,1,2,3,4,5,6,7};
auto lo = zip_lo(idx, data);
auto hi = zip_hi(idx, data);

for (int i = 8; i + 8 <= size; i+=8) {
    idx = idx + 8;
    auto d1 = load8s(ptr + i) & 0x7fffffff;
    auto lo1 = zip_lo(idx, d1);
    auto hi1 = zip_hi(idx, d1);
    lo = max_u64(lo, lo1);
    hi = max_u64(hi, hi1);
}

This method is especially lucrative if the range of inputs is small enough to allow shifting the input left, appending a few bits from the index to the LSB bits of the same word.

Even in this case we can repurpose one bit in the float, allowing us to save half of the bit/index selection operations:

auto data0 = load8u(ptr) << 1; // take abs by shifting left 
auto data1 = (load8u(ptr + 8) << 1) + 1;  // encode odd index to data
auto mx = max_u32(data0, data1);  // the LSB contains one bit of index

Looks like one can use double as the storage, since even SSE2 supports _mm_max_pd (some attention needs to be given to Inf/NaN handling, since those values no longer encode as Inf/NaN when reinterpreted as the high part of a 64-bit double).
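
As a scalar illustration of the packing idea (not the _mm_max_pd code itself; the names below are hypothetical): for non-negative floats the IEEE-754 bit pattern orders the same way as the value, so the float's bits can sit in the high half of a 64-bit word with the index in the low half, and an ordinary max over the packed words carries the winning index along.

#include <cstdint>
#include <cstring>

// Pack a non-negative float (e.g. an absolute value) and its index into one
// 64-bit word; max over packed words then selects both value and index.
static uint64_t pack(float abs_value, uint32_t index) {
    uint32_t bits;
    std::memcpy(&bits, &abs_value, sizeof(bits));
    return ((uint64_t)bits << 32) | index;
}
// Unpack after the reduction: index = (uint32_t)packed, value bits = packed >> 32.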

Aki Suihkonen
  • I benchmarked having another vector to gather the indexes. This results in `vmovd` + `vpbroadcastd` to get index in a register, `vcmpps` + `vpblendvb` to update index vector, and `vmaxps` to update max data vector. It is these 5 instructions for a new max or not. But with `if(any(t > max))`, there is only `vcmpps`, `vtestps`, `je` when there isn't a new max. It's faster if there are few new maxes, i.e. data in descending order. – TrentP Jan 08 '22 at 09:25
  • @TrentP: There's no need for any broadcasting in the loop. Did you not make `idx` a vector of counts like Aki is showing, to be updated with `paddd`? I wanted to see how it would compile with Agner Fog's vectorclass wrappers that look somewhat like GNU C native vector syntax: https://godbolt.org/z/nG8Ys7hb1 shows a very reasonable branchless loop, 5 SIMD instructions (6 uops on SKL from un-lamination of the indexed addr mode on vandps), and 2 uops of loop overhead (add + macro-fused cmp/jcc). The bottleneck is the maxps dep chain; fix that by using more accumulators in parallel. – Peter Cordes Jan 08 '22 at 11:39
  • (Correction, vblendvps is 2 uops on Skylake, same as vpblendvb. The legacy-SSE versions of each are only 1 uop, but then the loop would have needed an extra movups somewhere. https://uops.info/) – Peter Cordes Jan 08 '22 at 14:59
  • No, I didn't use the vector of current indices. I'll rebenchmark that. That should trade `vmovd` + `vpbroadcastd` into `vpadd`. Better, still more instructions in the no new max case, but there is no branch and they are generally faster instructions. – TrentP Jan 09 '22 at 05:58
  • Still similar result: if there are few maxes, it's faster to optimize the common case with `if(any(t > max))` than to keep track of max indexes in a vector and avoid the horizontal max in the loop. I think move and broadcast vs add doesn't really make much difference, because the broadcast's latency can be hidden. – TrentP Jan 10 '22 at 00:22
  • Just out of curiosity -- is the vector uniformly random, or is it hard to predict? The toughest case would probably be 50% probability of magnitude increasing. I don't have an intel system to try it on. – Aki Suihkonen Jan 10 '22 at 04:42
  • They are not random, but also not sorted in ascending order. If each value is i.i.d. then the max is approached quickly. The data I'm looking at isn't i.i.d. Approx. 1 out of 500 values is a new max for about 20 new maxes, after which the rate of new maxes is effectively zero. Of course the rate of new maxes starts high (1 out of 1) and drops to zero eventually, averaging to about 1 out of 500 by the time it reaches zero. – TrentP Jan 14 '22 at 19:30

UPD: the no-aligning issue is fixed now, all the examples on godbolt use aligned reads.

UPD: MISSED THE ABS

Terribly sorry about that, I missed the absolute value from the definition. I do not have the measurements, but here are all 3 functions vectorised:

Asm stashed in a gist

On the method

The way to do max element with SIMD is to first find the value and then find the index.

Alternatively, you have to keep a register of indexes and blend the indexes. This requires keeping indexes, doing more operations, and addressing the problem of index overflow.

Here are my timings on AVX2 by type (char, short and int) for 10'000 bytes of data, comparing the two methods.

min_element is my implementation of keeping the index; reduce(min) + find does two loops - first get the value, then find where it is.

For ints (which should behave like floats), the two-loop solution is 25% faster, at least in my measurements.

For completeness, comparisons against scalar for both methods - this is definitely an operation that should be vectorized.

comparing to scalar

How to do it

Finding the maximum value is auto-vectorised across all platforms if you write it as a reduce:

if (!arr.size()) return {};

// std::reduce is also ok, just showing for more C ppl
float res = arr[0];
for (int i = 1; i != (int)arr.size(); ++i) {
    res = res > arr[i] ? res : arr[i];
}

return res;

https://godbolt.org/z/EsazWf1vT

Now the find portion is trickier; none of the compilers I know autovectorizes find.

We have the eve library, which provides you with a find algorithm: https://godbolt.org/z/93a98x6Tj

Or I explain how to implement find in this talk if you want to do it yourself.
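
For reference, a minimal do-it-yourself sketch of a vectorised find, written in the question's generic vector syntax (assuming the question's vNs/vNi types, any()/which() helpers that take the int compare-result vector, and a length that is a multiple of VLEN; find_value is just a name for this sketch):

static unsigned find_value(const float* data, unsigned n, float value) {
    vNs v;
    for (int k = 0; k < VLEN; ++k)
        v[k] = value;                         // broadcast the needle to every lane
    for (unsigned i = 0; i < n; i += VLEN) {
        vNi cmp = *(vNs*)(data + i) == v;     // per-lane 0 / -1 mask
        if (any(cmp))
            return i + which(cmp);            // index of the first matching lane
    }
    return n;                                 // not found
}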

UPD:
UPD2: changed the blend to max

@Peter Cordes said in the comments that there may be a point to doing the one-pass solution for bigger data. I have no evidence of this - my measurements point to reduce + find.

However, I hacked together a rough version of what keeping the index looks like (there was an aligning issue at first, now fixed as noted at the top; we should definitely align reads here): https://godbolt.org/z/44jYf8qPj

AVX2 main loop:

.L6:
        vmovups ymm6, YMMWORD PTR [rdx]
        add     rdx, 32
        vcmpps  ymm3, ymm6, ymm0, 30
        vmaxps  ymm0, ymm6, ymm0
        vpblendvb       ymm3, ymm2, ymm1, ymm3
        vpaddd  ymm1, ymm5, ymm1
        vmovdqa ymm2, ymm3
        cmp     rcx, rdx
        jne     .L6

ARM-64 main loop:

.L6:
        ldr     q3, [x0], 16
        fcmgt   v4.4s, v3.4s, v0.4s
        fmax    v0.4s, v3.4s, v0.4s
        bit     v1.16b, v2.16b, v4.16b
        add     v2.4s, v2.4s, v5.4s
        cmp     x0, x1
        bne     .L6

Links to ASM if godbolt becomes stale: https://gist.github.com/DenisYaroshevskiy/56d82c8cf4a4dd5bf91d58b053ea80f2

Denis Yaroshevskiy
  • You're timing with small arrays, right? Where the 2nd loop over the data still hits in L1d cache. It would be different for large arrays, where memory or L3$ bandwidth is the bottleneck and you'd have enough CPU cycles per vector from memory to update a vector of indices. (And it also makes sure you don't consume extra system-wide bandwidth.) So I guess maybe branch on size to pick a strategy, if your arrays are sometimes small, sometimes large? – Peter Cordes Jan 22 '22 at 19:57
  • Oh, interesting point about `char` and `short` data, too, where those are narrower than even an `int` index. Yeah that would favour 2-pass more, or for large arrays maybe a branchy vector strategy (only unpack and blend if the last vector contained any new maxes; with only 256 unique char values such a branch could be taken at worst 256 x 32 times, if each SIMD element saw a new max that's 1 higher, and never in the same vector. And usually probably a lot less) – Peter Cordes Jan 22 '22 at 20:04
  • For shorts I think it's fine - you get to 64k before overflow, so once every 64k you deal with it. For chars one can convert to shorts as you say. I did measurements completely ignoring the overflow; still the reduce + find was faster. I will post a version of the code with indexes if you think it's beneficial. – Denis Yaroshevskiy Jan 22 '22 at 21:45
  • There are a lot of variations on this possible for different combinations of total array size vs. element size. If you have a bunch of working versions, a link to them on github or somewhere is probably better than cluttering up the answer. Although having *one* optimized version fully in your answer is probably a good idea. Or at least a "full" (not shortened) Godbolt link, to guard against link rot. (But good point about `short`; yeah an outer loop to sort out the every-64k stuff allows 16-bit indices in the inner loop; that's better than my first thought.) – Peter Cordes Jan 23 '22 at 00:08
  • Full version is a lot of work. I think a link to the explanation should be enough. – Denis Yaroshevskiy Jan 23 '22 at 00:51
  • Chars with short idxs look smth like this: https://godbolt.org/z/16reovn4h L7 is the main loop I believe. – Denis Yaroshevskiy Jan 23 '22 at 00:53
  • For chars, maybe cache-block your 2-loop idea: search up to 8K or 16KiB, then, if it contained a new `max` for any element, come back and look for the position like memchr. (Or actually call `memchr`, so you can use that existing target-optimized building block.) I guess that would require horizontal max in the outer loop to know which specific char to search for. Or 2k or 4k might be good tuning sizes. – Peter Cordes Jan 23 '22 at 01:23
  • That .L7 loop seems to be using `vpblendvb` (2 uops on most CPUs) on a `vpcmpgtb` compare result instead of `vpmaxb` which is only 1 uop and can run in parallel. So that's unfortunate. I think cache-blocked 2-pass will be best of both worlds for `char` data. (Unless the other logical thread bottlenecks on L1d bandwidth but not the shuffle port, and this max-finding isn't really on the critical path so it's ok to run it slower but more friendly to the other thread. IDK if there's ever a realistic use-case for doing that.) – Peter Cordes Jan 23 '22 at 01:34
  • > vpmaxb I did not know which one is better. clang does maxb and that's what I measured. Thx for the insight: https://godbolt.org/z/7r9oqvMrc – Denis Yaroshevskiy Jan 23 '22 at 01:51
  • `vpmaxsb` is clearly better on Skylake, Ice Lake, Alder Lake, and Zen2 at least. On SKL for example, which is probably the most common uarch around these days. 1 uop for p01, vs. 2 uops for p015. (Or fewer ports on Haswell). And on Alder Lake P-cores, it's 3 uops for p015; E-cores run it as 8 uops. Zen2 runs it as a single uop, but only on one port, vs 3/clock for `vpmaxsb`. https://uops.info/table.html In general, *variable* blends are not as cheap as simple integer arithmetic compare or min/max, partly because they have 3 inputs (and a 4th operand for a separate output). – Peter Cordes Jan 23 '22 at 01:56
  • Updated to use max – Denis Yaroshevskiy Jan 23 '22 at 02:23
  • You aren't taking the max absolute value, though. The need to do a computation changes the two-pass method because now you need to do it twice. – TrentP Jan 23 '22 at 07:53
  • @TrentP you are absolutely right, added. – Denis Yaroshevskiy Jan 23 '22 at 11:34
  • I wanted to benchmark this vs the version in my answer. I've installed the eve main branch (looks like the eve devs don't use git tags, so I can't provide git describe output) and your code is looking for `eve/views/iota.hpp`, which doesn't appear to be part of the latest eve. Does it need a specific version? – TrentP Jan 23 '22 at 21:38
  • Used `develop` branch of eve and got a build working. Also appears one can't use Eve on a `const float` span. Tested on Zen3, clang trunk (llvmorg-14-init-16962). Used a 1M element array, repeating operation 100k times, timed used performance counter cpu cycles, median of 10 runs, times in giga CPU cycles. My 1st version (branch), 21.75. 2nd version (max + index), 23.68. And your version, 21.14. Little bit faster! But, I think avoiding architecture specific intrinsics by using a library of architecture specific intrinsics might be cheating. – TrentP Jan 23 '22 at 23:36
  • @TrentP - please report the const issue? Should work. Also if you need help - feel free to report the issue too. https://github.com/jfalcou/eve/issues I will do generic min_element, max_element but it will take me a while (will update this question). BTW: cheating is good come on. – Denis Yaroshevskiy Jan 24 '22 at 13:44
  • @TrentP - fixed the const issue. That godbolt implementation is not terribly tested to say the least. – Denis Yaroshevskiy Jan 24 '22 at 15:07

I don’t believe that’s possible. Compilers aren’t smart enough to do that efficiently.

Compare the other answer (which uses NEON-like pseudocode) with the SSE version below:

// Compare vector absolute value with aa, if greater update both aa and maxIdx
inline void updateMax( __m128 vec, __m128i idx, __m128& aa, __m128& maxIdx )
{
    vec = _mm_andnot_ps( _mm_set1_ps( -0.0f ), vec );
    const __m128 greater = _mm_cmpgt_ps( vec, aa );
    aa = _mm_max_ps( vec, aa );
    // If you don't have SSE4, emulate with bitwise ops: and, andnot, or
    maxIdx = _mm_blendv_ps( maxIdx, _mm_castsi128_ps( idx ), greater );
}

float maxabs_sse4( const float* rsi, size_t length, size_t& index )
{
    // Initialize things
    const float* const end = rsi + length;
    const float* const endAligned = rsi + ( ( length / 4 ) * 4 );

    __m128 aa = _mm_set1_ps( -1 );
    __m128 maxIdx = _mm_setzero_ps();
    __m128i idx = _mm_setr_epi32( 0, 1, 2, 3 );

    // Main vectorized portion
    while( rsi < endAligned )
    {
        __m128 vec = _mm_loadu_ps( rsi );
        rsi += 4;
        updateMax( vec, idx, aa, maxIdx );
        idx = _mm_add_epi32( idx, _mm_set1_epi32( 4 ) );
    }

    // Handle the remainder, if present
    if( rsi < end )
    {
        __m128 vec;
        if( length > 4 )
        {
            // The source has at least 5 elements
            // Offset the source pointer + index back, by a few elements
            const int offset = (int)( 4 - ( length % 4 ) );
            rsi -= offset;
            idx = _mm_sub_epi32( idx, _mm_set1_epi32( offset ) );
            vec = _mm_loadu_ps( rsi );
        }
        else
        {
            // The source was smaller than 4 elements, copy them into temporary buffer and load vector from there
            alignas( 16 ) float buff[ 4 ];
            _mm_store_ps( buff, _mm_setzero_ps() );
            for( size_t i = 0; i < length; i++ )
                buff[ i ] = rsi[ i ];
            vec = _mm_load_ps( buff );
        }

        updateMax( vec, idx, aa, maxIdx );
    }

    // Reduce to scalar: first fold the upper 2 lanes onto the lower 2 lanes
    __m128 tmpMax = _mm_movehl_ps( aa, aa );
    __m128 tmpMaxIdx = _mm_movehl_ps( maxIdx, maxIdx );
    __m128 greater = _mm_cmpgt_ps( tmpMax, aa );
    aa = _mm_max_ps( tmpMax, aa );
    maxIdx = _mm_blendv_ps( maxIdx, tmpMaxIdx, greater );

    // SSE3 has 100% market penetration in 2022
    // Then fold lane 1 onto lane 0, using the just-updated aa / maxIdx vectors
    tmpMax = _mm_movehdup_ps( aa );
    tmpMaxIdx = _mm_movehdup_ps( maxIdx );
    greater = _mm_cmpgt_ss( tmpMax, aa );
    aa = _mm_max_ss( tmpMax, aa );
    maxIdx = _mm_blendv_ps( maxIdx, tmpMaxIdx, greater );

    index = (size_t)_mm_cvtsi128_si32( _mm_castps_si128( maxIdx ) );
    return _mm_cvtss_f32( aa );
}

As you see, pretty much everything is completely different. Not just the boilerplate about remainder and final reduction, the main loop is very different too.

SSE doesn’t have bitselect; blendvps is not quite that, it selects 32-bit lanes based on the high bit of the selector. Unlike NEON, SSE doesn’t have instructions for absolute value; it needs to be emulated with a bitwise andnot.

The final reduction is going to be completely different as well. NEON has very limited shuffles, but it has better horizontal operations, like vmaxvq_f32 which finds the horizontal maximum over the complete SIMD vector.

Soonts
  • I don't think the inner loop itself is *fundamentally* different. SIMD conditional select is the same thing whether it's bitwise or per-element when the selector is a compare result that's 0 / -1. The cleanup after the loop might well have to be dumbed down to something portable, maybe non-SIMD like maybe just store and scalar loop over an array of 8 elements. – Peter Cordes Jan 08 '22 at 13:42
  • (`vmaxvq_f32` won't tell you which index the max came from so it's not usable alone; did you have in mind doing `bcast_max == aa` to select matching FP elements, then AND that with `maxIdx` to find slots that had that maximum, then horizontal integer min to find the first occurrence of that max in the input array?) – Peter Cordes Jan 08 '22 at 13:45
  • @PeterCordes I can agree the difference is not fundamental, but it’s very substantial, still. Maybe it’s technically possible to autovectorize, but AFAIK C++ compilers which we have today don’t quite cut it. About final reduction for NEON — horizontal maximum, broadcast, FP32 compare for less than, bitwise OR with the found indices vector, and finally `vminvq_u32` for horizontal integer minimum, that’s 5 instructions versus 10 for SSE. – Soonts Jan 08 '22 at 14:16
  • Abs (for maximum comparison) can also be emulated by the proposed trick of reinterpreting the float as int32_t and shifting left by 1; also, interleaving the int32_t with an int32_t index (`_mm_unpack{lo|hi}_epi32`) to produce a `double` which can be compared with `_mm_max_pd` would be supported down to SSE2 -- that would be 6 instructions in both SSE2 and NEON. – Aki Suihkonen Jan 08 '22 at 14:17
  • @PeterCordes BTW, not only does NEON have instructions for absolute value, it also has sign-agnostic comparison instructions that only compare absolute values. These SIMD instruction sets are only similar on the surface; they have lots of differences at the lower level. – Soonts Jan 08 '22 at 14:35
  • @AkiSuihkonen It assumes there’s no INF or NAN values? – Soonts Jan 08 '22 at 14:35
  • Well, Inf and NaNs would be ordered so that NaN > Inf, but then OTOH the _abs_ must not shift the sign bit out. 0x7f800000 == Inf in float would be 0x7f80000000000000 in double, which would be just a large number. – Aki Suihkonen Jan 08 '22 at 14:37
  • Yeah, at best you're only using lowest-common-denominator functionality when trying to vectorize portably like this, without different strategies for each ISA. Nobody said it would be optimal, just much better than scalar as long as the size is large enough that a simplistic cleanup isn't a significant portion of the total time. – Peter Cordes Jan 08 '22 at 14:38
  • @PeterCordes I know nothing about the problem OP’s solving, but since they wrote “…that can compile efficiently”, I assumed they want a solution which ain’t too far from being optimal. – Soonts Jan 08 '22 at 14:41
  • How much better would a NEON-specific inner loop be? Like saving one AND instruction by using magnitude compares instead of having to abs() separately? It's not on the critical path for latency, although it's less latency to hide if we unroll to hide FP max latency. (Does NEON have max-abs as a single instruction, or would you have to blend on the compare result, making the dep chain longer if you wanted to avoid ever doing abs?) Anyway, doesn't seem too far from optimal for the strategy that involves FP, rather than shifting the bit-patterns. – Peter Cordes Jan 08 '22 at 14:50
  • With @Aki's idea of left-shift by 1 and then working in the integer domain, that could shorten the critical path latency if integer compare and blend based on that compare were less total latency than an FP max. (e.g. AVX-512 `vpcmpud` unsigned integer compare into mask is apparently 4 cycle latency on SKX, 3 cycle latency on ICL, and only 1/clock. But blend of maxval and maxidx is single cycle latency so a total of 4, matching `vmaxps` :/ Unlike 1 cycle, 2/clock for AVX2 compare into vector like `vpcmpgtd` which would be usable after masking away the sign bit. But vblendvps is 2 uops.) – Peter Cordes Jan 08 '22 at 14:57
  • @PeterCordes It might be slightly offtopic, but on my particular CPU (Zen 3), FP max or FP comparisons don’t have any latency, they complete in 1 cycle. – Soonts Jan 08 '22 at 15:26
  • Oh interesting, yeah apparently AMD since Zen1 at least has 1 cycle latency `maxps` according to https://uops.info/. (It's one cycle, not zero! Only an eliminated mov like `movaps xmm0, xmm1` has no latency.) I wonder if AMD is basically using integer comparators for min/max/cmp. And while I was looking, I noticed Alder Lake P-cores take 3 uops for `vblendvps` up from 2 on ICL, while the E-cores take 4 uops with 5 cycle latency for the VEX encoding, but only 1 uop for the legacy-SSE encoding on both P core and E core. – Peter Cordes Jan 08 '22 at 15:28
  • I did mean something not too far from optimal. If you look at the asm on godbolt (linked in Q), clang does get pretty close to optimal for AVX. Only `which()` is poor, but it can moved outside the loop, so that gives one the "not too far" part. It's really only `any()` on NEON that isn't working well. – TrentP Jan 09 '22 at 05:40