AVX2 Winner-Take-All Disparity Search

Question

I am optimizing the "winner-take-all" portion of a disparity estimation algorithm using AVX2. My scalar routine is accurate, but at QVGA resolution and 48 disparities the runtime is disappointingly slow at ~14 ms on my laptop. I create both LR and RL disparity images, but for simplicity here I will only include code for the RL search.

My scalar routine:

int MAXCOST = 32000;
for (int i = maskRadius; i < rstep-maskRadius; i++) {

    // WTA "RL" Search:
    for (int j = maskRadius; j+maskRadius < cstep; j++) {
        int minCost = MAXCOST;
        int minDisp = 0;
        for (int d = 0; d < numDisp && j+d < cstep; d++) {
            if (asPtr[(i*numDisp*cstep)+(d*cstep)+j] < minCost) {
                minCost = asPtr[(i*numDisp*cstep)+(d*cstep)+j];
                minDisp = d;
            }
        }
        dRPtr[(i*cstep)+j] = minDisp;
    }
}

My attempt at using AVX2:

int MAXCOST = 32000;
int* dispVals = (int*) _mm_malloc( sizeof(int32_t)*16, 32 );

for (int i = maskRadius; i < rstep-maskRadius; i++) {

    // WTA "RL" Search AVX2:
    for( int j = 0; j < cstep-16; j+=16) {

        __m256i minCosts = _mm256_set1_epi16( MAXCOST );
        __m128i loMask   = _mm_setzero_si128();
        __m128i hiMask   = _mm_setzero_si128();

        for (int d = 0; d < numDisp && j+d < cstep; d++) {
            // Grab 16 costs to compare
            __m256i costs = _mm256_loadu_si256((__m256i*) (asPtr[(i*numDisp*cstep)+(d*cstep)+j]));

            // Get the new minimums
            __m256i newMinCosts = _mm256_min_epu16( minCosts, costs );

            // Compare new mins to old to build mask to store minDisps
            __m256i mask   = _mm256_cmpgt_epi16( minCosts, newMinCosts );
            __m128i loMask = _mm256_extracti128_si256( mask, 0 );
            __m128i hiMask = _mm256_extracti128_si256( mask, 1 );
            // Sign extend to 32bits
            __m256i loMask32 = _mm256_cvtepi16_epi32( loMask );
            __m256i hiMask32 = _mm256_cvtepi16_epi32( hiMask );

            __m256i currentDisp = _mm256_set1_epi32( d );
            // store min disps with mask
            _mm256_maskstore_epi32( dispVals, loMask32, currentDisp );    // RT error, why?
            _mm256_maskstore_epi32( dispVals+8, hiMask32, currentDisp );  // RT error, why?

            // Set minCosts to newMinCosts
            minCosts = newMinCosts;
        }

        // Write the WTA minimums one-by-one to the RL disparity image
        int index = (i*cstep)+j;
        for( int k = 0; k < 16; k++ ) {
            dRPtr[index+k] = dispVals[k];
        }
    }
}
_mm_free( dispVals );

The Disparity Space Image (DSI) is of size HxWxD (320x240x48), which I lay out horizontally for better memory accesses, such that each row is of size WxD.

The Disparity Space Image has per-pixel matching costs. This aggregated with a simple box filter to make another image of the exact same size, but with costs summed over, say, a 3x3 or 5x5 window. This smoothing makes the result more 'robust'. When I am accessing with asPtr, I am indexing into this aggregated costs image.

Also, in an effort to save on unnecessary computation, I have been starting and ending on rows offset by a mask radius. This mask radius is the radius of my census mask. I could be doing some fancy border reflection, but it is simpler and faster just to not bother with the disparity for this border. This of course applies to the beginning and ending cols too, but messing with indexing here is not good when I am forcing my entire algorithm to run only on images whose columns are a multiple of 16 (ex. QVGA: 320x240) so that I can index simply and hit everything with SIMD (no residual scalar processing).

Also, if you think my code is a mess, I encourage you to check out the the highly optimized OpenCV stereo algorithms. I find them impossible and have been able to make little to no use of them.

My code compiles but fails at runtime. I am using VS 2012 Express Update 4. When I run with the debugger I am unable to gain any insights. I am relatively new to using intrinsics and so I am not sure what information I should expect to see when debugging, number of registers, whether __m256i variables should be visible, etc.

Heeding comment advice below, I improved the scalar time from ~14 to ~8 by using smarter indexing. My CPU is an i7-4980HQ and I successfully use AVX2 intrinsics elsewhere in the same file.

Info Image

Are you sure this loop is correct: `for( int j = 0; j < cstep-16; j+=16)` - it looks like you might be missing out the last 16 values ? — Paul R, May 21 '15 at 06:10
to speed things up, especially in the inner most loop, the repeated expressions should be extracted and only performed once. For instance, the d * cstep expression only needs to be performed once in each inner loop, similarly with i * numDisp * cstep — user3629249, May 22 '15 at 21:21
Not that it matters now that you have a working version, but I would have tried to figure out what exactly was causing the runtime error by looking at the disassembly, and the state of registers when the error was triggered. So you could tell if the pointer being used for `_mm256_maskstore_epi32` was correct, and the memory pointed to is accessable via the debugger. Probably it was some problem with either a wrong pointer (C pointer math giving unexpected results?) or a malloc failure. "Runtime error" could be anything, from illegal instruction to bad pointer, so it's not that useful. — Peter Cordes, Jun 12 '15 at 21:17

Peter Cordes · Accepted Answer · 2015-07-25T02:43:34.250

I still haven't found the problem, but I did see some things you might want to change. You're not checking the return value of _mm_malloc, though. If it's failing, that would explain it. (Maybe it doesn't like allocating 32-byte aligned memory?)

If you're running your code under a memory checker or something, then maybe it doesn't like reading from uninitialized memory for dispVals. (_mm256_maskstore_epi32 may count as a read-modify-write even if the mask is all-ones.)

Run your code under a debugger and find out what's going wrong. "runtime error" is not very meaningful.

_mm_set1* functions are slow-ish. VPBROADCASTD needs its source in memory or a vector reg, not a GP reg, so the compiler can either movd from a GP reg to a vector reg and then broadcast, or store to memory and then broadcast. Anyway, it would be faster to do

const __m256i add1 = _mm256_set1_epi32( 1 );
__m256i dvec = _mm256_setzero_si256();
for (d;d...;d++) {
    dvec = _mm256_add_epi32(dvec, add1);
}

Other stuff:
This will probably run faster if you aren't storing to memory every iteration of the inner loop. Use a blend instruction (_mm256_blendv_epi8), or something like that, to update the vector(s) of displacements that go with the min costs. Blend = masked move with a register destination.

Also, your displacement values should fit in 16b integers, so don't sign-extend them to 32b until AFTER you're done finding them. Intel CPUs can sign-extend a 16b memory location into gp register on the fly with no speed penalty (movsz is as fast as mov), so prob. just declare your dRPtr array as uint16_t. Then you don't need the sign-extending stuff in your vector code at all (let alone in your inner loop!). Hopefully _mm256_extracti128_si256( mask, 0 ) compiles to nothing, since the 128 you want is already the low128, so just use the reg as the src for vmovsx, but still.

You can also save an instruction (and a fused-domain uop) by not loading first. (unless the compiler is smart enough not to elide the vmovdqu and use vpminuw with a memory operand, even though you used the load intrinsic).

So I'm thinking something like this:

// totally untested, didn't even check that this compiles.
for(i) { for(j) {
// inner loop, compiler can hoist these constants.
const __m256i add1 = _mm256_set1_epi16( 1 );
__m256i dvec = _mm256_setzero_si256();
__m256i minCosts = _mm256_set1_epi16( MAXCOST );
__m256i minDisps = _mm256_setzero_si256();

for (int d=0 ; d < numDisp && j+d < cstep ;
     d++, dvec = _mm256_add_epi16(dvec, add1))
{
    __m256i newMinCosts = _mm256_min_epu16( minCosts, asPtr[(i*numDisp*cstep)+(d*cstep)+j]) );
    __m256i mask   = _mm256_cmpgt_epi16( minCosts, newMinCosts );
    minDisps = _mm256_blendv_epi8(minDisps, dvec, mask); // 2 uops, latency=2
    minCosts = newMinCosts;
}

// put sign extension here if making dRPtr uint16_t isn't an option.
int index = (i*cstep)+j;
_mm256_storeu_si256 (dRPtr + index, __m256i minDisps);
}}

You might get better performance having two parallel dependency chains: minCosts0 / minDisps0, and minCosts1 / minDisps1, and then combining them at the end. minDisps is a loop-carried dependency, but the loop only has 5 instructions (including the vpadd, which looks like loop overhead but can't be reduced by unrolling). They decode to 6 uops (blendv is 2), plus loop overhead. It should run in 1.5cycles / iteration (not counting loop overhead) on haswell, but the dep chain would limit it to one iteration per 2 cycles. (Assuming unrolling to get rid of loop overhead). Doing two dep chains in parallel fixes this, and has the same effect as unrolling the loop: less loop overhead.

Hmm, actually on Haswell,

pminuw can run on p1/p5. (and the load part on p2/p3)
pcmpgtw can run on p1/p5
vpblendvb is 2 uops for p5.
padduw can run on p1/p5
movdqa reg,reg can run on p0/p1/p5 (and may not need an execution unit at all). Unrolling should get rid of any overhead for minCosts = newMinCosts, since the compiler can just end up with newMinCosts from the last unrolled loop body in the right register for the first loop body of the next iteration.
fused sub / jge (loop counter) can run on p6. (using PTEST + jcc on dvec would be slower). add/sub can run on p0/p1/p5/p6 when not fused with a jcc.

Ok, so actually the loop will take 2.5 cycles per iteration, limited by instructions that can only run on p1/p5. Unrolling by 2 or 4 will reduce the loop / movdqa overhead. Since Haswell can issue 4 uops per clock, it can then more efficiently queue up uops for out-of-order execution, since the loop won't have a super-high number of iterations. (48 was your example.) Having lots of uops queued up will give the CPU something to do after leaving the loop, and hide any latencies from cache misses, etc.

_mm256_min_epu16 (PMINUW) is another loop-carried dependency chain. Using it with a memory operand makes it a 3 or 4-cycle latency. However, the load part of the instruction can start as soon as the address is known, so folding a load into a modify op to take advantage of micro-fusion doesn't make dep chains any longer or shorter than using a separate load.

Sometimes you need to use a separate load, for unaligned data (AVX removed the alignment requirement for memory operands). We're limited more by execution units than the 4 uop / clock issue limit, so it's probably fine to use a dedicated load instruction.

source for insn ports / latencies.

Thank you so much! Went from ~14ms to ~1ms. Huge savings. Never did find the error with my original code. — Der Luftmensch, Jun 12 '15 at 01:58
Thanks for editting your question to describe what your code is actually doing. I figured it was indexing into some kind of cost array, but having the details there might make it useful for someone else. Glad I could help with tightening up the inner loop. — Peter Cordes, Jun 12 '15 at 21:07

score 2 · Answer 2 · answered Jun 08 '15 at 08:36

2

Before you go and do platform specific optimizations, there are plenty of portable optimizations that could be performed. Extract loop invariants, convert index multiplies to increment additions, etc...

This may not be exact, but gets the general idea across:

int MAXCOST = 32000, numDispXcstep = numDisp*cstep;
for (int i = maskRadius; i < rstep - maskRadius; i+=numDispXcstep) {
    for (int j = maskRadius; j < cstep - maskRadius; j++) {
        int minCost = MAXCOST, minDisp = 0;
        for (int d = 0; d < numDispXcstep - j; d+=cstep) {
            if (asPtr[i+j+d] < minCost) {
                minCost = asPtr[i+j+d];
                minDisp = d;
            }
        }
        dRPtr[i/numDisp+j] = minDisp;
    }
}

Once you have done this it becomes apparent what is actually occurring. It looks like "i" is the largest step, followed by "d" with "j" actually being the variable that operates on sequential data. ... the next step would be to reorder the loops accordingly and if you still need further optimizations, apply platform specific intrinsics.

answered Jun 08 '15 at 08:36

technosaurus

7,676
1
30
52

Hoisting loop invariant expressions out of the loop certainly makes the code easier to read in this case, but often the compiler can do that for you. If the expressions involve anything other than local variables, though, the compiler can't assume they don't change. Anyway, the array index was a real mess before. I didn't spend much time trying to unravel it for my answer, beyond checking that it was vectorized over consecutive source data. – Peter Cordes Jun 10 '15 at 05:24
@technosaurus I improved the indexing and got a big boost in speed, and will edit the question to show this when I have another free moment. Given that the data is an Hx(WxD) array, such that the "images" are side-by-side instead of stacked, I think it makes sense to go i->j->d and not save the minCosts to a temporary array of size W (for all j). This way the temp data never leave registers. Otherwise, absolutely, indexing i->d->j is more appropriate and is what I do when computing the costs and saving them to the DSI. I plan to try it out i->d->j on the scalar routine and report back. – Der Luftmensch Jun 10 '15 at 14:54
@DerLuftmensch My reasoning for keeping the sequential data together was that intrinsic ops tend to be better optimized ... If you look into their typedefs and your data alignment, it may become obvious that you can simply do a cast as opposed to using the intrinsic functions for initializing variable. – technosaurus Jun 12 '15 at 18:03

AVX2 Winner-Take-All Disparity Search

2 Answers2