I am optimizing the "winner-take-all" portion of a disparity estimation algorithm using AVX2. My scalar routine is accurate, but at QVGA resolution and 48 disparities the runtime is disappointingly slow at ~14 ms on my laptop. I create both LR and RL disparity images, but for simplicity here I will only include code for the RL search.
My scalar routine:
int MAXCOST = 32000;
for (int i = maskRadius; i < rstep-maskRadius; i++) {
// WTA "RL" Search:
for (int j = maskRadius; j+maskRadius < cstep; j++) {
int minCost = MAXCOST;
int minDisp = 0;
for (int d = 0; d < numDisp && j+d < cstep; d++) {
if (asPtr[(i*numDisp*cstep)+(d*cstep)+j] < minCost) {
minCost = asPtr[(i*numDisp*cstep)+(d*cstep)+j];
minDisp = d;
}
}
dRPtr[(i*cstep)+j] = minDisp;
}
}
My attempt at using AVX2:
int MAXCOST = 32000;
int* dispVals = (int*) _mm_malloc( sizeof(int32_t)*16, 32 );
for (int i = maskRadius; i < rstep-maskRadius; i++) {
// WTA "RL" Search AVX2:
for( int j = 0; j < cstep-16; j+=16) {
__m256i minCosts = _mm256_set1_epi16( MAXCOST );
__m128i loMask = _mm_setzero_si128();
__m128i hiMask = _mm_setzero_si128();
for (int d = 0; d < numDisp && j+d < cstep; d++) {
// Grab 16 costs to compare
__m256i costs = _mm256_loadu_si256((__m256i*) (asPtr[(i*numDisp*cstep)+(d*cstep)+j]));
// Get the new minimums
__m256i newMinCosts = _mm256_min_epu16( minCosts, costs );
// Compare new mins to old to build mask to store minDisps
__m256i mask = _mm256_cmpgt_epi16( minCosts, newMinCosts );
__m128i loMask = _mm256_extracti128_si256( mask, 0 );
__m128i hiMask = _mm256_extracti128_si256( mask, 1 );
// Sign extend to 32bits
__m256i loMask32 = _mm256_cvtepi16_epi32( loMask );
__m256i hiMask32 = _mm256_cvtepi16_epi32( hiMask );
__m256i currentDisp = _mm256_set1_epi32( d );
// store min disps with mask
_mm256_maskstore_epi32( dispVals, loMask32, currentDisp ); // RT error, why?
_mm256_maskstore_epi32( dispVals+8, hiMask32, currentDisp ); // RT error, why?
// Set minCosts to newMinCosts
minCosts = newMinCosts;
}
// Write the WTA minimums one-by-one to the RL disparity image
int index = (i*cstep)+j;
for( int k = 0; k < 16; k++ ) {
dRPtr[index+k] = dispVals[k];
}
}
}
_mm_free( dispVals );
The Disparity Space Image (DSI) is of size HxWxD (320x240x48), which I lay out horizontally for better memory accesses, such that each row is of size WxD.
The Disparity Space Image has per-pixel matching costs. This aggregated with a simple box filter to make another image of the exact same size, but with costs summed over, say, a 3x3 or 5x5 window. This smoothing makes the result more 'robust'. When I am accessing with asPtr, I am indexing into this aggregated costs image.
Also, in an effort to save on unnecessary computation, I have been starting and ending on rows offset by a mask radius. This mask radius is the radius of my census mask. I could be doing some fancy border reflection, but it is simpler and faster just to not bother with the disparity for this border. This of course applies to the beginning and ending cols too, but messing with indexing here is not good when I am forcing my entire algorithm to run only on images whose columns are a multiple of 16 (ex. QVGA: 320x240) so that I can index simply and hit everything with SIMD (no residual scalar processing).
Also, if you think my code is a mess, I encourage you to check out the the highly optimized OpenCV stereo algorithms. I find them impossible and have been able to make little to no use of them.
My code compiles but fails at runtime. I am using VS 2012 Express Update 4. When I run with the debugger I am unable to gain any insights. I am relatively new to using intrinsics and so I am not sure what information I should expect to see when debugging, number of registers, whether __m256i variables should be visible, etc.
Heeding comment advice below, I improved the scalar time from ~14 to ~8 by using smarter indexing. My CPU is an i7-4980HQ and I successfully use AVX2 intrinsics elsewhere in the same file.