I've come across a piece of code with the following "kernel" as its performance bottleneck. Since I have access to the latest Intel(R) Xeon Phi(TM) CPU 7210 (KNL), I would like to speed it up with AVX512 intrinsics.
for( int y = starty; y <= endy; y++)
{
// here data[][] is "unsigned char" while result[] is "int"
for( int x = startx; x <= endx; x++)
{
if( (data[y][x]&0x1) == 0 )
result[x] += data[y][x];
}
}
After analyzing the code's behavior, I found that the inner loop's length is mostly less than 16, so I wrote the following:
register int xlen = endx - startx + 1;
__m512i zero5 = _mm512_setzero_si512();
__m256i zero2 = _mm512_castsi512_si256(zero5);
__m128i zero1 = _mm512_castsi512_si128(zero5);
__m256i mask2 = _mm256_set1_epi8(0x1);
__m128i mask1 = _mm256_castsi256_si128(mask2);
register __m512i psprof0 = zero5;
for( int i = 0; i < (16-xlen)&(~0x1); i += 2 ) mask1 = _mm_srli_si128(mask1, 2);
if( (16-xlen)&(0x1) ) mask1 = _mm_srli_si128(mask1, 1);
#pragma vector nontemporal
#pragma prefetch data
for( int y = starty; y <= endy; y++ )
{
__m128i pixel16 = _mm_loadu_si128((__m128i*)&data[y][startx]);
// if ( _mm_testc_si128(pixel16, mask1) ) continue;
__m128i mask16 = _mm_andnot_si128(pixel16, mask1);
__m128i pixel16n = _mm_sign_epi8(pixel16, mask16);
psprof0 = _mm512_add_epi32(psprof0, _mm512_cvtepu8_epi32(pixel16n));
}
_mm512_storeu_si512(&result[startx], psprof0);
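Note that I assume result[] is already zero over [startx, startx+15] here; the scalar version does result[x] +=, so if that assumption does not hold, the final store would have to accumulate into the existing values instead, for example:
_mm512_storeu_si512(&result[startx],
                    _mm512_add_epi32(_mm512_loadu_si512(&result[startx]), psprof0));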
A few questions here:
- Since _mm_srli_si128 does not accept a non-immediate shift count, I have to use a loop there; is there any way to eliminate it? (One idea is sketched after this list.)
- _mm_testc_si128(pixel16, mask1) mostly does not help with performance, which of course is due to the distribution of data[][]. However, since it "computes the bitwise NOT of a and then ANDs with b, and sets CF to 1 if the result is zero, otherwise sets CF to 0", is there any way to get the result of that ANDNOT so that I do not need to compute _mm_andnot_si128 again? (One possibility is sketched as the second snippet after this list.)
- Since the inner loop length is mostly less than 16, it might not be well suited to AVX512. However, would it be worthwhile to unroll the y loop by 2, loading data[y][x] and data[y+1][x] and combining them into one __m256i? Since 8-bit to 16-bit integer conversion is not yet available on KNL (it requires AVX512BW), this might be more frustrating than the current version.
- In general, any recommendations/suggestions to enhance the performance of this small segment of code on KNL are highly appreciated :) (It is already inside an OpenMP parallel region, so further threading is probably not an option right now.)
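For the first question, one alternative I am considering (just a sketch; mask_table is a name I made up, and it assumes 1 <= xlen <= 16) is to replace the shift loop with a single unaligned load from a sliding window over a constant byte table:

static const unsigned char mask_table[32] = {
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
};
// the first xlen bytes of the load are 0x1, the remaining 16-xlen bytes are 0
__m128i mask1 = _mm_loadu_si128((const __m128i*)&mask_table[16 - xlen]);

For the second question, the only way I can see to reuse the ANDNOT result is to compute it first and then test it for all-zero with _mm_testz_si128 (again just a sketch, not measured):

__m128i mask16 = _mm_andnot_si128(pixel16, mask1);
if( _mm_testz_si128(mask16, mask16) ) continue; // every pixel in range has its LSB set
__m128i pixel16n = _mm_sign_epi8(pixel16, mask16);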
The code for point 3 above (here psprof0 is a __m256i accumulator of epi16 partial sums):
static inline __m256i vec_256_combine_128(__m128i a, __m128i b)
{
// combine two __m128i into one __m256i
return _mm256_insertf128_si256(_mm256_castsi128_si256(a), b, 1);
}
static inline __m128i vec_256_add_128(__m256i a)
{
// add the lower and upper 128-bit halves of a __m256i holding epi16 elements
return _mm_add_epi16(_mm256_castsi256_si128(a), _mm256_extracti128_si256(a, 1));
}
__m256i mask256 = vec_256_combine_128(mask1, mask1); // the shifted 0x1 mask, duplicated into both 128-bit halves
for( int y = starty; y <= endy; y += 2 )
{
__m128i pixel16a = _mm_loadu_si128((__m128i*)&data[y][startx]);
__m128i pixel16b = _mm_loadu_si128((__m128i*)&data[y+1][startx]);
if ( y == endy )
pixel16b = zero1;
__m256i pixel16 = vec_256_combine_128(pixel16a, pixel16b);
if ( _mm256_testc_si256(pixel16, mask256) ) continue;
__m256i mask16 = _mm256_andnot_si256(pixel16, mask256);
__m256i pixel16n = _mm256_sign_epi8(pixel16, mask16);
__m256i pixel16lo = _mm256_unpacklo_epi8(pixel16n, zero2);
__m256i pixel16hi = _mm256_unpackhi_epi8(pixel16n, zero2);
psprof0 = _mm256_add_epi16(psprof0, vec_256_combine_128(vec_256_add_128(pixel16lo), vec_256_add_128(pixel16hi)));
}
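To finish this version I would widen the sixteen epi16 partial sums to epi32 and store them, roughly as below (this assumes the per-column sums never overflow 16 bits; _mm512_cvtepu16_epi32 is AVX512F, so it should be available on KNL):

_mm512_storeu_si512(&result[startx], _mm512_cvtepu16_epi32(psprof0));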