
I am converting MMX/SSE code to equivalent C code. I have almost finished the conversion, but the image quality I get is not right; I can see some noise in the image. I have been debugging the code for the last 5 days but cannot find the reason. I would be very happy if you could look into the issue and help me out.

ORIGINAL SSE CODE:

void unpack_8bit_to_16bit( const __m128i &a, __m128i &b0, __m128i &b1 ) 
{
    __m128i zero = _mm_setzero_si128();
    b0 = _mm_unpacklo_epi8( a, zero );
    b1 = _mm_unpackhi_epi8( a, zero );
}

void convolve_cols_3x3( const unsigned char* in, int16_t* out_v, int16_t* out_h, int w, int h )
{
    using namespace std;
    assert( w % 16 == 0 && "width must be multiple of 16!" );
    const int w_chunk  = w/16;
    __m128i*    i0       = (__m128i*)( in );
    __m128i*    i1       = (__m128i*)( in ) + w_chunk*1;
    __m128i*    i2       = (__m128i*)( in ) + w_chunk*2;
    __m128i* result_h  = (__m128i*)( out_h ) + 2*w_chunk;
    __m128i* result_v  = (__m128i*)( out_v ) + 2*w_chunk;
    __m128i* end_input = (__m128i*)( in ) + w_chunk*h;

    for( ; i2 != end_input; i0++, i1++, i2++, result_v+=2, result_h+=2 ) 
    {
        *result_h     = _mm_setzero_si128();
        *(result_h+1) = _mm_setzero_si128();
        *result_v     = _mm_setzero_si128();
        *(result_v+1) = _mm_setzero_si128();
        __m128i ilo, ihi;
        unpack_8bit_to_16bit( *i0, ihi, ilo ); 
        *result_h     = _mm_add_epi16( ihi, *result_h );
        *(result_h+1) = _mm_add_epi16( ilo, *(result_h+1) );
        *result_v     = _mm_add_epi16( *result_v, ihi );
        *(result_v+1) = _mm_add_epi16( *(result_v+1), ilo );
        unpack_8bit_to_16bit( *i1, ihi, ilo );
        *result_v     = _mm_add_epi16( *result_v, ihi );
        *(result_v+1) = _mm_add_epi16( *(result_v+1), ilo );
        *result_v     = _mm_add_epi16( *result_v, ihi );
        *(result_v+1) = _mm_add_epi16( *(result_v+1), ilo );
        unpack_8bit_to_16bit( *i2, ihi, ilo );
        *result_h     = _mm_sub_epi16( *result_h, ihi );
        *(result_h+1) = _mm_sub_epi16( *(result_h+1), ilo );
        *result_v     = _mm_add_epi16( *result_v, ihi );
        *(result_v+1) = _mm_add_epi16( *(result_v+1), ilo );
    }
}
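For clarity, a scalar sketch of what unpack_8bit_to_16bit computes: `_mm_unpacklo_epi8(a, zero)` interleaves the low 8 bytes of `a` with zero bytes, which amounts to zero-extending bytes 0..7 to 16-bit values, and `_mm_unpackhi_epi8` does the same for bytes 8..15 (the `_scalar` function name here is just for illustration):

```cpp
#include <cstdint>

// Scalar equivalent of unpack_8bit_to_16bit: zero-extend the 16 input
// bytes to 16-bit values; b0 receives bytes 0..7 (unpacklo with zero),
// b1 receives bytes 8..15 (unpackhi with zero).
void unpack_8bit_to_16bit_scalar(const uint8_t a[16], int16_t b0[8], int16_t b1[8])
{
    for (int i = 0; i < 8; i++) {
        b0[i] = (int16_t)a[i];      // low half
        b1[i] = (int16_t)a[i + 8];  // high half
    }
}
```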

The code that I converted is given below

void convolve_cols_3x3( const unsigned char* in, int16_t* out_v, int16_t* out_h, int w, int h )
{
    using namespace std;
    assert( w % 16 == 0 && "width must be multiple of 16!" );
    const int w_chunk  = w/16;

    uint8_t*    i0       = (uint8_t*)( in );
    uint8_t*    i1       = (uint8_t*)( in ) + w_chunk*1*16;
    uint8_t*    i2       = (uint8_t*)( in ) + w_chunk*2*16;
    int16_t* result_h  = (int16_t*)( out_h ) + 2*w_chunk*16;
    int16_t* result_v  = (int16_t*)( out_v ) + 2*w_chunk*16;
    uint8_t* end_input = (uint8_t*)( in ) + w_chunk*h*16;
    for( ; i2 != end_input; i0+= 16, i1+= 16, i2+= 16, result_v+= 16, result_h+= 16 ) 
    {
        for (int i=0; i<8;i++)
        {
            result_h[i]     = 0;
            result_h[i + 8] = 0;
            result_v[i]     = 0;
            result_v[i + 8] = 0;
            result_h[i]     = (int16_t)(i0[i]) + result_h[i] ;
            result_h[i + 8] = (int16_t)(i0[i + 8]) + result_h[i + 8] ;
            result_v[i]     = (int16_t)(i0[i]) + result_v[i] ;
            result_v[i + 8] = (int16_t)(i0[i + 8]) + result_v[i + 8] ;
            result_v[i]     = (int16_t)(i1[i]) + result_v[i] ;
            result_v[i + 8] = (int16_t)(i1[i + 8]) + result_v[i + 8] ;
            result_v[i]     = (int16_t)(i1[i]) + result_v[i] ;
            result_v[i + 8] = (int16_t)(i1[i + 8]) + result_v[i + 8] ;
            result_h[i]     = result_h[i] - (int16_t)(i2[i]);
            result_h[i + 8] = result_h[i + 8] - (int16_t)(i2[i + 8]);
            result_v[i]     = (int16_t)(i2[i]) + result_v[i] ;
            result_v[i + 8] = (int16_t)(i2[i + 8]) + result_v[i + 8] ;
        }
    }
}

Sorry if the code is not that readable. w and h represent width and height. out_h and out_v are two output parameters that are later used for other purposes.
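One way to hunt down such a conversion bug is to compare against a naive per-pixel reference. Judging from the SSE code, each iteration computes v = top + 2·mid + bottom and h = top − bottom over a column of three rows. Note also the output offset: the original advances `out_h` by `2*w_chunk` `__m128i` values, which is `2*(w/16)*8 = w` int16 elements, i.e. one row, whereas the converted code's `2*w_chunk*16` int16 elements is two rows, which may be worth checking. A sketch of such a reference (the `_ref` function name is hypothetical), writing the result for rows (y, y+1, y+2) at row y+1 as the SSE offsets imply:

```cpp
#include <cstdint>

// Naive per-pixel reference for the column pass, assuming the kernels
// implied by the SSE code: v = top + 2*mid + bottom, h = top - bottom.
// Results for input rows (y, y+1, y+2) are written at output row y+1,
// matching the one-row offset of the original's result pointers.
void convolve_cols_3x3_ref(const uint8_t* in, int16_t* out_v, int16_t* out_h, int w, int h)
{
    for (int y = 0; y + 2 < h; y++) {
        for (int x = 0; x < w; x++) {
            int16_t top = in[ y      * w + x];
            int16_t mid = in[(y + 1) * w + x];
            int16_t bot = in[(y + 2) * w + x];
            out_v[(y + 1) * w + x] = top + 2 * mid + bot;
            out_h[(y + 1) * w + x] = top - bot;
        }
    }
}
```

Running both versions on the same small test image and diffing the outputs element by element should pinpoint where they diverge.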

Christian Rau
  • There seems to be at least one bug in the SSE code: `ilo` and `ihi` are not initialised before being used. Or did you accidentally delete an `unpack_8bit_to_16bit` line? – Paul R Oct 03 '12 at 15:05
  • @Paul: ihi and ilo are passed by reference to the unpack_8bit_to_16bit(---) function, so they get initialized there in the original SSE code... and in the converted code I have directly used i0, i1 and i2 to minimize function calls... – user1717323 Oct 03 '12 at 16:20
  • Look at the code again - specifically these two lines: `__m128i ilo, ihi; *result_h = _mm_add_epi16( ihi, *result_h );` – Paul R Oct 03 '12 at 16:57
  • I am very sorry Paul... there is one line after `__m128i ilo, ihi;`. The line is `unpack_8bit_to_16bit( *i0, ihi, ilo );` – user1717323 Oct 03 '12 at 17:05

1 Answer


The bugs appear to be in your pointer math and in how you read the source data. The pointer variables i0, i1 and i2 are unsigned char pointers. Lines like this in your code:

 result_h[i + 8] = (int16_t)(i0[i + 8]) + result_h[i + 8] ;

Should be this:

result_h[i + 8] = (int16_t)(i0[i*2 + 16]) + result_h[i + 8] ;

The cast to int16_t doesn't affect the offset within the square brackets of i0. You're working with 16-byte structures (__m128i), but accessing them at 8-byte offsets. You're also only using the lower 8 bits of the integers pointed to by i0 and i1. In the original SSE code you're reading 16-bit integers. The final corrected code might look like this if you need to read 16-bit integers before the addition:

result_h[i + 8] = *(int16_t *)(&i0[i*2 + 16]) + result_h[i + 8];
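To illustrate the point about the cast: casting the *loaded value* does not change how many bytes are read or at what offset, whereas reinterpreting the *pointer* does. A small sketch (the helper names are just for illustration; the 16-bit read uses memcpy to stay well-defined under aliasing rules):

```cpp
#include <cstdint>
#include <cstring>

// Casting the value: reads ONE byte at offset i, then zero-extends it.
// The (int16_t) cast has no effect on the address arithmetic.
int16_t read_byte_extended(const uint8_t* p, int i)
{
    return (int16_t)p[i];
}

// Reinterpreting the pointed-to data: reads TWO bytes starting at byte
// offset i*2 (byte order follows the host endianness).
int16_t read_16bit(const uint8_t* p, int i)
{
    int16_t v;
    std::memcpy(&v, p + i * 2, sizeof v);
    return v;
}
```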
BitBank