
I'm trying to optimize the rotation of very large images; the smallest is 4096x4096, or ~16 million pixels.

Rotation is always about the center of the image. Images are not necessarily square, but their dimensions are always powers of 2.

I have access to MKL and TBB; MKL is an optimized BLAS for my target platforms, though I'm not sure whether this operation exists in BLAS at all.

My best attempts so far run around 17-25 ms for 4096x4096 images (very inconsistent for the same image size, which suggests I'm probably stomping all over the cache). Matrices are 16-byte aligned.

Now, the destination cannot be resized, so clipping can and should occur. For instance, a square matrix rotated 45 degrees will certainly clip at the corners, and the values at those locations should be zero.
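One consequence worth spelling out: the kernel below only writes destination pixels whose source location lands inside the image, so the destination buffer has to be cleared up front for the clipped corners to come out zero. A minimal sketch, assuming pDestData points at rowSpan * colSpan fcomplex values:

//- clipped pixels are never written by the rotation kernel, so zero the
//- destination first; fcomplex is two floats, and an all-zero byte pattern
//- is a valid (0 + 0i) value
memset( pDestData, 0, sizeof(fcomplex) * rowSpan * colSpan );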

Right now, my best attempts use a tiled approach; no care has yet gone into tile sizes or loop unrolling.

Here is my algorithm as it stands using TBB - http://threadingbuildingblocks.org/:

//- cosa = cos of the angle
//- sina = sin of the angle
//- for those unfamiliar with TBB, this is giving me blocks of rows/cols that
//- are unique per thread
void operator() ( const tbb::blocked_range2d<size_t, size_t> &r ) const
{
    double xOffset;
    double yOffset;
    int lineOffset;

    int srcX;
    int srcY;

    for ( size_t row = r.rows().begin(); row != r.rows().end(); ++row )
    {
        const size_t colBegin = r.cols().begin();

        //- source coordinates for the first column of this row; stepping
        //- along the row then costs only one add per component
        xOffset = -(row * sina) + xHelper + (cosa * colBegin);
        yOffset =  (row * cosa) + yHelper + (sina * colBegin);
        lineOffset = ( row * rowSpan );  //- all col values are offsets of this row

        for( size_t col = colBegin; col != r.cols().end(); ++col, xOffset += cosa, yOffset += sina )
        {
            srcX = xOffset;  //- truncation; the +.5 in the helpers makes this a round
            srcY = yOffset;

            //- only copy pixels whose source lies inside the image; clipped
            //- pixels are left untouched (hence the pre-zeroed destination)
            if( srcX >= 0 && srcX < colSpan && srcY >= 0 && srcY < rowSpan )
            {
                destData[col + lineOffset] = srcData[srcX + ( srcY * rowSpan )];
            }
        }
    }
}

I invoke it like this:

double sina = sin(angle);
double cosa = cos(angle);
double centerX = colSpan / 2.0;  //- divide as doubles (the spans are even anyway)
double centerY = rowSpan / 2.0;

//- Adding .5 for rounding
const double xHelper = centerX - (centerX * cosa) + (centerY * sina) + .5;
const double yHelper = centerY - (centerX * sina) - (centerY * cosa) + .5;

tbb::parallel_for( tbb::blocked_range2d<size_t, size_t>( 0, rowSpan, 0, colSpan ),
                   DoRotate( sina, cosa, xHelper, yHelper, rowSpan, colSpan,
                             (fcomplex *)pDestData, (fcomplex *)pSrcData ) );
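For anyone reconstructing the math: this is a backward mapping. Each destination pixel (col, row) is mapped back into the source image, so every destination pixel is visited exactly once and no holes appear. Expanding xHelper and yHelper, the inner loop computes

    srcX = cosa * (col - centerX) - sina * (row - centerY) + centerX + 0.5
    srcY = sina * (col - centerX) + cosa * (row - centerY) + centerY + 0.5

i.e. the destination coordinates rotated about the center, with the constant terms hoisted into the helpers so that stepping along a row costs only one add of cosa/sina per component.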

fcomplex is just an in-house representation of a complex number. It is defined as:

struct fcomplex
{
    float real;
    float imag;
};

So: I want to rotate a matrix of complex values about its center at an arbitrary angle, for very large images, as fast as possible.

Update:

Based on the wonderful feedback, I've updated to the version below, which is about a 40% speedup. I'm wondering, though, whether anything else can be done:

void operator() ( const tbb::blocked_range2d<size_t, size_t> &r ) const
{
    float xOffset;
    float yOffset;
    int lineOffset;

    __m128i srcXints;
    __m128i srcYints;

    __m128 dupXOffset;
    __m128 dupYOffset;

    for ( size_t row = r.rows().begin(); row != r.rows().end(); ++row )
    {
        const size_t colBegin = r.cols().begin();
        xOffset = -(row * sina) + xHelper + (cosa * colBegin);
        yOffset =  (row * cosa) + yHelper + (sina * colBegin);
        lineOffset = ( row * rowSpan );  //- all col values are offsets of this row

        //- 4 columns per iteration; assumes range widths are multiples of 4
        for( size_t col = colBegin; col < r.cols().end(); col += 4, xOffset += (4 * cosa), yOffset += (4 * sina) )
        {
            dupXOffset = _mm_load1_ps(&xOffset); //- duplicate the x offset 4 times into a 4 float field
            dupYOffset = _mm_load1_ps(&yOffset); //- duplicate the y offset 4 times into a 4 float field

            //- truncate rather than round: the +.5 baked into the helpers already rounds
            srcXints = _mm_cvttps_epi32( _mm_add_ps( dupOffsetsX, dupXOffset ) );
            srcYints = _mm_cvttps_epi32( _mm_add_ps( dupOffsetsY, dupYOffset ) );

            if( srcXints.m128i_i32[0] >= 0 && srcXints.m128i_i32[0] < colSpan && srcYints.m128i_i32[0] >= 0 && srcYints.m128i_i32[0] < rowSpan )
            {
                destData[col + lineOffset] = srcData[srcXints.m128i_i32[0] + ( srcYints.m128i_i32[0] * rowSpan )];
            }

            if( srcXints.m128i_i32[1] >= 0 && srcXints.m128i_i32[1] < colSpan && srcYints.m128i_i32[1] >= 0 && srcYints.m128i_i32[1] < rowSpan )
            {
                destData[col + 1 + lineOffset] = srcData[srcXints.m128i_i32[1] + ( srcYints.m128i_i32[1] * rowSpan )];
            }

            if( srcXints.m128i_i32[2] >= 0 && srcXints.m128i_i32[2] < colSpan && srcYints.m128i_i32[2] >= 0 && srcYints.m128i_i32[2] < rowSpan )
            {
                destData[col + 2 + lineOffset] = srcData[srcXints.m128i_i32[2] + ( srcYints.m128i_i32[2] * rowSpan )];
            }

            if( srcXints.m128i_i32[3] >= 0 && srcXints.m128i_i32[3] < colSpan && srcYints.m128i_i32[3] >= 0 && srcYints.m128i_i32[3] < rowSpan )
            {
                destData[col + 3 + lineOffset] = srcData[srcXints.m128i_i32[3] + ( srcYints.m128i_i32[3] * rowSpan )];
            }
        }
    }
}

Update 2: I've posted a solution below that takes into account the suggestions I received as answers and also fixes a bug when rotating rectangular images.

  • This screams for GPU computing. Perhaps that's an option for you. – Axel Gneiting Aug 04 '11 at 15:06
  • I remember in the old times using [Bresenham's algorithm](http://en.wikipedia.org/wiki/Bresenham%27s_line_algorithm) to avoid floating point operations in such contexts. This can be an idea, since your xOffset and yOffset modifications seem to be good candidates for Bresenham. – Alexandre C. Aug 04 '11 at 15:08
  • @Axel Gneiting unfortunately, I cannot yet take this to the OpenCL or CUDA world. We have no guarantee of customer hardware. However, I've been given license for the next iteration to write gpu versions for customers with compatible hardware. Until then! – Keck Aug 04 '11 at 15:13

3 Answers


You might be able to optimize quite a few things if you first perform a simple coarse rotation (90/180/270 degrees) and then a final rotation of between 0 and 90 degrees. E.g. you could then optimize the `if( srcX >= 0 && srcX < colSpan && srcY >= 0 && srcY < rowSpan )` test, and it would be more cache-friendly. I'll bet your profiling shows that a 91-degree rotation is a lot slower than a 1-degree rotation.

MSalters
  • It definitely does show this. Here are 20 output timings, stepping the rotation approximately 7.5 degrees each time; times are in seconds: `0.0274989 - 7.5 deg 0.0265466 - 15 deg 0.0246373 - 22.5 deg 0.0245368 - 30 deg 0.0246203 - 37.5 deg 0.0241758 - 45 deg 0.0232378 - 52.5 deg 0.0216996 - 60 deg 0.0233684 - 67.5 deg 0.021255 - 75 deg 0.0179 - 82.5 deg 0.0165404 - 90 deg 0.0179594 - 97.5 deg 0.0185623 - 105 deg 0.0203935 - 112.5 deg 0.0218039 - 120 deg 0.0243723 - 127.5 deg 0.0228118 - 135 deg 0.023037 - 142.5 deg` – Keck Aug 04 '11 at 15:03
  • 3
    BTW, to rotate a big image over 90 degrees, rotate 16x16 subblocks at a time. That should keep the cache happy. – MSalters Aug 04 '11 at 15:06
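To make the subblock idea concrete, here is a minimal sketch of a cache-blocked 90-degree clockwise rotation; `src`/`dst`, `width`/`height`, and the row-major layout are assumptions, and dimensions are taken to be multiples of 16 (true for power-of-2 spans):

//- rotate src (height rows x width cols, row-major) 90 degrees clockwise
//- into dst (width rows x height cols), one 16x16 tile at a time so both
//- the tile being read and the tile being written stay resident in cache
const int TILE = 16;
for ( int ty = 0; ty < height; ty += TILE )
    for ( int tx = 0; tx < width; tx += TILE )
        for ( int y = ty; y < ty + TILE; ++y )
            for ( int x = tx; x < tx + TILE; ++x )
                //- source pixel (x, y) lands at column (height - 1 - y),
                //- row x of the destination
                dst[x * height + (height - 1 - y)] = src[y * width + x];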
---

There's not much to optimize. Your algorithm is sound: you are writing to destData by row (which is good for caching/memory), forcing sequential writes per thread.

The only thing left is unrolling your inner for loop ~4x (for a 32-bit system) or ~8x (for a 64-bit system), which will probably net you around a 10-20% improvement. Because of the nature of the problem (random reads from srcData), you'll always have some variance in timing.

I will ponder further...

Your inner for loop is a strong target for vectorization. Consider these static vectors (pseudocode):

// SSE instructions MOVDDUP (move and duplicate) MULPD (multiply packed double)
double[] vcosa = [cosa, cosa, cosa, cosa] * [1.0, 2.0, 3.0, 4.0]
double[] vsina = [sina, sina, sina, sina] * [1.0, 2.0, 3.0, 4.0]

vxOffset = [xOffset, xOffset, xOffset, xOffset]
vyOffset = [yOffset, yOffset, yOffset, yOffset]

// SSE instructions ADDPD (add packed double) and CVTPD2DQ (convert packed double to signed integer)
vsrcX = vcosa + vxOffset
vsrcY = vsina + vyOffset

x86's SSE instructions are perfectly suited to handling this type of data, even the conversion from doubles to integers. The AVX instructions, which allow 256-bit vectors (4 doubles), are an even better fit.
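As a concrete sketch of the pseudocode above with real intrinsics (SSE2, so 2 doubles per register rather than 4; the variable names are made up):

#include <emmintrin.h>  //- SSE2 intrinsics: ADDPD, CVTPD2DQ, etc.

//- per-row constants: column steps [1*cosa, 2*cosa] and [1*sina, 2*sina]
__m128d vcosa    = _mm_setr_pd( cosa, 2.0 * cosa );
__m128d vsina    = _mm_setr_pd( sina, 2.0 * sina );

//- broadcast the running offsets into both lanes
__m128d vxOffset = _mm_set1_pd( xOffset );
__m128d vyOffset = _mm_set1_pd( yOffset );

//- source coordinates for two consecutive columns at once
__m128d vsrcX = _mm_add_pd( vcosa, vxOffset );
__m128d vsrcY = _mm_add_pd( vsina, vyOffset );

//- CVTPD2DQ: convert 2 packed doubles to 2 packed 32-bit integers
__m128i isrcX = _mm_cvtpd_epi32( vsrcX );
__m128i isrcY = _mm_cvtpd_epi32( vsrcY );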

LastCoder
  • I ended up converting to floats, since I'm going to integers anyway, and then did the SIMD on floats (doing 4 at a time is good!). My times are about 60% of what they were!! Times are in seconds, 20 rotations: `0.0197117` `0.0204737` `0.0198210` `0.0179762` `0.0170747` `0.0167953` `0.0171950` `0.0169126` `0.0165822` `0.0184796` `0.0153101` `0.0143482` `0.0146376` `0.0149194` `0.0156976` `0.0166688` `0.0160494` `0.0171973` `0.0182094` `0.0184469` – Keck Aug 04 '11 at 17:41
---

Taking into account the suggestions offered, I've arrived at this solution. Also, I fixed a bug in my original implementation that caused a problem with rectangular images.

The suggestion of first rotating by a multiple of 90 degrees (as a threaded affine transformation) and then rotating by the remaining smaller angle proved to be slower, simply from having to iterate over the matrix twice. Of course, many factors come into play there, and most likely memory bandwidth skews things. So, for the machine I'm testing and optimizing for, this solution proved to be the best I could come up with.

Using 16x16 tiles:

class DoRotate
{
    const double sina;
    const double cosa;

    const double xHelper;
    const double yHelper;

    const int rowSpan;
    const int colSpan;

    mutable fcomplex *destData;
    const fcomplex *srcData;

    //- [0, cosa, 2*cosa, 3*cosa] and [0, sina, 2*sina, 3*sina]
    __m128 dupOffsetsX;
    __m128 dupOffsetsY;

public:
    void operator() ( const tbb::blocked_range2d<size_t, size_t> &r ) const
    {
        float xOffset;
        float yOffset;
        int lineOffset;

        __m128i srcXints;
        __m128i srcYints;

        __m128 dupXOffset;
        __m128 dupYOffset;

        for ( size_t row = r.rows().begin(); row != r.rows().end(); ++row )
        {
            const size_t colBegin = r.cols().begin();
            xOffset = -(row * sina) + xHelper + (cosa * colBegin);
            yOffset =  (row * cosa) + yHelper + (sina * colBegin);
            lineOffset = ( row * colSpan );  //- all col values are offsets of this row

            //- 4 columns per iteration; range widths are multiples of 4 here
            //- (power-of-2 spans with a grain size of 16)
            for( size_t col = colBegin; col < r.cols().end(); col += 4, xOffset += (4 * cosa), yOffset += (4 * sina) )
            {
                dupXOffset = _mm_load1_ps(&xOffset); //- duplicate the x offset 4 times into a 4 float field
                dupYOffset = _mm_load1_ps(&yOffset); //- duplicate the y offset 4 times into a 4 float field

                //- truncate; the +.5 baked into the helpers already rounds
                srcXints = _mm_cvttps_epi32( _mm_add_ps( dupOffsetsX, dupXOffset ) );
                srcYints = _mm_cvttps_epi32( _mm_add_ps( dupOffsetsY, dupYOffset ) );

                if( srcXints.m128i_i32[0] >= 0 && srcXints.m128i_i32[0] < colSpan && srcYints.m128i_i32[0] >= 0 && srcYints.m128i_i32[0] < rowSpan )
                {
                    destData[col + lineOffset] = srcData[srcXints.m128i_i32[0] + ( srcYints.m128i_i32[0] * colSpan )];
                }

                if( srcXints.m128i_i32[1] >= 0 && srcXints.m128i_i32[1] < colSpan && srcYints.m128i_i32[1] >= 0 && srcYints.m128i_i32[1] < rowSpan )
                {
                    destData[col + 1 + lineOffset] = srcData[srcXints.m128i_i32[1] + ( srcYints.m128i_i32[1] * colSpan )];
                }

                if( srcXints.m128i_i32[2] >= 0 && srcXints.m128i_i32[2] < colSpan && srcYints.m128i_i32[2] >= 0 && srcYints.m128i_i32[2] < rowSpan )
                {
                    destData[col + 2 + lineOffset] = srcData[srcXints.m128i_i32[2] + ( srcYints.m128i_i32[2] * colSpan )];
                }

                if( srcXints.m128i_i32[3] >= 0 && srcXints.m128i_i32[3] < colSpan && srcYints.m128i_i32[3] >= 0 && srcYints.m128i_i32[3] < rowSpan )
                {
                    destData[col + 3 + lineOffset] = srcData[srcXints.m128i_i32[3] + ( srcYints.m128i_i32[3] * colSpan )];
                }
            }
        }
    }

    DoRotate( const double pass_sina, const double pass_cosa, const double pass_xHelper, const double pass_yHelper,
              const int pass_rowSpan, const int pass_colSpan, const float *pass_offsetsX, const float *pass_offsetsY,
              fcomplex *pass_destData, const fcomplex *pass_srcData ) :
        sina(pass_sina), cosa(pass_cosa), xHelper(pass_xHelper), yHelper(pass_yHelper),
        rowSpan(pass_rowSpan), colSpan(pass_colSpan),
        destData(pass_destData), srcData(pass_srcData)
    {
        dupOffsetsX = _mm_load_ps(pass_offsetsX); //- load the offsets X array into one aligned 4-float field
        dupOffsetsY = _mm_load_ps(pass_offsetsY); //- load the offsets Y array into one aligned 4-float field
    }
};

and then to call the code:

double sina = sin(radians);
double cosa = cos(radians);

double centerX = colSpan / 2.0;
double centerY = rowSpan / 2.0;

//- Adding .5 for rounding to avoid periodicity
const double xHelper = centerX - (centerX * cosa) + (centerY * sina) + .5;
const double yHelper = centerY - (centerX * sina) - (centerY * cosa) + .5;

//- [0, cosa, 2*cosa, 3*cosa] and [0, sina, 2*sina, 3*sina], 16-byte aligned
//- so each can be pulled into an SSE register with a single _mm_load_ps
float *offsetsX = (float *)_aligned_malloc( sizeof(float) * 4, 16 );
offsetsX[0] = 0.0f;
offsetsX[1] = (float)cosa;
offsetsX[2] = (float)(cosa * 2.0);
offsetsX[3] = (float)(cosa * 3.0);
float *offsetsY = (float *)_aligned_malloc( sizeof(float) * 4, 16 );
offsetsY[0] = 0.0f;
offsetsY[1] = (float)sina;
offsetsY[2] = (float)(sina * 2.0);
offsetsY[3] = (float)(sina * 3.0);

//- tiled approach (grain size 16 in both dimensions). Works better, but not
//- by much. A little more stays in cache
tbb::parallel_for( tbb::blocked_range2d<size_t, size_t>( 0, rowSpan, 16, 0, colSpan, 16 ),
                   DoRotate( sina, cosa, xHelper, yHelper, rowSpan, colSpan, offsetsX, offsetsY,
                             (fcomplex *)pDestData, (fcomplex *)pSrcData ) );

_aligned_free( offsetsX );
_aligned_free( offsetsY );
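As a side note, the two aligned scratch arrays aren't strictly necessary: the same two registers could be built directly with `_mm_setr_ps`, e.g. (equivalent to the `_aligned_malloc` version above):

//- build [0, cosa, 2*cosa, 3*cosa] and [0, sina, 2*sina, 3*sina] directly
//- in registers; no aligned heap allocation to create and free
const __m128 dupOffsetsX = _mm_setr_ps( 0.0f, (float)cosa, (float)(cosa * 2.0), (float)(cosa * 3.0) );
const __m128 dupOffsetsY = _mm_setr_ps( 0.0f, (float)sina, (float)(sina * 2.0), (float)(sina * 3.0) );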

I'm in no way 100% positive this is the best answer, but it's the best I was able to come up with, so I figured I'd pass it on to the community.

Keck