
I have an Intel IPP function that operates on an image / region of an image.
Its inputs are a pointer to the image, parameters defining the size of the region to process, and the parameters of the filter.
The IPP function is single threaded.

Now, I have an image of size M x N.
I want to apply the filter on it in parallel.
The main idea is simple: break the image into 4 sub-images which are independent of each other.
Apply the filter to each sub-image and write the result to a sub-block of an empty output image, where each thread writes to a distinct set of pixels.
It's really like processing 4 images, each on its own core.

This is the program I'm using to do it:

#include <ipp.h>
#include <omp.h>

void OpenMpTest()
{
    const int width  = 1920;
    const int height = 1080;

    /* Heap allocation - two ~8 MB buffers would overflow a typical stack. */
    Ipp32f* input_image  = ippsMalloc_32f(width * height);
    Ipp32f* output_image = ippsMalloc_32f(width * height);

    IppiSize size = { width, height };

    int step = width * sizeof(Ipp32f);

    /* Splitting the image */
    IppiSize section_size = { width / 2, height / 2};

    Ipp32f* input_upper_left  = input_image;
    Ipp32f* input_upper_right = input_image + width / 2;
    Ipp32f* input_lower_left  = input_image + (height / 2) * width;
    Ipp32f* input_lower_right = input_image + (height / 2) * width + width / 2;

    Ipp32f* output_upper_left  = output_image;
    Ipp32f* output_upper_right = output_image + width / 2;
    Ipp32f* output_lower_left  = output_image + (height / 2) * width;
    Ipp32f* output_lower_right = output_image + (height / 2) * width + width / 2;

    Ipp32f* input_sections[4] = { input_upper_left, input_upper_right, input_lower_left, input_lower_right };
    Ipp32f* output_sections[4] = { output_upper_left, output_upper_right, output_lower_left, output_lower_right };

    /* Filter Params */
    Ipp32f pKernel[7] = { 1, 2, 3, 4, 3, 2, 1 };

    omp_set_num_threads(4);
    #pragma omp parallel for
    for (int i = 0; i < 4; i++)
        ippiFilterRow_32f_C1R(
                              input_sections[i], step,
                              output_sections[i], step,
                              section_size, pKernel, 7, 3);

    ippsFree(input_image);
    ippsFree(output_image);
}

Now, the issue is that I see no gain versus running single-threaded on the whole image.
I tried changing the image size and the filter size, and nothing changed the picture.
The most I could gain was nothing significant (10-20%).

I thought it might have something to do with the fact that I can't "promise" each thread that the zone it received is read-only.
Moreover, I can't let it know that the memory location it writes to belongs only to itself.
I read about defining variables as private and shared, yet I couldn't find a guide on how to deal with arrays and pointers.

What would be the proper way to deal with pointers and sub arrays in OpenMP?
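
For example, would spelling out the data-sharing clauses for the pointer arrays, as in this sketch, be the right direction? (Same logic as the loop above; the clauses are added only for illustration. The pointer arrays, and the buffers they point to, are shared, while the loop index is implicitly private.)

    #pragma omp parallel for default(none) \
            shared(input_sections, output_sections, section_size, pKernel, step)
    for (int i = 0; i < 4; i++)
        ippiFilterRow_32f_C1R(input_sections[i], step,
                              output_sections[i], step,
                              section_size, pKernel, 7, 3);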

  • Your operation is memory bandwidth bound and so it can't scale with the number of physical cores (unless your filter is much larger). However, I would expect more than a 10-20% improvement. Normally, I don't parallelize a loop based on the number of threads. I would do it over the number of pixels or something (see the sketch after these comments). – Z boson Mar 30 '15 at 07:41
  • How can I prove to myself that this problem is memory bound? – Royi Mar 30 '15 at 16:39
  • You should probably use some profiling tool for that. But that's not what I do. I determine the FLOPS of the operation and compare that to the peak FLOPS of the processor. I also determine how much bandwidth the operation is using (you can calculate this as well) and compare that to the peak bandwidth of the processor. If the operation is much less than the peak FLOPS and bound by the bandwidth then it's memory bandwidth bound. – Z boson Apr 10 '15 at 07:11
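
A minimal sketch of what parallelizing over rows instead of over a fixed number of chunks could look like, reusing the variable names from the question (one row per iteration keeps it simple, though a band of rows per iteration would amortize the per-call overhead):

    #pragma omp parallel for
    for (int y = 0; y < height; y++)
    {
        /* Each iteration filters a single-row ROI; distinct rows never
           overlap, so every thread writes to its own set of pixels. */
        IppiSize row_size = { width, 1 };
        ippiFilterRow_32f_C1R(input_image  + y * width, step,
                              output_image + y * width, step,
                              row_size, pKernel, 7, 3);
    }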

2 Answers


How does the performance of threaded IPP compare? Assuming no race conditions, performance problems with writing to shared arrays are most likely to occur in cache lines where part of the line is written by one thread while another part is read by another. It is likely to require a data region larger than 10 megabytes or so before full parallel speedup is seen.
You would need deeper analysis, e.g. with Intel VTune Amplifier, to see whether memory bandwidth or data overlaps are limiting performance.
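
As a rough back-of-the-envelope check (assuming the 1920x1080 single-precision image and 7-tap kernel from the question), you can compare the arithmetic work against the data traffic:

    #include <stdio.h>

    int main(void)
    {
        const double pixels = 1920.0 * 1080.0;
        const double flops  = pixels * (7 + 6);             /* 7 multiplies + 6 adds per output pixel     */
        const double bytes  = pixels * 2.0 * sizeof(float); /* read the input once, write the output once */

        /* flops / bytes is the arithmetic intensity; compare it with
           peak_FLOPS / peak_bandwidth of the CPU. A value far below the
           machine balance (typically several flop/byte on a desktop CPU)
           points to a memory-bandwidth-bound operation. */
        printf("%.0f MFLOP, %.1f MB, %.2f flop/byte\n",
               flops / 1e6, bytes / 1e6, flops / bytes);
        return 0;
    }

For this kernel that works out to roughly 27 MFLOP against about 16.6 MB of traffic, i.e. about 1.6 flop/byte, well below the flop-per-byte balance of a typical multi-core CPU, which is consistent with the operation being bandwidth bound.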


Using the Intel IPP filter, the best solution I found was the following:

    int height = dstRoiSize.height;
    int width  = dstRoiSize.width;
    Ipp32f *pSrc1, *pDst1;
    int nThreads, cH, cT;

    #pragma omp parallel shared( pSrc, pDst, nThreads, width, height, kernelSize, \
                                 xAnchor, cH, cT ) private( pSrc1, pDst1 )
    {
        #pragma omp master
        {
            nThreads = omp_get_num_threads();
            cH = height / nThreads;   /* Rows per thread                     */
            cT = height % nThreads;   /* Leftover rows go to the last thread */
        }
        #pragma omp barrier
        {
            int curH;
            int id = omp_get_thread_num();

            /* srcStep / dstStep are byte strides, hence the Ipp8u* casts. */
            pSrc1 = (Ipp32f*)( (Ipp8u*)pSrc + id * cH * srcStep );
            pDst1 = (Ipp32f*)( (Ipp8u*)pDst + id * cH * dstStep );
            if( id != ( nThreads - 1 ) ) curH = cH;
            else                         curH = cH + cT;

            IppiSize chunkSize = { width, curH };
            ippiFilterRow_32f_C1R( pSrc1, srcStep, pDst1, dstStep,
                                   chunkSize, pKernel, kernelSize, xAnchor );
        }
    }

Thank You.
