
This is my first question. I'm trying to parallelize a 2D Haar transform function in C with OpenMP. I obtained the code here and modified it accordingly. The program takes a black-and-white image, loads it into a matrix, and computes one level of the Haar wavelet transform. At the end it normalizes the values and writes the transformed image back to disk.
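
For context, one level of the 1D Haar step replaces each input pair (a, b) with the scaled sum (a+b)/sqrt(2) and the scaled difference (a-b)/sqrt(2). A minimal sketch of that step, using the same arithmetic as the code below (haar_1d_step is just an illustration, not part of the original code):

    /* One level of the 1D Haar step: the first half of out receives the
       scaled pairwise sums, the second half the scaled differences. */
    void haar_1d_step ( int k, const double in[], double out[] )
    {
        double s = sqrt ( 2.0 );   /* requires math.h */
        int i;
        for ( i = 0; i < k; i++ )
        {
            out[i]   = ( in[2*i] + in[2*i+1] ) / s;   /* approximation */
            out[k+i] = ( in[2*i] - in[2*i+1] ) / s;   /* detail        */
        }
    }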

This is a resulting image: [image: 1 level of HDT]

My problem is that the parallelized version runs quite a bit slower than the serial one. For now I'm attaching a snippet from the main part I want to parallelize (later on I can post all the surrounding code):

void haar_2d ( int m, int n, double u[] )
// m & n are the dimensions (every image is a perfect square)
// u is the input array in row-major order (originally described as column-major; see the update below)
{
int i;
int j;
int k;
double s;
double *v;

int tid, nthreads, chunk;

s = sqrt ( 2.0 );

v = ( double * ) malloc ( m * n * sizeof ( double ) );

for ( j = 0; j < n; j++ )
{
    for ( i = 0; i < m; i++ )
    {
        v[i+j*m] = u[i+j*m];
    }
}
/*
Determine K, the largest power of 2 such that K <= M.
*/
k = 1;
while ( k * 2 <= m )
{
    k = k * 2;
}

/*   Transform all columns.  */

while ( n/2 < k ) // just 1 level of transformation
{
    k = k / 2;

    clock_t begin = clock();

    #pragma omp parallel shared(s,v,u,n,m,nthreads,chunk) private(i,j,tid)
    {
        tid = omp_get_thread_num();
        printf("Thread %d starting...\n",tid);

        #pragma omp for schedule (dynamic)
        for ( j = 0; j < n; j++ )
        {
            for ( i = 0; i < k; i++ )
            {               
                v[i  +j*m] = ( u[2*i+j*m] + u[2*i+1+j*m] ) / s;
                v[k+i+j*m] = ( u[2*i+j*m] - u[2*i+1+j*m] ) / s;
            }
        }

        #pragma omp for schedule (dynamic)
        for ( j = 0; j < n; j++ )
        {
            for ( i = 0; i < 2 * k; i++ )
            {
                u[i+j*m] = v[i+j*m];
            }
        }
    }//end parallel

    clock_t end = clock();
    double time_spent = (double)(end - begin) / CLOCKS_PER_SEC;
    printf ( "Time for COLUMNS: %f ms\n", time_spent * 1000);

}//end while

// [...]code for rows
free ( v );

return;
}

The timings are, more or less:

Time for COLUMNS: 160.519000 ms // parallel
Time for COLUMNS: 62.842000 ms // serial

I have tried rearranging the pragmas in lots of different ways (e.g. with a static schedule, with sections, tasks, and so on), as well as rearranging the data scopes of the variables and allocating dynamically inside parallel regions. I thought it would be simple to parallelize a two-level for loop, but I have now been struggling with it for two days. I'm asking for your help; I've already checked nearly all the related questions here, but I'm still unable to make progress or at least understand why. Thank you in advance. (CPU: Intel Core i3-4005U @ 1.70 GHz, 2 cores / 4 threads)
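
For example, one of the static-schedule variants I tried looked roughly like this (same loop body as above, only the scheduling differs; i must be private because it is declared at function scope):

    #pragma omp parallel for schedule(static) private(i)
    for ( j = 0; j < n; j++ )
    {
        for ( i = 0; i < k; i++ )
        {
            v[i  +j*m] = ( u[2*i+j*m] + u[2*i+1+j*m] ) / s;
            v[k+i+j*m] = ( u[2*i+j*m] - u[2*i+1+j*m] ) / s;
        }
    }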

UPDATE:

1) As for m and n: the function is supposed to handle rectangular images too one day, so I just left both parameters there.

2) I figured out that u is actually a plain array with the matrix linearized inside it, row by row (I use PGM images).

3) memcpy is a better option, so I'm using it now.

As for the main topic, I've tried to divide the job over n by spawning a task for each chunk, and the result is a little bit faster than the serial code. Now I know that the input matrix u is in proper row-major order and the two for loops seem to proceed accordingly, but I'm not sure about the timings: using both omp_get_wtime() and clock(), I don't know how to measure the speedup. I did tests with different image sizes, from 16x16 up to 4096x4096, and the parallel version seems to be slower according to clock() and faster according to omp_get_wtime() and gettimeofday(). Do you have any suggestions on how to handle this correctly with OpenMP, or at least how to measure the speedup correctly?

while ( n/2 < k )
{
    k = k / 2;
    double start_time = omp_get_wtime();
    // clock_t begin = clock();

    #pragma omp parallel shared(s,v,u,n,m,nthreads,chunk) private(i,j,tid) firstprivate(k)
    {
        nthreads = omp_get_num_threads();

        #pragma omp single
        {
            printf("Number of threads = %d\n", nthreads);

            int chunk = n/nthreads;
            printf("Chunk size = %d\n", chunk);
            printf("Thread %d is starting the tasks.\n", omp_get_thread_num());

            int h;
            for ( h = 0; h < n; h = h + chunk )
            {
                printf("FOR CYCLE h=%d\n", h);

                #pragma omp task shared(s,v,u,n,m,nthreads,chunk) private(i,j,tid) firstprivate(h,k)
                {
                    tid = omp_get_thread_num();
                    printf("Thread %d starts at %d position\n", tid, h);

                    // clamp the upper bound so the last task does not run
                    // past n when n is not a multiple of chunk
                    int jmax = ( h + chunk < n ) ? h + chunk : n;
                    for ( j = h; j < jmax; j++ )
                    {
                        for ( i = 0; i < k; i++ )
                        {
                            v[i  +j*m] = ( u[2*i+j*m] + u[2*i+1+j*m] ) / s;
                            v[k+i+j*m] = ( u[2*i+j*m] - u[2*i+1+j*m] ) / s;
                        }
                    }
                }// end task
            }// end task-launching for
            #pragma omp taskwait
        }// end single
    }// end parallel region

    // clock_t end = clock();
    // double time_spent = (double)(end - begin) / CLOCKS_PER_SEC;
    // printf ( "COLUMNS: %f ms\n", time_spent * 1000);

    double time = omp_get_wtime() - start_time;
    printf ( "COLUMNS: %f ms\n", time*1000);

    for ( j = 0; j < n; j++ )
    {
        for ( i = 0; i < 2 * k; i++ )
        {
            u[i+j*m] = v[i+j*m];
        }
    }
}// end while
p_koelio
  • What compiler and OS? `clock()` will only do what you want with the MSVC C runtime. In general, use `omp_get_wtime()`. – Z boson Jul 12 '16 at 06:30
  • I use gcc version 5.3.1 with Ubuntu 16.04 (kernel 4.4). I have implemented your advice, but is it correct to compare the time obtained with omp_get_wtime() for the parallel code against the time obtained through clock() for the serial code? Thanks – p_koelio Jul 12 '16 at 09:46

2 Answers


I have a few questions that deeply concern me about your code.

  1. m & n are the dimensions (every image is a perfect square)

    Then why are there two size parameters?

  2. u is the input array in column-major order

    This is an incredibly bad idea. C uses row-major ordering for memory, so column-major indexing leads to strided memory access, which is very bad for performance. If at all possible, you need to fix this (see the access-pattern sketch after this list).

  3. Because both u and v are linearized matrices, this

    for (int j = 0; j < n; j++) {
        for (int i = 0; i < m; i++) {
            v[i + j * m] = u[i + j * m];
        }
    }
    

    can be replaced with a call to memcpy.

    memcpy(v, u, m * n * sizeof(double));
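
To make the access-pattern point concrete, here is a minimal sketch contrasting the two traversals of a linearized m x n buffer a (illustrative names, not from the question). The first loop order touches memory contiguously; the second jumps m doubles per access:

    double sum = 0.0;
    int i, j;

    /* Traversal matching the storage order: the inner index moves
       through adjacent doubles, so caching and prefetching help. */
    for ( j = 0; j < n; j++ )
        for ( i = 0; i < m; i++ )
            sum += a[i + j*m];   /* stride of 1 double */

    /* Traversal against the storage order: each access jumps m
       doubles, so for large m nearly every load misses the cache. */
    for ( i = 0; i < m; i++ )
        for ( j = 0; j < n; j++ )
            sum += a[i + j*m];   /* stride of m doubles */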
    

On to your issue. The reason your OpenMP version is slower is that all of your threads are doing the same thing. This isn't useful, and it leads to bad things like false sharing. You need to use each thread's id (tid in your code) to partition the data across the threads, keeping in mind that false sharing is bad.
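
A minimal sketch of that kind of manual partitioning, using the question's arrays (the block bounds are illustrative, not tuned):

    #pragma omp parallel shared(u, v, n, k, m, s)
    {
        int tid      = omp_get_thread_num();
        int nthreads = omp_get_num_threads();

        /* Give each thread a contiguous block of columns; the last
           thread absorbs the remainder when n % nthreads != 0. */
        int lo = tid * ( n / nthreads );
        int hi = ( tid == nthreads - 1 ) ? n : lo + n / nthreads;

        for ( int j = lo; j < hi; j++ )
        {
            for ( int i = 0; i < k; i++ )
            {
                v[i  +j*m] = ( u[2*i+j*m] + u[2*i+1+j*m] ) / s;
                v[k+i+j*m] = ( u[2*i+j*m] - u[2*i+1+j*m] ) / s;
            }
        }
    }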

Tim
  • Thank you for your advice, I have updated the code to follow it, but I'm not sure if this is what you intended. Also, I figured out that u is a normal array with a matrix linearized row by row, that is, the first n entries are one row, then the next n entries are the second row, etc. – p_koelio Jul 12 '16 at 12:18

The problem was that I was using clock() instead of omp_get_wtime(). Thanks to Z boson.
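
For anyone who hits the same thing: with glibc, clock() measures CPU time summed over all threads, so a 4-thread region can report roughly 4x the wall time even when it actually runs faster. Measure both the serial and the parallel build the same way with omp_get_wtime() (a sketch using the question's haar_2d):

    double t0 = omp_get_wtime();               /* wall-clock seconds */
    haar_2d ( m, n, u );                       /* serial or parallel build */
    double t1 = omp_get_wtime();
    printf ( "wall time: %f ms\n", ( t1 - t0 ) * 1000.0 );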

p_koelio