I've implemented a version of the Travelling Salesman Problem using SSE intrinsics from xmmintrin.h and got a decent speedup. Now I'm also trying to implement OpenMP threading on top of it, and I'm seeing a pretty drastic slowdown. I get the correct answer in both cases (i.e. (i) with SSE only, or (ii) with SSE + OpenMP).
I know I am probably doing something wildly wrong, and maybe someone much more experienced than me can spot the issue.
The main loop of my program has the following (brief) pseudocode:
int currentNode;
for (int i = 0; i < numNodes; i++) {
    double minimumDistance = DBL_MAX;
    int minimumDistanceNode = -1;
    for (int j = 0; j < numNodes; j++) {
        // find the distance between 'currentNode' and the j-th node
        // ...
        if (jthNodeDistance < minimumDistance) {
            minimumDistance = jthNodeDistance;
            minimumDistanceNode = jthNode;
        }
    }
    currentNode = minimumDistanceNode;
}
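For reference, a fully scalar, self-contained version of that loop might look like the following (the coordinates and the visited[] bookkeeping are made up for illustration, since the pseudocode above elides them):

#include <float.h>
#include <stdio.h>

#define NUM_NODES 8

int main(void) {
    /* made-up coordinates, purely for illustration */
    float xCoordinates[NUM_NODES] = { 0, 1, 4, 2, 7, 3, 5, 6 };
    float yCoordinates[NUM_NODES] = { 0, 2, 1, 5, 3, 7, 4, 6 };
    int visited[NUM_NODES] = { 0 };

    int currentNode = 0;
    visited[currentNode] = 1;

    for (int i = 1; i < NUM_NODES; i++) {
        double minimumDistance = DBL_MAX;
        int minimumDistanceNode = -1;
        for (int j = 0; j < NUM_NODES; j++) {
            if (visited[j])
                continue;
            float dx = xCoordinates[currentNode] - xCoordinates[j];
            float dy = yCoordinates[currentNode] - yCoordinates[j];
            /* squared distance is enough for picking the minimum */
            float jthNodeDistance = dx * dx + dy * dy;
            if (jthNodeDistance < minimumDistance) {
                minimumDistance = jthNodeDistance;
                minimumDistanceNode = j;
            }
        }
        currentNode = minimumDistanceNode;
        visited[currentNode] = 1;
        printf("step %d: move to node %d\n", i, currentNode);
    }
    return 0;
}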
And here is my implementation. It's still semi-pseudocode, as I've brushed over some parts that I don't think have an impact on performance; I believe the issue can be found in the following snippet. If you just omit the #pragma lines, it is pretty much identical to the SSE-only version of the same program, so I figure I should only include the OpenMP version:
int currentNode = 0;
#pragma omp parallel
{
    #pragma omp single
    {
        for (int i = 1; i < totalNum; i++) {
            minimum = DBL_MAX;
            __m128 currentNodeX = _mm_set1_ps(xCoordinates[currentNode]);
            __m128 currentNodeY = _mm_set1_ps(yCoordinates[currentNode]);
            #pragma omp parallel num_threads(omp_get_max_threads())
            {
                float localMinimum = FLT_MAX;
                int localMinimumNode;
                #pragma omp for
                for (int j = 0; j < loopEnd; j += 4) {
                    // a number of SSE vector calculations to find the distances
                    // between the current node and the four nodes we're looking
                    // at in this iteration of the loop:
                    __m128 subXs_0 = _mm_sub_ps(currentNodeX, _mm_load_ps(&xCoordinates[j]));
                    __m128 squareSubXs_0 = _mm_mul_ps(subXs_0, subXs_0);
                    __m128 subYs_0 = _mm_sub_ps(currentNodeY, _mm_load_ps(&yCoordinates[j]));
                    __m128 squareSubYs_0 = _mm_mul_ps(subYs_0, subYs_0);
                    __m128 addXY_0 = _mm_add_ps(squareSubXs_0, squareSubYs_0);
                    float temp[unroll];  // unroll is 4 here: one __m128 worth of floats
                    _mm_store_ps(&temp[0], addXY_0);
                    // skipping the code that extracts the minimum distance and
                    // its corresponding node from temp; I don't think it's
                    // massively relevant, but each thread keeps its own
                    // localMinimum and localMinimumNode
                }
                // updating the global minimum in a thread-safe way
                #pragma omp critical (update_minimum)
                {
                    if (localMinimum < minimum) {
                        minimum = localMinimum;
                        minimumNode = localMinimumNode;
                    }
                }
            }
            // still within the 'omp single'
            currentNode = minimumNode;
        }
    }
}
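For completeness, the part I skipped inside the for(int j) loop does roughly the following per thread (a simplified sketch, not my exact code):

// simplified sketch of the elided per-thread minimum search
// (illustrative only; my real code differs in the details):
for (int lane = 0; lane < 4; lane++) {
    if (temp[lane] < localMinimum) {
        localMinimum = temp[lane];
        localMinimumNode = j + lane;  // global index of the node in this SSE lane
    }
}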
So my logic is:

- omp single for the top-level for(int i) loop, as I only want one thread dedicated to it;
- omp parallel num_threads(omp_get_max_threads()) for the inner for(int j) loop, as I want all cores working on this part of the code at the same time;
- omp critical at the end of the full for(int j) loop, as I want to thread-safely update the current node.
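Condensed down to just the OpenMP structure, what I'm aiming for looks like this:

#pragma omp parallel                  // create the team once, outside the tour loop
{
    #pragma omp single                // one thread walks the tour
    {
        for (int i = 1; i < totalNum; i++) {
            #pragma omp parallel num_threads(omp_get_max_threads())  // nested region per step
            {
                #pragma omp for       // all threads share the distance scan
                for (int j = 0; j < loopEnd; j += 4) {
                    /* SSE distance work, per-thread localMinimum */
                }
                #pragma omp critical (update_minimum)
                {
                    /* merge this thread's local minimum into the global one */
                }
            }
            currentNode = minimumNode;  // back on the single thread
        }
    }
}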
In terms of run-time, the OpenMP version is typically twice as slow as the SSE-only version.
Does anything jump out at you as particularly bad in my code that could be causing this drastic slowdown with OpenMP?