I've implemented a version of the Travelling Salesman Problem using SSE intrinsics from xmmintrin.h and got a decent speedup. Now I'm also trying to implement OpenMP threading on top of it, and I'm seeing a pretty drastic slowdown. I get the correct answer in both cases (i.e. (i) with SSE only, or (ii) with SSE + OpenMP).
I know I am probably doing something wildly wrong, and maybe someone much more experienced than me can spot the issue.
The main loop of my program has the following (brief) pseudocode:
int currentNode;
for (int i = 0; i < numNodes; i++) {
    double minimumDistance = DBL_MAX;
    int minimumDistanceNode = -1;
    for (int j = 0; j < numNodes; j++) {
        // find the distance between 'currentNode' and the j-th node
        // ...
        if (jthNodeDistance < minimumDistance) {
            minimumDistance = jthNodeDistance;
            minimumDistanceNode = jthNode;
        }
    }
    currentNode = minimumDistanceNode;
}
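For reference, a fully scalar, self-contained version of that loop might look like the following (the coordinates and the visited[] bookkeeping are made up for illustration, since the pseudocode above elides them):

#include <float.h>
#include <stdio.h>

#define NUM_NODES 8

int main(void) {
    /* made-up coordinates, purely for illustration */
    float xCoordinates[NUM_NODES] = { 0, 1, 4, 2, 7, 3, 5, 6 };
    float yCoordinates[NUM_NODES] = { 0, 2, 1, 5, 3, 7, 4, 6 };
    int visited[NUM_NODES] = { 0 };

    int currentNode = 0;
    visited[currentNode] = 1;

    for (int i = 1; i < NUM_NODES; i++) {
        double minimumDistance = DBL_MAX;
        int minimumDistanceNode = -1;
        for (int j = 0; j < NUM_NODES; j++) {
            if (visited[j])
                continue;
            float dx = xCoordinates[currentNode] - xCoordinates[j];
            float dy = yCoordinates[currentNode] - yCoordinates[j];
            /* squared distance is enough for picking the minimum */
            float jthNodeDistance = dx * dx + dy * dy;
            if (jthNodeDistance < minimumDistance) {
                minimumDistance = jthNodeDistance;
                minimumDistanceNode = j;
            }
        }
        currentNode = minimumDistanceNode;
        visited[currentNode] = 1;
        printf("step %d: move to node %d\n", i, currentNode);
    }
    return 0;
}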
And here is my implementation. It's still semi-pseudocode, as I've brushed over some parts that I don't think have an impact on performance; I believe the issue can be found in the following snippet. If you just omit the #pragma lines, it is pretty much identical to the SSE-only version of the same program, so I figure I should only include the OpenMP version:
int currentNode = 0;
#pragma omp parallel
{
    #pragma omp single
    {
        for (int i = 1; i < totalNum; i++) {
            minimum = DBL_MAX;
            __m128 currentNodeX = _mm_set1_ps(xCoordinates[currentNode]);
            __m128 currentNodeY = _mm_set1_ps(yCoordinates[currentNode]);
            #pragma omp parallel num_threads(omp_get_max_threads())
            {
                float localMinimum = FLT_MAX;
                int localMinimumNode;
                #pragma omp for
                for (int j = 0; j < loopEnd; j += 4) {
                    // a number of SSE vector calculations to find the distances
                    // between the current node and the four nodes we're looking
                    // at in this iteration of the loop:
                    __m128 subXs_0 = _mm_sub_ps(currentNodeX, _mm_load_ps(&xCoordinates[j]));
                    __m128 squareSubXs_0 = _mm_mul_ps(subXs_0, subXs_0);
                    __m128 subYs_0 = _mm_sub_ps(currentNodeY, _mm_load_ps(&yCoordinates[j]));
                    __m128 squareSubYs_0 = _mm_mul_ps(subYs_0, subYs_0);
                    __m128 addXY_0 = _mm_add_ps(squareSubXs_0, squareSubYs_0);
                    float temp[unroll];  // unroll is 4 here: one __m128 worth of floats
                    _mm_store_ps(&temp[0], addXY_0);
                    // skipping the code that extracts the minimum distance and
                    // its corresponding node from temp; I don't think it's
                    // massively relevant, but each thread keeps its own
                    // localMinimum and localMinimumNode
                }
                // updating the global minimum in a thread-safe way
                #pragma omp critical (update_minimum)
                {
                    if (localMinimum < minimum) {
                        minimum = localMinimum;
                        minimumNode = localMinimumNode;
                    }
                }
            }
            // still within the 'omp single'
            currentNode = minimumNode;
        }
    }
}
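For completeness, the part I skipped inside the for(int j) loop does roughly the following per thread (a simplified sketch, not my exact code):

// simplified sketch of the elided per-thread minimum search
// (illustrative only; my real code differs in the details):
for (int lane = 0; lane < 4; lane++) {
    if (temp[lane] < localMinimum) {
        localMinimum = temp[lane];
        localMinimumNode = j + lane;  // global index of the node in this SSE lane
    }
}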
So my logic is:

- omp single for the top-level for(int i) loop, as I only want one thread dedicated to it;
- omp parallel num_threads(omp_get_max_threads()) for the inner for(int j) loop, as I want all cores working on this part of the code at the same time;
- omp critical at the end of the full for(int j) loop, as I want to thread-safely update the current node.
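Condensed down to just the OpenMP structure, what I'm aiming for looks like this:

#pragma omp parallel                  // create the team once, outside the tour loop
{
    #pragma omp single                // one thread walks the tour
    {
        for (int i = 1; i < totalNum; i++) {
            #pragma omp parallel num_threads(omp_get_max_threads())  // nested region per step
            {
                #pragma omp for       // all threads share the distance scan
                for (int j = 0; j < loopEnd; j += 4) {
                    /* SSE distance work, per-thread localMinimum */
                }
                #pragma omp critical (update_minimum)
                {
                    /* merge this thread's local minimum into the global one */
                }
            }
            currentNode = minimumNode;  // back on the single thread
        }
    }
}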
In terms of run-time, the OpenMP version is typically twice as slow as the SSE-only version.
Does anything jump out at you as particularly bad in my code that could be causing this drastic slowdown with OpenMP?