How to optimize an n-queens OpenMP parallel program?

Question

I'm working on parallelizing the n-queens problem using OpenMP, but my sequential program is just as fast. I've been trying to use num_threads, but I don't think I am doing it correctly.

Can someone look at my code and tell me what I am doing wrong or give me some pointers? Thank you.

Here is my parallel program:

// Parallel version of the N-Queens problem.


#include <iostream>  
#include <omp.h>
#include <time.h>
#include <sys/time.h>

// Timing execution
double startTime, endTime;

// Number of solutions found
int numofSol = 0;

// Board size and number of queens
#define N 11

void placeQ(int queens[], int row, int column) {
    
    for(int i = 0; i < row; i++) {
        // Vertical
        if (queens[i] == column) {
            return;
        }
        
        // Two queens in the same diagonal
        if (abs(queens[i] - column) == (row-  i))  {
            return;
        }
    }
    
    // Set the queen
    queens[row] = column;
    
    if(row == N-1) {
        
        #pragma omp atomic 
            numofSol++;  //Placed the final queen, found a solution
        
        #pragma omp critical
        {
            std::cout << "The number of solutions found is: " << numofSol << std::endl; 
            for (int row = 0; row < N; row++) {
                for (int column = 0; column < N; column++) {
                    if (queens[row] == column) {
                        std::cout << "X";
                    }
                    else {
                        std::cout << "|";
                    }
                }
                std::cout  << "\n"  << std::endl; 
            }
        }
    }
    
    else {
        
        // Increment row
        for(int i = 0; i < N; i++) {
            placeQ(queens, row + 1, i);
        }
    }
} // End of placeQ()

void solve() {
    #pragma omp parallel num_threads(30)
    #pragma omp single
    {
        for(int i = 0; i < N; i++) {
            // New task added for first row and each column recursion.
            #pragma omp task
            { 
                placeQ(new int[N], 0, i);
            }
        }
    }
} // end of solve()

int main(int argc, char*argv[]) {

    startTime = omp_get_wtime();   
    solve();
    endTime = omp_get_wtime();
  
    // Print board size, number of solutions, and execution time. 
    std::cout << "Board Size: " << N << std::endl; 
    std::cout << "Number of solutions: " << numofSol << std::endl; 
    std::cout << "Execution time: " << endTime - startTime << " seconds." << std::endl; 
    
    return 0;
}

score 4 · Accepted Answer · edited Apr 18 '21 at 11:05

More than 95% of your program's execution time is spent printing strings to the console, and that printing sits inside a critical section, so only one thread can do it at a time. The overhead of the I/O operations and of the critical section grows with the number of threads. Consequently, the sequential version ends up faster than the parallel one.

To be more precise, it is not the printing itself that is slow, but the string formatting and the synchronization with the console caused by std::endl, which implicitly performs a std::flush. To fix that, each thread can build a thread-local string in parallel (std::ostringstream is good for that). The local string can then be appended to a big global one, whose content is printed sequentially by the main thread (to avoid any extra overhead from parallel I/O) and outside the timed section.

Besides this, you have only 11 tasks, yet your code creates 30 threads for them, while you probably have fewer than 30 cores (or even 30 hardware threads). Creating too many threads is costly (mainly due to thread preemption and scheduling). Moreover, hard-coding the number of threads in the program is generally bad practice. Please use the portable environment variable OMP_NUM_THREADS for that.

Here is the code taking the above remarks into account:

// Parallel version of the N-Queens problem.

#include <iostream>  
#include <omp.h>
#include <time.h>
#include <sys/time.h>
#include <sstream>
#include <cstdlib>  // abs

// Timing execution
double startTime, endTime;

// Number of solutions found
int numofSol = 0;

std::ostringstream globalOss;

// Board size and number of queens
#define N 11

void placeQ(int queens[], int row, int column) {
    
    for(int i = 0; i < row; i++) {
        // Vertical
        if (queens[i] == column) {
            return;
        }
        
        // Two queens in the same diagonal
        if (abs(queens[i] - column) == (row-  i))  {
            return;
        }
    }
    
    // Set the queen
    queens[row] = column;
    
    if(row == N-1) {
        
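        // Increment the counter atomically and capture the new value, so the
        // number written below is not an unsynchronized read of numofSol.
        int solutionIndex;
        #pragma omp atomic capture
        solutionIndex = ++numofSol;  //Placed the final queen, found a solution
        
        std::ostringstream oss;
        oss << "The number of solutions found is: " << solutionIndex << std::endl; 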
        for (int row = 0; row < N; row++) {
            for (int column = 0; column < N; column++) {
                if (queens[row] == column) {
                    oss << "X";
                }
                else {
                    oss << "|";
                }
            }
            oss  << std::endl << std::endl; 
        }

        #pragma omp critical
        globalOss << oss.str();
    }
    
    else {
        
        // Increment row
        for(int i = 0; i < N; i++) {
            placeQ(queens, row + 1, i);
        }
    }
} // End of placeQ()

void solve() {
    #pragma omp parallel //num_threads(30)
    #pragma omp single
    {
        for(int i = 0; i < N; i++) {
            // New task added for first row and each column recursion.
            #pragma omp task
            { 
                int queens[N];  // Task-local board on the stack (new int[N] was never freed)
                placeQ(queens, 0, i);
            }
        }
    }
} // end of solve()

int main(int argc, char*argv[]) {

    startTime = omp_get_wtime();   
    solve();
    endTime = omp_get_wtime();

    std::cout << globalOss.str();
  
    // Print board size, number of solutions, and execution time. 
    std::cout << "Board Size: " << N << std::endl; 
    std::cout << "Number of solutions: " << numofSol << std::endl; 
    std::cout << "Execution time: " << endTime - startTime << " seconds." << std::endl; 
    
    return 0;
}

Here are the resulting execution times on my machine:

Time of the reference implementation (30 threads): 0.114309 s

Optimized implementation:
1 thread: 0.018634 s (x1.00)
2 threads: 0.009978 s (x1.87)
3 threads: 0.006840 s (x2.72)
4 threads: 0.005766 s (x3.23)
5 threads: 0.004941 s (x3.77)
6 threads: 0.003963 s (x4.70)

If you want an even faster parallel code, you can:

give OpenMP a few more tasks (to improve load balancing), but not too many (each task has its own overhead);
reduce the number of (implicit) allocations;
perform a thread-local reduction on numofSol and use just one atomic update per task.
