I'm new to Go and learning its concurrent programming model. I wrote two concurrent solutions to the N-Queens problem and can't explain why the second one is significantly faster than the first (about 25 times faster), even though I believe they are almost equivalent.
The First Solution
A semaphore (which limits the number of concurrently active workers) and a signal channel (which connects the workers to a control goroutine that counts running workers to determine when to exit) are passed as context to the worker goroutines.
Here is the main goroutine function Solve():
var (
MAX_THREADS = runtime.NumCPU() * 2
)
// Solve() searches the solution space of the n-queens problem in parallel.
func Solve(n int) (ans int, err error) {
// MAX_PROBLEM_SIZE is 32 because the diagonals need 2n-1 bits to encode while int64 has only 64 bits.
if n > MAX_PROBLEM_SIZE {
err = fmt.Errorf("problem size exceeds limit(%d)", MAX_PROBLEM_SIZE)
return
}
workingLevel, _ := getWorkingLevel(n)
ctx := SolverContext{
signalChan: make(chan SolverSignal, MAX_THREADS), // channel that connects workers and the main thread
pool: semaphore.NewWeighted(int64(MAX_THREADS)), // a semaphore limiting the maximum number of concurrent workers
workingLevel: workingLevel, // on which row the workers are created
}
// kickstart the top level worker
ctx.pool.Acquire(ctx, 1)
ctx.signalChan <- WORKER_START
go func() {
grow(0, 0, 0, 0, n, &ctx)
ctx.signalChan <- WORKER_FINISH
ctx.pool.Release(1)
}()
worker_num := 0
// the control goroutine counts the number of valid solutions and exits when
// there are no more workers running.
for s := range ctx.signalChan {
switch s {
case WORKER_START:
worker_num++
case WORKER_FINISH:
worker_num--
case SOLUTION_FOUND:
ans++
}
if worker_num == 0 {
close(ctx.signalChan)
}
}
return
}
The recursive function grow() tries to create a new worker on the workingLevel to continue the search, instead of going deeper on the same goroutine.
func grow(colBits, slashBits, backslashBits int, row int, n int, ctx *SolverContext) {
if row == n {
// all n queens have been placed without attacking each other in any direction.
ctx.signalChan <- SOLUTION_FOUND
return
}
available := (1<<n - 1) &^ (colBits | slashBits | backslashBits)
for available != 0 {
pos := available & (-available)
growOnNewWorker := func(onNewWorker bool) {
grow(colBits|pos, (slashBits|pos)<<1, (backslashBits|pos)>>1, row+1, n, ctx)
if onNewWorker {
ctx.signalChan <- WORKER_FINISH
ctx.pool.Release(1)
}
}
if ctx.workingLevel == row+1 {
// try to create a new worker. blocks until the semaphore is once again acquirable.
if err := ctx.pool.Acquire(ctx, 1); err == nil {
ctx.signalChan <- WORKER_START
go growOnNewWorker(true)
}
} else {
growOnNewWorker(false)
}
available &^= pos
}
}
Following are some diagrams that might help: how grow() searches for possible solutions, and on which level the workers are created.
On a specific search path, previously taken columns and diagonals are recorded by setting a certain bit of colBits, slashBits, or backslashBits to 1. How to efficiently determine whether the current (row, col) position is safe for a queen is irrelevant to the question; we can focus on the concurrent part of the code.
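For readers unfamiliar with the bitmask encoding, here is a minimal standalone sketch (the function names are mine, for illustration only) of the two tricks the solver relies on: masking out occupied columns/diagonals, and isolating the lowest set bit with x & (-x):

```go
package main

import "fmt"

// lowestSetBit isolates the lowest 1 bit of x via x & (-x),
// e.g. 0b10110 -> 0b00010.
func lowestSetBit(x int) int {
	return x & (-x)
}

// availableCells returns a bitmask of the columns in which a queen
// can still be placed on the current row of an n-by-n board.
func availableCells(n, colBits, slashBits, backslashBits int) int {
	return (1<<n - 1) &^ (colBits | slashBits | backslashBits)
}

func main() {
	// On an empty 4x4 board every column is free.
	fmt.Printf("%04b\n", availableCells(4, 0, 0, 0)) // 1111

	// After placing a queen in column 0 (bit 0) of row 0, the next row
	// must avoid that column and the diagonal shifted by one.
	pos := 1
	fmt.Printf("%04b\n", availableCells(4, pos, pos<<1, pos>>1)) // 1100

	fmt.Printf("%05b\n", lowestSetBit(0b10110)) // 00010
}
```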
The function to find the appropriate working level:
func getWorkingLevel(n int) (level, levelSize int) {
branchesSum := 1
for i := 0; i < n; i++ {
branchesSum *= n
if branchesSum >= MAX_THREADS {
level, levelSize = i, branchesSum
return
}
}
return
}
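As a sanity check, here is the same function copied into a standalone sketch with MAX_THREADS pinned to 32 (its value on my 16-CPU machine), so its output can be inspected in isolation:

```go
package main

import "fmt"

const MAX_THREADS = 32 // assumed: runtime.NumCPU() * 2 on a 16-CPU machine

// getWorkingLevel is copied from the question: it returns the first
// level whose cumulative branch count reaches MAX_THREADS.
func getWorkingLevel(n int) (level, levelSize int) {
	branchesSum := 1
	for i := 0; i < n; i++ {
		branchesSum *= n
		if branchesSum >= MAX_THREADS {
			level, levelSize = i, branchesSum
			return
		}
	}
	return
}

func main() {
	for _, n := range []int{8, 12, 16} {
		level, size := getWorkingLevel(n)
		fmt.Printf("n=%2d -> level=%d, levelSize=%d\n", n, level, size)
	}
}
```

For n = 8 this returns level 1 with levelSize 64, since 8 < 32 <= 64; for every n in the test suite, levelSize comes out as n * n.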
The Second Solution
A pool of workers is created in advance, waiting on taskParamChan for new search tasks. The number of solutions found in each task is sent into ansChan by these workers.
Here's the control thread function Solve():
type cellParams struct {
colBits int
slashBits int
bashSlashBits int
row int
}
// Solve() searches the solution space of the n-queens problem in parallel.
func Solve(n int) (ans int, err error) {
// MAX_PROBLEM_SIZE is 32 because the diagonals need 2n-1 bits to encode while uint64 has only 64 bits.
if n > MAX_PROBLEM_SIZE {
err = fmt.Errorf("problem size exceeds limit(%d)", MAX_PROBLEM_SIZE)
return
}
workingLevel, levelSize := getWorkingLevel(n)
// levelSize tasks will be created in total,
// so we need buffered channels of this size
taskParamChan := make(chan cellParams, levelSize)
ansChan := make(chan int, levelSize)
// create a (real) pool of workers waiting on new tasks
for i := 0; i < MAX_THREADS; i++ {
go solverWorker(taskParamChan, n, ansChan)
}
// kickstart the search
remaining := grow(0, 0, 0, 0, n, &ans, taskParamChan, workingLevel)
if remaining == 0 {
return
}
for partial := range ansChan {
ans += partial
remaining--
if remaining == 0 {
close(ansChan)
close(taskParamChan)
}
}
return
}
The workers:
func solverWorker(taskParamChan chan cellParams, n int, ansChan chan<- int) {
for v := range taskParamChan {
partialAns := 0
// there's no need to sync access to variable partialAns here as only the current worker thread modifies it.
// return value of grow() is dropped because no new tasks will be created.
grow(v.colBits, v.slashBits, v.bashSlashBits, v.row, n, &partialAns, nil, 0)
ansChan <- partialAns
}
}
The function grow() sends task parameters into paramChan on the workingLevel instead of creating a new goroutine. When paramChan is nil, the function searches within its own goroutine until all the N-Queens solutions down the path are found.
func grow(
colBits, slashBits, backslashBits int,
row int, n int, pAns *int,
paramChan chan cellParams, workingLevel int,
) (paramsSent int) {
if row == n {
*pAns++
return
}
available := (1<<n - 1) &^ (colBits | slashBits | backslashBits)
for available != 0 {
pos := available & (-available)
// create new tasks only on the working level
if paramChan != nil && workingLevel == row+1 {
paramChan <- cellParams{
colBits: colBits | pos,
slashBits: (slashBits | pos) << 1,
bashSlashBits: (backslashBits | pos) >> 1,
row: row + 1,
}
paramsSent++
} else {
paramsSent += grow(colBits|pos, (slashBits|pos)<<1, (backslashBits|pos)>>1,
row+1, n, pAns,
paramChan, workingLevel,
)
}
available &^= pos
}
return
}
My function to test the N-Queens solver:
func TestSolver(t *testing.T) {
tests := []struct {
problemSize int
want int
}{
{problemSize: 8, want: 92},
{problemSize: 9, want: 352},
{problemSize: 10, want: 724},
{problemSize: 11, want: 2680},
{problemSize: 12, want: 14200},
{problemSize: 13, want: 73712},
{problemSize: 14, want: 365596},
{problemSize: 15, want: 2279184},
{problemSize: 16, want: 14772512},
}
for _, tt := range tests {
start := time.Now()
if res, _ := nqueens.Solve(tt.problemSize); res != tt.want {
t.Fatalf("got: %d, want: %d", res, tt.want)
} else {
t.Logf("Problem_size_%d test got %d in %d ms", tt.problemSize, res, time.Since(start).Milliseconds())
}
}
}
Test result of the first solution:
=== RUN TestSolver
solver_test.go:30: Problem Size 8 got 92 in 0 ms
solver_test.go:30: Problem Size 9 got 352 in 1 ms
solver_test.go:30: Problem Size 10 got 724 in 1 ms
solver_test.go:30: Problem Size 11 got 2680 in 3 ms
solver_test.go:30: Problem Size 12 got 14200 in 20 ms
solver_test.go:30: Problem Size 13 got 73712 in 92 ms
solver_test.go:30: Problem Size 14 got 365596 in 535 ms
solver_test.go:30: Problem Size 15 got 2279184 in 3292 ms
solver_test.go:30: Problem Size 16 got 14772512 in 21933 ms
--- PASS: TestSolver (25.88s)
And the second one:
=== RUN TestSolver
solver_test.go:30: Problem_size_8 test got 92 in 0 ms
solver_test.go:30: Problem_size_9 test got 352 in 1 ms
solver_test.go:30: Problem_size_10 test got 724 in 0 ms
solver_test.go:30: Problem_size_11 test got 2680 in 0 ms
solver_test.go:30: Problem_size_12 test got 14200 in 2 ms
solver_test.go:30: Problem_size_13 test got 73712 in 5 ms
solver_test.go:30: Problem_size_14 test got 365596 in 28 ms
solver_test.go:30: Problem_size_15 test got 2279184 in 151 ms
solver_test.go:30: Problem_size_16 test got 14772512 in 962 ms
--- PASS: TestSolver (1.15s)
I've also written a single-threaded implementation, which takes only 9.91 s to complete the same test, and that worries me...
My computer has 16 logical CPUs, so the maximum number of concurrently running goroutines in both solutions is 32 (I've set MAX_THREADS = runtime.NumCPU() * 2). Given that workers are created dynamically in the first solution, I counted the total number of workers created during the tests. The number is just n * n, as implied by the getWorkingLevel() function, so I think the overhead of managing goroutines is not the major problem (am I wrong?).
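One difference I can at least measure in isolation: the first solution sends one SOLUTION_FOUND message through signalChan per solution (about 14.7 million sends for n = 16), while the second accumulates a goroutine-local count and sends a single int per task. Here is a rough standalone micro-benchmark sketch of that difference (names and the iteration count are mine; absolute timings will vary by machine):

```go
package main

import (
	"fmt"
	"time"
)

const iterations = 1_000_000

// countViaChannel signals every single event through a buffered channel,
// analogous to sending SOLUTION_FOUND per solution in the first solution.
func countViaChannel() int {
	ch := make(chan struct{}, 32)
	done := make(chan int)
	go func() {
		n := 0
		for range ch {
			n++
		}
		done <- n
	}()
	for i := 0; i < iterations; i++ {
		ch <- struct{}{}
	}
	close(ch)
	return <-done
}

// countLocally increments a goroutine-local counter and sends one
// aggregate result, analogous to partialAns in the second solution.
func countLocally() int {
	done := make(chan int)
	go func() {
		n := 0
		for i := 0; i < iterations; i++ {
			n++
		}
		done <- n
	}()
	return <-done
}

func main() {
	start := time.Now()
	a := countViaChannel()
	fmt.Printf("per-event channel sends: %d events in %v\n", a, time.Since(start))

	start = time.Now()
	b := countLocally()
	fmt.Printf("local counter:           %d events in %v\n", b, time.Since(start))
}
```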
What is happening, and why does the efficiency differ so greatly between the two solutions? Why is the first solution even slower than the single-threaded version?