Parallel version much slower than the serial one in golang

Question

I'm trying to code a parallel version of a simple algorithm that takes a point and a list of point and find which is the point of the list closer to the first one, to compare execution times with the serial version. The problem is that running the parallel version needs more than 1 minute, while the serial version need around 1 seconds.

To be sure that the parallelism effect is noticeable I'm testing the code using a list of around 12 millions of points.

My cpu details:

Model name: Intel(R) Core(TM) i5-4210U CPU @ 1.70GHz
CPU(s): 4

Here are the two versions:

Common part:

type Point struct {
    X float64
    Y float64
}

func dist(p, q Point) float64 {
    return math.Sqrt(math.Pow(p.X-q.X,2)+math.Pow(p.Y-q.Y,2))
}

Sequential function:

func s_argmin(p Point, points_list []Point, i,j int)(int){
    best := 0
    d := dist(p, points_list[0])
    var new_d float64
    for k:=i;k<j+1;k++{
        new_d = dist(p, points_list[k])
        if new_d < d{
            d = new_d
            best = k
        }
    }
    return best
}

Parallel function:

func p_argmin(p Point, points_list []Point, i,j int)(int){
    if i==j{
        return i
    }else{
        mid := int((i+j)/2)
        var argmin1, argmin2 int
        c1 := make(chan int)
        c2 := make(chan int)
        go func(){
            c1 <- p_argmin(p, points_list, i, mid)
        }()
        go func(){
            c2 <- p_argmin(p, points_list, mid+1, j)
        }()
        argmin1 = <- c1
        argmin2 = <- c2
        close(c1)
        close(c2)
        if dist(p,points_list[argmin1])<dist(p,points_list[argmin2]){
            return argmin1
        }else{
            return argmin2
        }
    }
}

I also tried to limit parallelism, with a optimized function that execute the parallel version of the function only when the input size (j-i) is greater than a value, but the serial version is always the faster one.

How can improve the result of the parallel version?

Your common part is so trivial so that concurrency orchestration is so much more expensive so that it makes it useless. Relevant: https://en.wikipedia.org/wiki/Amdahl%27s_law — zerkms, Mar 09 '20 at 02:48
"How can improve the result of the parallel version?" --- split `k:=i;k — zerkms, Mar 09 '20 at 02:53
If your code is _not_ limited by the speed of a core it doesn't help adding more cores (at least not non-NUMA cores). Parallelism is not some magic spell making e.g. memory bound processes go faster. — Volker, Mar 09 '20 at 05:02

peterSO · Answer 1 · 2020-03-09T12:06:26.393

Meaningless microbenchmarks produce meaningless results.

I see no reason to believe that recursive p_argmin might be faster than s_argmin.

$ go test micro_test.go -bench=. -benchmem
goos: linux
goarch: amd64
BenchmarkS-4      946197          1263 ns/op           0 B/op          0 allocs/op
--- BENCH: BenchmarkS-4
    micro_test.go:81: 1 946197 946197
BenchmarkP-4        3477        302076 ns/op       80958 B/op        843 allocs/op
--- BENCH: BenchmarkP-4
    micro_test.go:98: 839 2917203 3477
$

micro_test.go:

package main

import (
    "math"
    "sync"
    "testing"
)

type Point struct {
    X float64
    Y float64
}

func dist(p, q Point) float64 {
    //return math.Sqrt(math.Pow(p.X-q.X, 2) + math.Pow(p.Y-q.Y, 2))
    return math.Sqrt((p.X-q.X)*(p.X-q.X) + (p.Y-q.Y)*(p.Y-q.Y))
}

func s_argmin(p Point, points_list []Point, i, j int) int {
    mbm.Lock()
    nbm++
    mbm.Unlock()

    best := 0
    d := dist(p, points_list[0])
    var new_d float64
    for k := i; k < j+1; k++ {
        new_d = dist(p, points_list[k])
        if new_d < d {
            d = new_d
            best = k
        }
    }
    return best
}

func p_argmin(p Point, points_list []Point, i, j int) int {
    mbm.Lock()
    nbm++
    mbm.Unlock()

    if i == j {
        return i
    }
    mid := int((i + j) / 2)
    var argmin1, argmin2 int
    c1 := make(chan int)
    c2 := make(chan int)
    go func() {
        c1 <- p_argmin(p, points_list, i, mid)
    }()
    go func() {
        c2 <- p_argmin(p, points_list, mid+1, j)
    }()
    argmin1 = <-c1
    argmin2 = <-c2
    if dist(p, points_list[argmin1]) < dist(p, points_list[argmin2]) {
        return argmin1
    }
    return argmin2
}

var (
    nbm int
    mbm sync.Mutex
)

func BenchmarkS(b *testing.B) {
    mbm.Lock()
    nbm = 0
    mbm.Unlock()

    points := make([]Point, 420)
    b.ResetTimer()
    for N := 0; N < b.N; N++ {
        s_argmin(points[0], points, 0, len(points)-1)
    }
    b.StopTimer()

    mbm.Lock()
    b.Log(float64(nbm)/float64(b.N), nbm, b.N)
    mbm.Unlock()
}

func BenchmarkP(b *testing.B) {
    mbm.Lock()
    nbm = 0
    mbm.Unlock()

    points := make([]Point, 420)
    b.ResetTimer()
    for N := 0; N < b.N; N++ {
        p_argmin(points[0], points, 0, len(points)-1)
    }
    b.StopTimer()

    mbm.Lock()
    b.Log(float64(nbm)/float64(b.N), nbm, b.N)
    mbm.Unlock()
}

user3666197 · Answer 2 · 2020-03-09T08:15:34.283

The costs matter ( a lot ) _{can Try-it-Online}

A pure-[SERIAL] flow of code-execution shows the negligible cost of a per-Point evaluated distance. It takes but about some 36 [ns] per Point

//  ... The   [SERIAL] flow of code-execution took      77.095 µs for       [10]
//  --------^^^^^^^^^^------------------------------------|---------------------
//  ... The [PARALLEL] flow of code-execution took     142.563 µs for       [10] Points
//  ... The [PARALLEL] flow of code-execution took     386.27  µs for      [100] Points
//  ... The [PARALLEL] flow of code-execution took    4260.941 µs for     [1000] Points
//  ... The [PARALLEL] flow of code-execution took   31455.29  µs for    [10000] Points
//  ... The   [SERIAL] flow of code-execution took     591.604 µs for    [10000] Points
//  ... The [PARALLEL] flow of code-execution took  391694.389 µs for   [100000] Points
//  ... The   [SERIAL] flow of code-execution took    6425.999 µs for   [100000] Points
//  ... The [PARALLEL] flow of code-execution took 2807615.771 µs for  [1000000] Points
//  ... The   [SERIAL] flow of code-execution took   64596.044 µs for  [1000000] Points
//                                                 |  |  | ... ns   
//                                                 |  |  +____ µs
//                                                 |  +_______ ms
//                                                 +__________  s

Given this, the costs of instantiation of go-parallel flow of execution ( a split-and-conquer ) accumulated so huge add-on costs overheads, that it will hardly get justified for any reasonably used []Point sizes here.

Even for larger []Point sizes, the very overheads here cause ~ 2807 [ns] ~ 78 x slower per-Point processing ( right due to wrong design of the costs_of_computing / costs_of_overheads.

The revised, overhead-strict Amdahl's argument ( not the original one ) is valid here ( the original formulation did not enforce people to take also the hidden add-on overhead costs into consideration and amateurs often tend to skew the Speedup expectations )

func SERIAL( aPointToSEEK Point, aListOfPOINTs []Point ){

   defer TimeTRACK(  time.Now(), "The [SERIAL] flow of code-execution", len( aListOfPOINTs ) )
//   
//            2020/03/09 07:17:54 The [SERIAL] flow of code-execution took    120.529 µs for        [1]
//            2020/03/09 07:17:28 The [SERIAL] flow of code-execution took    194.565 µs for       [10]
//            2020/03/09 07:11:28 The [SERIAL] flow of code-execution took     77.095 µs for      [100]
//            2020/03/09 07:12:16 The [SERIAL] flow of code-execution took    260.771 µs for     [1000]
//            2020/03/09 07:13:19 The [SERIAL] flow of code-execution took    591.604 µs for    [10000]
//            2020/03/09 07:13:57 The [SERIAL] flow of code-execution took   4585.917 µs for   [100000]
//            2020/03/09 07:14:33 The [SERIAL] flow of code-execution took  44317.063 µs for  [1000000]
//            2020/03/09 07:10:30 The [SERIAL] flow of code-execution took  36141.75  µs for  [1000000]
//            2020/03/09 07:15:10 The [SERIAL] flow of code-execution took 554986.415 µs for [10000000]
//            2020/03/09 07:24:10 The [SERIAL] flow of code-execution took 676098.025 µs for [10000000]
//                                                                        |  |  | ... ns   
//                                                                        |  |  +____ µs
//                                                                        |  +_______ ms
//                                                                        +__________  s

   log.Printf(       "%s got nearest aPointID# %d", "The [SERIAL] flow of code-execution", s_argmin( aPointToSEEK, aListOfPOINTs, 0, len( aListOfPOINTs ) - 1 ) )
}

Parallel version much slower than the serial one in golang

2 Answers2

The costs matter ( a lot ) can Try-it-Online

The costs matter ( a lot ) _{can Try-it-Online}