
I have written a multi-goroutine version of merge sort in Go, along with a benchmark test. Now I want to use "go tool pprof" to analyze the bottleneck in my code.
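A minimal sketch of how such a profile can be collected is below. The benchmark name, element count, and the MergeSortByMultiGoroutine signature are illustrative assumptions; the real benchmark is in the repository linked at the bottom of this question.

// Hypothetical sketch of the benchmark setup; the real benchmark is in
// the repository linked below. Collect and inspect the profile with:
//
//     go test -bench=. -cpuprofile=cpu.prof
//     go tool pprof cpu.prof
package mergesort

import (
    "math/rand"
    "testing"

    "pingcap/talentplan/tidb/common/alg/sort"
)

// Assumes an entry point with the signature
// sort.MergeSortByMultiGoroutine(arr []int64); adjust to the real code.
func BenchmarkMergeSortByMultiGoroutine(b *testing.B) {
    const n = 1 << 21 // ~16 MB of int64s
    src := make([]int64, n)
    for i := range src {
        src[i] = rand.Int63()
    }
    arr := make([]int64, n)
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        b.StopTimer()
        copy(arr, src) // restore the unsorted input outside the timed region
        b.StartTimer()
        sort.MergeSortByMultiGoroutine(arr)
    }
}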

After collecting the CPU profile, I ran "top10" in pprof and got the following output:

Showing nodes accounting for 4.73s, 98.54% of 4.80s total
Dropped 21 nodes (cum <= 0.02s)
Showing top 10 nodes out of 30
      flat  flat%   sum%        cum   cum%
     3.66s 76.25% 76.25%      3.66s 76.25%  pingcap/talentplan/tidb/common/alg/sort.Merge
     0.62s 12.92% 89.17%      0.64s 13.33%  pingcap/talentplan/tidb/mergesort.prepare
     0.17s  3.54% 92.71%      0.17s  3.54%  runtime.freedefer
     0.12s  2.50% 95.21%      0.14s  2.92%  pingcap/talentplan/tidb/common/alg/sort.quickSort
     0.10s  2.08% 97.29%      0.10s  2.08%  runtime.memmove
     0.03s  0.62% 97.92%      0.03s  0.62%  runtime.memclrNoHeapPointers
     0.03s  0.62% 98.54%      0.04s  0.83%  runtime.stackpoolalloc
         0     0% 98.54%      0.11s  2.29%  pingcap/talentplan/tidb/common/alg/sort.MergeSortByMultiGoroutine
         0     0% 98.54%      0.14s  2.92%  pingcap/talentplan/tidb/common/alg/sort.QuickSort
         0     0% 98.54%      4.04s 84.17%  pingcap/talentplan/tidb/common/alg/sort.mergeSortByMultiGoroutine

From the above, the bottleneck appears to be in sort.Merge, so I used "list Merge" to dive into the method and found the following:

     .          .     50:func Merge(arr []int64, start int, mid int, end int, tmp []int64)  {
     .          .     51:   index, i, j := start, start, mid + 1
  80ms       80ms     52:   for ; i <= mid && j <= end; index++ {
 180ms      180ms     53:           if arr[i] <= arr[j] {
 1.58s      1.58s     54:                   tmp[index] = arr[i]
  50ms       50ms     55:                   i++
     .          .     56:           } else {
 1.52s      1.52s     57:                   tmp[index] = arr[j]
  20ms       20ms     58:                   j++
     .          .     59:           }
     .          .     60:   }
     .          .     61:   for ; i <= mid; index++ {
     .          .     62:           tmp[index] = arr[i]
     .          .     63:           i++
     .          .     64:   }
     .          .     65:   for ; j <= end; index++ {
     .          .     66:           tmp[index] = arr[j]
     .          .     67:           j++
     .          .     68:   }
     .          .     69:
  60ms       60ms     70:   for i := start; i <= end; i++ {
 170ms      170ms     71:           arr[i] = tmp[i]
     .          .     72:   }
     .          .     73:}

Here is what confuses me. The Merge method contains four for-loops. The 1st and the 4th loop do approximately the same amount of work: the 4th loop always runs exactly end - start + 1 times, the 1st loop runs at most that many times, and both of them just move elements from one slice to another. So why does the 1st loop cost so much (1.58s plus 1.52s) while the 4th loop costs so little (just 170ms)? It seems counter-intuitive!

The project is on GitHub at https://github.com/Duncan15/talent-plan/tree/master/tidb/mergesort. You can run "make pprof" to execute the benchmark test and generate the CPU and memory profiles.

I want to know why this happens. If you have time, please read my code and give me some advice.

Thank you in advance!

I have written some code to verify that when Merge runs in a single-goroutine environment, the 1st loop's cost is approximately the same as the 4th loop's, which matches intuition. So I wondered whether the multi-goroutine environment causes the phenomenon above. But in the multi-goroutine version the Merge calls run concurrently, which means the 1st and 4th loops also run concurrently; if concurrent reads and writes to the slices increased the cost, the 4th loop's cost would have to increase as well. Yet the pprof output shows that only the 1st loop's cost increases!
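A sketch of such a single-goroutine check is below. MergeTimed and the timing globals are illustrative names, not code from the repository; MergeTimed is the Merge from the listing above with wall-clock timers wrapped around the 1st and 4th loops.

package main

import (
    "fmt"
    "math/rand"
    "sort"
    "time"
)

var loop1Time, loop4Time time.Duration // accumulated; single-goroutine use only

// MergeTimed is the Merge from the listing above, with timers added.
func MergeTimed(arr []int64, start, mid, end int, tmp []int64) {
    index, i, j := start, start, mid+1

    t := time.Now()
    for ; i <= mid && j <= end; index++ { // 1st loop: reads from both halves
        if arr[i] <= arr[j] {
            tmp[index] = arr[i]
            i++
        } else {
            tmp[index] = arr[j]
            j++
        }
    }
    loop1Time += time.Since(t)

    for ; i <= mid; index++ {
        tmp[index] = arr[i]
        i++
    }
    for ; j <= end; index++ {
        tmp[index] = arr[j]
        j++
    }

    t = time.Now()
    for k := start; k <= end; k++ { // 4th loop: sequential copy back
        arr[k] = tmp[k]
    }
    loop4Time += time.Since(t)
}

func main() {
    const n = 1 << 21 // ~16 MB of int64s
    arr := make([]int64, n)
    tmp := make([]int64, n)
    for i := range arr {
        arr[i] = rand.Int63()
    }
    // Pre-sort the two halves so MergeTimed sees realistic input.
    left, right := arr[:n/2], arr[n/2:]
    sort.Slice(left, func(a, b int) bool { return left[a] < left[b] })
    sort.Slice(right, func(a, b int) bool { return right[a] < right[b] })
    MergeTimed(arr, 0, n/2-1, n-1, tmp)
    fmt.Println("loop1:", loop1Time, "loop4:", loop4Time)
}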

I also wrote another test to verify my hypothesis; you can run it with "make vet". This test runs the Merge method concurrently, but unlike the multi-goroutine merge sort it contains no sorting code at all, only merging. To my surprise, in this test the 1st loop's cost is approximately the same as the 4th loop's. So now I am completely confused!
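Such a merge-only concurrency test might be sketched as follows. The goroutine count and slice sizes are assumptions, and Merge is the function from the listing above (paste it in to run the sketch):

package main

import (
    "fmt"
    "sync"
    "time"
)

// mergeWorker merges the pre-sorted halves of its own private slices, so
// many merges run concurrently but no sorting work is mixed in.
// Merge is the function shown in the profile listing above.
func mergeWorker(wg *sync.WaitGroup, arr, tmp []int64) {
    defer wg.Done()
    Merge(arr, 0, len(arr)/2-1, len(arr)-1, tmp)
}

func main() {
    const (
        goroutines = 8       // assumed; tune to the machine
        n          = 1 << 21 // ~16 MB of int64s per goroutine
    )
    var wg sync.WaitGroup
    start := time.Now()
    for g := 0; g < goroutines; g++ {
        arr := make([]int64, n)
        tmp := make([]int64, n)
        // Evens in the left half, odds in the right: both halves are
        // sorted, and the merge has to alternate between them.
        for i := 0; i < n/2; i++ {
            arr[i] = int64(2 * i)
            arr[n/2+i] = int64(2*i + 1)
        }
        wg.Add(1)
        go mergeWorker(&wg, arr, tmp)
    }
    wg.Wait()
    fmt.Println("all merges finished in", time.Since(start))
}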

  • I think you're misreading the profile output. pprof is a sampling profiler, and cannot tell you how long something took, it's just adding up where in the code the samples were taken. 1.58s doesn't mean that operation takes that long, it means the program was at that point 1.58s out of the sampled time. – JimB Apr 24 '19 at 17:38
  • Try counting how many times you enter loop 1 and loop 4; maybe you're hitting the inside of loop 1 many more times than the inside of loop 4. – tzachs Apr 24 '19 at 18:06
  • Yes, I know pprof is sampling, but I can't agree with you. If it were as you say, the CPU profile would be useless; by that reasoning, within a method the code further down would always show more time than the code above it, and that is not the case. I learned how to use pprof from , and it supports my view. Although 1.58s is not exact because of sampling, it must still reflect the operation's cost to some extent. Thank you for providing more advice. @JimB – cwcing Apr 24 '19 at 18:17
  • But from the logic of the Merge method, that is impossible: the number of times the inside of loop 1 is hit can only be lower than or equal to that of loop 4. @tzachs – cwcing Apr 24 '19 at 18:21
  • @cwcing you should try it regardless to rule it out as an option. – tzachs Apr 24 '19 at 18:23
  • OK, as you suggested, I tried it: I recorded how many times loop 1 and loop 4 are entered during a benchmark run, and the result matches my expectation; the count for loop 1 is lower than that for loop 4. So we are back to square one... @tzachs – cwcing Apr 25 '19 at 02:59
  • Looking at the code, this is mostly because of cache invalidation and memory bandwidth. The initial loop (especially with 16Mb arrays) is reading from very different parts of contiguous memory, which means that each read needs to flush what the other read just pulled into the cache. The latter loop is reading sequentially, which is cache friendly and easily predictable by the CPU. – JimB Apr 25 '19 at 15:56
  • @JimB Thank you very much. I think your explanation is very likely the reason for the phenomenon in my code. – cwcing Apr 25 '19 at 17:25
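One way to see the effect JimB describes, independent of the sorting code, is to benchmark the two access patterns side by side: a two-stream merge that alternates reads between far-apart halves versus a plain sequential pass over the same amount of data. A sketch (sizes and names are mine; run with "go test -bench=."):

package main

import "testing"

const size = 1 << 21 // ~16 MB of int64s, as discussed in the comments

// BenchmarkTwoStreamMerge alternates reads between the two halves of
// arr, the same access pattern as the 1st loop of Merge.
func BenchmarkTwoStreamMerge(b *testing.B) {
    arr := make([]int64, size)
    tmp := make([]int64, size)
    half := size / 2
    for i := 0; i < half; i++ {
        arr[i] = int64(2 * i)        // left half: evens
        arr[half+i] = int64(2*i + 1) // right half: odds
    }
    b.ResetTimer()
    for k := 0; k < b.N; k++ {
        index, i, j := 0, 0, half
        for i < half && j < size {
            if arr[i] <= arr[j] {
                tmp[index] = arr[i]
                i++
            } else {
                tmp[index] = arr[j]
                j++
            }
            index++
        }
    }
}

// BenchmarkSequentialCopy streams through memory once, the same access
// pattern as the 4th loop of Merge.
func BenchmarkSequentialCopy(b *testing.B) {
    arr := make([]int64, size)
    tmp := make([]int64, size)
    b.ResetTimer()
    for k := 0; k < b.N; k++ {
        for i := 0; i < size; i++ {
            arr[i] = tmp[i]
        }
    }
}

Running several of these merges at once (for example via b.RunParallel) should widen the gap further, since the goroutines then compete for shared cache and memory bandwidth, which is closer to what the multi-goroutine sort does.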
