I am using Go to parallelize 2D convolutions. The convolution is implemented in Go, built as a c-archive, and linked into a C binary that calls the Go code. The Go code makes no calls to any C function.

Before the goroutines are spawned, the C code loads all matrices into memory, and every goroutine accesses them through shared memory.

I use GOMAXPROCS-1 to decide how many goroutines to spawn, and each goroutine is assigned an ID. Rows of the matrix are assigned to the goroutines by ID in a striped fashion. Each goroutine is locked to an OS thread when spawned and releases the thread once it has finished.

For example, if GOMAXPROCS is set to 4, goroutine 0 takes rows 0, 4, 8, 12, etc., and goroutine 1 takes rows 1, 5, 9, 13, and so on.

My issue is that when GOMAXPROCS is set to 4, Go spawns 11 OS threads.

htop and atop: (screenshot of the thread list)

My understanding is that these OS threads are spawned because the scheduler is trying to make sure that there are always threads available that are not blocked.

There is no I/O and there are no system calls after the goroutines have been spawned, so I don't understand why the scheduler is creating all these threads or what is blocking them.

The number of threads being spawned slows down execution when running with GOMAXPROCS >= 20 on a machine with 40 cores.

Why is the scheduler spawning all these threads? How can I debug where/how the goroutines are being blocked?

Source code

Thor
  • It's hard to say how many you should expect without an example, but all C calls are blocking and on a different stack, so they must happen in a different thread. – JimB Nov 28 '18 at 19:25
  • Added link to source. Do you mean that calls to C.float and C.uchar are blocking as well, or calls to my own C functions? I am not calling any custom C functions from my Go code. – Thor Nov 28 '18 at 19:58
  • No, `C.float` and `C.uchar` are types, not callable functions. You said the convolution is happening in a `c-archive included in a C binary.`, so I assumed you were calling into that with cgo, though you're correct that your example here has no cgo calls. I would start by removing the `LockOSThread` calls (you can't use thread-local storage in Go, so there's no reason to call it), then check any other locations where you might be making cgo calls. – JimB Nov 28 '18 at 20:05
  • Thank you for the feedback. I updated the description to make it clear that no C functions are called from the Go code. Removing LockOSThread creates fewer threads (7 instead of 11), but execution is a few seconds slower (190 seconds without locking versus 180 seconds with locking). I am going to try to reduce casting with C.float/C.uchar as much as possible. – Thor Nov 28 '18 at 20:20

1 Answer

GOMAXPROCS limits the number of threads running Go code, but cgo calls do not count as Go code, so you can still see multiple threads with GOMAXPROCS=1.

Jiacai Liu