I've got a number of workers running, pulling work items off a queue. Something like:
```clojure
(require '[clojure.core.async :refer [chan go <! dropping-buffer]])

(def num-workers 100)
(def num-tied-up (atom 0))
(def num-started (atom 0))
(def input-queue (chan (dropping-buffer 200)))

(dotimes [worker-id num-workers]
  (go
    (swap! num-started inc)
    (loop []
      (let [item (<! input-queue)]
        (swap! num-tied-up inc)
        (try
          ;; process-f is my (unshown) work function
          (process-f worker-id item)
          (catch Exception _))
        (swap! num-tied-up dec))
      (recur))
    (swap! num-started dec)))
```
Hopefully `num-tied-up` represents the number of workers performing work at a given point in time. The value of `num-tied-up` hovers around a fairly consistent 50, sometimes 60. As `num-workers` is 100 and the value of `num-started` is, as expected, 100 (i.e. all the `go` routines are running), this feels like there's a comfortable margin.
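(For context, a minimal sketch of the kind of sampling loop that can watch these counters; the 5-second interval and `println` output are purely illustrative choices:)

```clojure
(require '[clojure.core.async :refer [go-loop <! timeout]])

;; Periodically print both counters so their values can be watched over time.
(go-loop []
  (println "started:" @num-started "tied-up:" @num-tied-up)
  (<! (timeout 5000))
  (recur))
```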
My problem is that `input-queue` is growing. I would expect it to hover around the zero mark because there are enough workers to take items off it. But in practice, eventually it maxes out and drops events.
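(The dropping itself is just `dropping-buffer` semantics: puts always complete immediately, and once the buffer is full the put value is silently discarded. A standalone sketch to illustrate; `demo-ch` here is only an example channel, not the real queue:)

```clojure
(require '[clojure.core.async :refer [chan dropping-buffer >!! poll!]])

;; A dropping-buffer chan never blocks on put; once it holds 3 items,
;; further puts still succeed but their values are discarded.
(def demo-ch (chan (dropping-buffer 3)))

(doseq [n (range 10)]
  (>!! demo-ch n))

(println (repeatedly 4 #(poll! demo-ch)))
;; => (0 1 2 nil) - only the first three values survived
```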
It looks like `num-tied-up` has plenty of head-room relative to `num-workers`, so workers should be available to take work off the queue.
My questions:
- Is there anything I can do to make this more robust?
- Are there any other diagnostics I can use to work out what's going wrong? Is there a way to monitor the number of go routines currently working, in case they die?
- Can these observations be reconciled with the data?