I've got a number of workers running, pulling work items off a queue. Something like:
```clojure
(require '[clojure.core.async :refer [chan go <! dropping-buffer]])

(def num-workers 100)
(def num-tied-up (atom 0))
(def num-started (atom 0))
(def input-queue (chan (dropping-buffer 200)))

(dotimes [worker-id num-workers]
  (go
    (swap! num-started inc)
    (loop []
      (let [item (<! input-queue)]
        (swap! num-tied-up inc)
        (try
          ;; process-f is my (unshown) work function
          (process-f worker-id item)
          (catch Exception _))
        (swap! num-tied-up dec))
      (recur))
    (swap! num-started dec)))
```
Hopefully `num-tied-up` represents the number of workers performing work at a given point in time. The value of `num-tied-up` hovers around a fairly consistent 50, sometimes 60. As `num-workers` is 100 and the value of `num-started` is, as expected, 100 (i.e. all the `go` routines are running), this feels like there's a comfortable margin.
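(For context, a minimal sketch of the kind of sampling loop that can watch these counters; the 5-second interval and `println` output are purely illustrative choices:)

```clojure
(require '[clojure.core.async :refer [go-loop <! timeout]])

;; Periodically print both counters so their values can be watched over time.
(go-loop []
  (println "started:" @num-started "tied-up:" @num-tied-up)
  (<! (timeout 5000))
  (recur))
```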
My problem is that `input-queue` is growing. I would expect it to hover around the zero mark because there are enough workers to take items off it. But in practice, eventually it maxes out and drops events.
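(The dropping itself is just `dropping-buffer` semantics: puts always complete immediately, and once the buffer is full the put value is silently discarded. A standalone sketch to illustrate; `demo-ch` here is only an example channel, not the real queue:)

```clojure
(require '[clojure.core.async :refer [chan dropping-buffer >!! poll!]])

;; A dropping-buffer chan never blocks on put; once it holds 3 items,
;; further puts still succeed but their values are discarded.
(def demo-ch (chan (dropping-buffer 3)))

(doseq [n (range 10)]
  (>!! demo-ch n))

(println (repeatedly 4 #(poll! demo-ch)))
;; => (0 1 2 nil) - only the first three values survived
```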
It looks like `num-tied-up` has plenty of head-room relative to `num-workers`, so workers should be available to take work off the queue.
My questions:
- Is there anything I can do to make this more robust?
- Are there any other diagnostics I can use to work out what's going wrong? Is there a way to monitor the number of go routines currently working, in case they die?
- Can these observations be reconciled with the data?