
I am only using the basic joblib functionality:

```python
Parallel(n_jobs=-1)(delayed(function)(arg) for arg in arglist)
```

I am frequently getting the warning:

UserWarning: A worker stopped while some jobs were given to the executor. This can be caused by a too short worker timeout or by a memory leak.

This tells me that one possible cause is a too-short worker timeout. Since I did not set a worker timeout and the default is None, this cannot be the issue. How do I go about finding a memory leak? Is there something I can do to avoid this warning? Did some parts not get executed, or should I just not worry about this?

cmosig
    have you looked at [this issue](https://github.com/joblib/joblib/issues/883)? One person did mention that they didn't see any issues apart from the warning – Oliver Ni May 28 '20 at 09:14
  • that is a good hint + good news. Thank you. I guess I will try and come up with a minimal example. – cmosig May 28 '20 at 09:17
  • yep! Remember googling the error message can return very useful results! – Oliver Ni May 28 '20 at 09:21
  • Are you on an AMD CPU? I'll try to find it now but I found a thread somewhere on another site where others were having this same issue with virtual threads on AMD CPUs and no one could figure out why. Running n_jobs only on my physical cores made the warning go away. – user2415706 Jun 04 '20 at 02:28
  • Thanks for the answer, but I am running on Intel: `i7-8565U`, `Xeon(R) Gold 6152`, and `Xeon(R) CPU E5-2680`. I will try limiting `n_jobs` to the number of physical processors. Maybe that helps. – cmosig Jun 04 '20 at 13:11
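
The workaround discussed in the comments, capping `n_jobs` at the physical core count instead of using `n_jobs=-1` (all logical CPUs), can be sketched roughly like this. The halving heuristic assumes two hardware threads per core (use `psutil.cpu_count(logical=False)` for an exact count if `psutil` is available), and `square` is just a stand-in workload:

```python
import os

import joblib

# os.cpu_count() reports logical CPUs; with SMT/hyper-threading
# enabled, halving it approximates the physical core count.
physical_cores = max(1, (os.cpu_count() or 1) // 2)

def square(x):
    return x * x

# Restrict parallelism to physical cores instead of n_jobs=-1.
results = joblib.Parallel(n_jobs=physical_cores)(
    joblib.delayed(square)(i) for i in range(8)
)
print(results)  # [0, 1, 4, 9, 16, 25, 36, 49]
```

`joblib.Parallel` returns results in submission order, so the output list lines up with the input iterable regardless of which worker finished first.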

1 Answer


To fix this, increase the timeout. I used this:

```python
import joblib

# Increase the timeout (tune this number to suit your use case).
timeout = 99999
result_chunks = joblib.Parallel(n_jobs=njobs, timeout=timeout)(
    joblib.delayed(f_chunk)(i) for i in n_chunks
)
```

Note that this warning is benign: joblib will recover, and the results are complete and accurate.

See a more detailed answer here.

Contango
  • thanks for the answer! Not ready to accept though, since you write in the referenced post that it also happens with simply high CPU utilization, so increasing the timeout wouldn't fully fix the issue. I also feel like tweaking n_jobs somehow randomly isn't really the way to go. I would much rather just ignore the warning instead of dropping performance. – cmosig Apr 25 '22 at 14:17
  • 1
    Commenting on this to note that `joblib` does actually recover after the said warning. – NelsonGon Oct 12 '22 at 10:30
  • 2
    @NelsonGon Here we go - https://github.com/scikit-learn/scikit-learn/issues/14626#issuecomment-520659817 – jtlz2 Jan 10 '23 at 09:40