
I run the code in parallel in the following fashion:

    grouped_data = Parallel(n_jobs=14)(delayed(function)(group) for group in grouped_data)

After the computation is done, I can see in a system monitor that all the spawned processes are still alive and holding memory:

[screenshot: system monitor showing the idle joblib worker processes still resident in memory]

None of these processes is killed until the main process terminates, which leads to a memory leak. If I do the same with multiprocessing.Pool in the following way:

    pool = Pool(14)
    pool.map(apply_wrapper, np.array_split(groups, 14))
    pool.close()
    pool.join()

then I see that all the spawned processes are terminated in the end and no memory is leaked. However, I need joblib and its loky backend, since it can serialize some local functions.
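
For example, a minimal sketch of the difference (make_worker and the sample data are made up for illustration): loky serializes the closure with cloudpickle, while multiprocessing.Pool relies on plain pickle and fails on it.

    from joblib import Parallel, delayed

    def make_worker(offset):
        # A function defined at runtime: plain pickle cannot serialize it,
        # but loky's cloudpickle-based serialization can.
        def work(group):
            return sum(group) + offset
        return work

    function = make_worker(10)
    groups = [[1, 2], [3, 4], [5, 6]]

    # Works with the default loky backend:
    print(Parallel(n_jobs=2)(delayed(function)(g) for g in groups))

    # The multiprocessing equivalent fails with a pickling error:
    # from multiprocessing import Pool
    # with Pool(2) as pool:
    #     pool.map(function, groups)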

How can I forcefully kill processes spawned by joblib.Parallel and release memory? My environment is the following: Python 3.8, Ubuntu Linux.

  • 2
    Аs a temporary solution I've explicitly iterated all over subprocesses of main process and kill them. It is not good solution, since in the general case I can kill processes unrelated to joblib.Parallel. Also, loky output some scary logs to stdout and I haven't found a way to supress it. – Ivan Sudos May 12 '21 at 18:22
  • Any updates on this issue? I have the same problem. – Petras Purlys Jun 05 '21 at 09:32
  • 1
    @PetrasPurlys yes, I've investigated a bit deepy joblib code and it seems that keeping the processes alive is intentional solution for loky backend - they represent something like implicit pool and await for another invokation. You can adhere my temporary solution (see above) and enhance it with tracking of spawned processes PIDs. However I think the correct way is to ensure that everything is cleaned up within each process before completion. In that way it won't consume memory after completion. – Ivan Sudos Jun 05 '21 at 10:40
  • Also, as I understand it, loky allows some state to be shared between the main process and the workers. Thus you can easily leak memory, for example by opening a pyplot figure through the global pyplot interface inside a worker and not closing the figure afterwards (see the sketch after these comments). I think the same can happen with database connections, HTTP sessions and so on. – Ivan Sudos Jun 05 '21 at 10:43
  • Same problem here. Can you share some code to track and kill the processes spawned by joblib? I've found this issue, but it doesn't seem to help: https://github.com/joblib/joblib/issues/945 – Gilian Jun 05 '21 at 14:53
  • @Gilian I've written an answer below summarizing everything. Hope it helps. – Ivan Sudos Jun 06 '21 at 15:21
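
To make the leak pattern described in the comments concrete, a minimal sketch (leaky and tidy are made-up names, not code from the question):

    import matplotlib
    matplotlib.use('Agg')  # headless backend suitable for worker processes
    import matplotlib.pyplot as plt

    def leaky(group):
        # Uses the global pyplot state: the figure stays registered in the
        # worker's figure manager, and because loky keeps the worker alive,
        # that memory is never released.
        plt.figure()
        plt.plot(group)

    def tidy(group):
        fig, ax = plt.subplots()
        ax.plot(group)
        plt.close(fig)  # explicitly release the figure before returning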

2 Answers


What I can wrap up after investigating this myself:

  1. joblib.Parallel is not obliged to terminate processes after a successful single invocation
  2. The loky backend doesn't physically terminate its workers, and this is an intentional design choice explained by the authors: Loky Code Line
  3. If you want to explicitly release the workers, you can use my snippet:
    import psutil
    from joblib import Parallel, delayed

    # Record the PIDs of subprocesses that exist before the parallel call
    current_process = psutil.Process()
    subproc_before = {p.pid for p in current_process.children(recursive=True)}

    grouped_data = Parallel(n_jobs=14)(delayed(function)(group) for group in grouped_data)

    # Any new subprocess must have been spawned by joblib.Parallel
    subproc_after = {p.pid for p in current_process.children(recursive=True)}
    for subproc in subproc_after - subproc_before:
        print('Killing process with pid {}'.format(subproc))
        psutil.Process(subproc).terminate()
  4. The code above is not thread/process safe. If you have another source of spawned subprocesses, you should block its execution while this snippet runs.
  5. Everything above is valid for joblib version 1.0.1
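
A less invasive alternative, assuming joblib ships loky under joblib.externals.loky (true at least for joblib 1.0.x), is to ask the implicit worker pool itself to shut down instead of killing PIDs from outside:

    from joblib import Parallel, delayed
    from joblib.externals.loky import get_reusable_executor

    grouped_data = Parallel(n_jobs=14)(delayed(function)(group) for group in grouped_data)

    # Shut down the reusable executor that the loky backend keeps alive;
    # the next Parallel call will simply spawn a fresh pool of workers.
    get_reusable_executor().shutdown(wait=True)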

So, taking into account point 2 of Ivan Sudos's answer, would it be judicious to create a new class wrapped around LokyBackend that overrides the terminate() method? E.g.,

from joblib._parallel_backends import LokyBackend  # note: a private joblib API

class MyLokyWrapper(LokyBackend):

    def terminate(self):
        if self._workers is not None:
            # With kill_workers=True, joblib terminates the remaining workers
            # and their descendants "brutally" using SIGKILL; kill_workers=False
            # lets them finish and exit gracefully.
            self._workers.terminate(kill_workers=False)
            self._workers = None

        self.reset_batch_stats()
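
If you go this route, the wrapper could be plugged in through joblib's public backend-registration API; a sketch, with the made-up backend name 'loky_release':

    from joblib import Parallel, delayed, parallel_backend, register_parallel_backend

    register_parallel_backend('loky_release', MyLokyWrapper)

    with parallel_backend('loky_release', n_jobs=14):
        grouped_data = Parallel()(delayed(function)(group) for group in grouped_data)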