Dask distributed workers always leak memory when running many tasks

Question

What are some strategies to work around or debug this?

distributed.worker - WARNING - Memory use is high but worker has no data to store to disk. Perhaps some other process is leaking memory? Process memory: 26.17 GB -- Worker memory limit: 32.66 GB

Basically, am just running lots of parallel jobs on single machine but but a dask-scheduler and have tried various numbers of workers. Any time I launch a large number of jobs the memory gradually creeps up over time and only goes down when I bounce the cluster.

I am trying to use fire_and_forget. Will .release() the futures help? I am typically launching these tasks via client.submit from the REPL and then terminating the REPL.

Would be happy to occasionally bounce workers and add some retry patterns if that is the correct way to use dask with leaky libraries.

UPDATE:

I have tried limited worker memory to 2 GB, but am still getting this error. When the error happens it seems to go into some sort of unrecoverable loop continually printing the error and no compute happens.

score 2 · Answer 1 · answered Oct 09 '19 at 00:51

2

Dask isn't leaking the memory in this case. Something else is. Dask is just telling you about it. Something about the code that you are running with Dask seems to be leaking something.

answered Oct 09 '19 at 00:51

MRocklin

55,641
23
163
235

Yes, it is likely tensorflow in this case. Looking for ways of bouncing workers when memory hits a threshold (and not the entire cluster) with some sort of retry overlay. – mathtick Oct 10 '19 at 06:35
1

You might want to look at the implementation of the `lifetime=` keyword. One could probably adapt that to do what you want. – MRocklin Oct 11 '19 at 02:04

Dask distributed workers always leak memory when running many tasks

1 Answers1

Linked