What are some strategies to work around or debug this?
distributed.worker - WARNING - Memory use is high but worker has no data to store to disk. Perhaps some other process is leaking memory? Process memory: 26.17 GB -- Worker memory limit: 32.66 GB
Basically, I am just running lots of parallel jobs on a single machine with a dask-scheduler, and I have tried various numbers of workers. Any time I launch a large number of jobs, memory gradually creeps up over time and only goes down when I bounce the cluster.
I am trying to use fire_and_forget. Will calling .release() on the futures help? I typically launch these tasks via client.submit from the REPL and then terminate the REPL.
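For reference, here is roughly how I am launching things (the scheduler address and the task function are placeholders for my real setup):

```python
from dask.distributed import Client, fire_and_forget

client = Client("tcp://scheduler-address:8786")  # placeholder scheduler address

def leaky_task(x):
    # stand-in for the real work, which calls a library that leaks memory
    return x * 2

futures = [client.submit(leaky_task, i, pure=False) for i in range(1000)]

# fire_and_forget asks the scheduler to keep running the tasks even after
# the local futures (and the REPL holding them) go away
fire_and_forget(futures)

# does explicitly releasing the futures make any difference here,
# or is dropping the references / exiting the REPL equivalent?
for fut in futures:
    fut.release()
```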
I would be happy to occasionally bounce workers and add some retry patterns if that is the correct way to use Dask with leaky libraries; something like the sketch below is what I had in mind.
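A minimal sketch of the bounce-and-retry pattern I mean (again, the address and task are placeholders; batch sizes are arbitrary):

```python
from dask.distributed import Client

client = Client("tcp://scheduler-address:8786")  # placeholder address

def leaky_task(x):
    # stand-in for the real, leaky library call
    return x * 2

# retries=2 resubmits a task if its worker dies partway through,
# which should cover work lost when a leaky worker gets bounced
futures = [client.submit(leaky_task, i, retries=2, pure=False) for i in range(1000)]
results = client.gather(futures)

# bounce every worker between batches to reclaim leaked memory;
# note this also clears any data currently held on the cluster
client.restart()
```

I also see --lifetime / --lifetime-restart options on dask-worker that look like they would do the bouncing automatically, if that is the recommended route.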
UPDATE:
I have tried limiting worker memory to 2 GB, but I am still getting this error. When the error happens, the worker seems to go into an unrecoverable loop, continually printing the error, and no compute happens.
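For completeness, this is roughly how I am applying the limit (the worker count, threshold fractions, and limit value are just what I have been experimenting with):

```python
import dask
from dask.distributed import Client, LocalCluster

# set the worker memory thresholds before the workers start;
# my understanding is these fractions are applied against memory_limit
dask.config.set({
    "distributed.worker.memory.target": 0.6,     # start spilling to disk
    "distributed.worker.memory.spill": 0.7,
    "distributed.worker.memory.pause": 0.8,      # stop accepting new tasks
    "distributed.worker.memory.terminate": 0.95, # kill and restart the worker
})

cluster = LocalCluster(n_workers=8, threads_per_worker=1, memory_limit="2GB")
client = Client(cluster)
```

Since the leaked memory is not task data, I assume there is nothing for the worker to spill, which would explain the "no data to store to disk" wording in the warning, but I am not sure why it never recovers.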