
I'm trying to set up Dask on a cluster that uses SLURM. The client is successfully created and scaled; however, at the line

with joblib.parallel_backend('dask'):

the operation hits the worker timeout, and I get the following error from the SLURM jobs:

/usr/bin/python3: Error while finding module specification for 'distributed.cli.dask_worker' (ModuleNotFoundError: No module named 'distributed')

I have checked that distributed is installed on the cluster's nodes, and I can import it in Python without any issues. Does anyone know why distributed is causing problems?
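
For context, a minimal sketch of the kind of setup described above (not part of the original question), assuming dask_jobqueue is used to create and scale the cluster; the queue name, cores, and memory values are placeholders:

import joblib
from dask.distributed import Client
from dask_jobqueue import SLURMCluster
from joblib import Parallel, delayed

# Launch Dask workers as SLURM jobs (queue and resources are placeholders).
cluster = SLURMCluster(queue="normal", cores=4, memory="8GB")
cluster.scale(jobs=2)  # submits two worker jobs via sbatch

# Connect a client to the scheduler running in this process.
client = Client(cluster)

# Route joblib work through the Dask workers; this is the line that
# times out when the worker jobs fail to start.
with joblib.parallel_backend("dask"):
    results = Parallel()(delayed(pow)(i, 2) for i in range(10))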

– rgswope
  • Hi, interesting. Not sure if this might be of interest: https://github.com/dask/dask/issues/2036 – IronMan Aug 27 '20 at 23:52
  • @IronMan I actually did come across that issue and tried installing dask[complete], but that didn't work. – rgswope Aug 28 '20 at 04:00
  • 1
    Are you sure PYTHONPATH is positionned in the same way when logging onto the nodes and when submitting the task through srun or sbatch ? (try sbatch --export=ALL) – PilouPili Aug 28 '20 at 14:10
  • Creating a new conda environment seems to have fixed the issue. I have a feeling it had something to do with package version mismatches between workers and scheduler. Now I'm getting this error: "distributed.worker - WARNING - Heartbeat to scheduler failed distributed.worker - INFO - Connection to scheduler broken. Reconnecting..." but that seems to be a separate issue. – rgswope Aug 29 '20 at 17:45
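
One way to check the PYTHONPATH point raised above is to run a small diagnostic with the same interpreter both interactively on a login node and through srun or sbatch, then compare the output (a hypothetical check, not from the original thread):

import sys

# Show which interpreter and module search path this process sees.
print("executable:", sys.executable)
print("sys.path:", sys.path)

try:
    import distributed
    print("distributed found at:", distributed.__file__)
except ModuleNotFoundError as exc:
    print("distributed not importable:", exc)

If the srun run reports a different executable, or its sys.path is missing the site-packages directory that contains distributed, the worker jobs are starting a different Python than the one tested interactively.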

3 Answers

Making a fresh conda environment with dask[complete] seems to have worked.

– rgswope

You do not have the distributed library installed. This commonly happens for a few reasons:

  1. You did pip install dask rather than pip install dask[complete] or conda install dask

  2. You installed into a different Python executable than the one that runs on your machine

    I see that you're using /usr/bin/python3. To be extra safe, try /usr/bin/python3 -m pip install "dask[complete]" (see the sketch after this list)

  3. Your worker machines don't share the same file system as your login nodes
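
If the workers are being launched with the wrong interpreter, dask-jobqueue can be told which Python executable worker jobs should use. A minimal sketch (not from the original answer), assuming a dask_jobqueue.SLURMCluster setup; the path, queue, and resources are placeholders:

from dask_jobqueue import SLURMCluster

# Point worker jobs at the environment that actually has distributed
# installed, instead of the system /usr/bin/python3.
cluster = SLURMCluster(
    queue="normal",  # placeholder queue name
    cores=4,
    memory="8GB",
    python="/path/to/env/bin/python",  # placeholder path to the right interpreter
)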

– MRocklin
  • As I said in the original post, I did ensure that distributed was installed on all the nodes. It seems to have been an issue with package locations/Python paths, though, because using a conda env solved the issue. – rgswope Aug 31 '20 at 16:35

I tried all of the above, but this one did it for me:

pip install distributed

pip install "dask[complete]"

Also, if you're using PyCharm, just search for these two packages and install them from the interpreter settings.
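
To confirm the installs landed in the interpreter that actually runs your code, a quick check (hypothetical, not from the original answer):

# Run this with the same interpreter your jobs (or PyCharm) use.
import dask
import distributed

print("dask", dask.__version__, "from", dask.__file__)
print("distributed", distributed.__version__, "from", distributed.__file__)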

– MaYSaM