I'm trying to run dask on an HPC system that uses NFS for storage. As such, I want to configure dask to use local storage for scratch space. Each cluster node has a /scratch/ folder that all users can write into, with instructions to put scratch files in /scratch/<username>/<jobid>/.

I have a bit of code set up as follows:

import dask_jobqueue
from distributed import Client

cluster = dask_jobqueue.SLURMCluster(
    queue='high',
    cores=24,
    memory='60GB',
    walltime='10:00:00',
    local_directory='/scratch/<username>/<jobid>/',
)

cluster.scale(1)
client = Client(cluster)

However, I have a problem: the directory doesn't exist ahead of time, both because I don't know which node the workers will land on and because the path depends on the SLURM job ID, which is unique to every job. As a result, my code fails with:

Process Dask Worker process (from Nanny):
Traceback (most recent call last):
  File "/home/lsterzin/anaconda3/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/home/lsterzin/anaconda3/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/home/lsterzin/anaconda3/lib/python3.7/site-packages/distributed/process.py", line 191, in _run
    target(*args, **kwargs)
  File "/home/lsterzin/anaconda3/lib/python3.7/site-packages/distributed/nanny.py", line 699, in _run
    worker = Worker(**worker_kwargs)
  File "/home/lsterzin/anaconda3/lib/python3.7/site-packages/distributed/worker.py", line 497, in __init__
    self._workspace = WorkSpace(os.path.abspath(local_directory))
  File "/home/lsterzin/anaconda3/lib/python3.7/site-packages/distributed/diskutils.py", line 118, in __init__
    self._init_workspace()
  File "/home/lsterzin/anaconda3/lib/python3.7/site-packages/distributed/diskutils.py", line 124, in _init_workspace
    os.mkdir(self.base_dir)
FileNotFoundError: [Errno 2] No such file or directory: '/scratch/<user>/<jobid>'

I can't create the directory without knowing what node the dask workers will be running on, and I can't create the cluster with dask_jobqueue without the directory already existing. What's the best way to work around this?

– lsterzinger

2 Answers


Thank you for the well-phrased question, @lsterzinger.

I've pushed up a fix here that might help: https://github.com/dask/distributed/pull/3928

We'll see what the community says.
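
Based on the traceback, the immediate failure comes from os.mkdir, which cannot create missing parent directories. Whatever the linked PR does exactly, the general shape of such a fix is to create the intermediate directories as well, e.g. with os.makedirs. A minimal illustration of that idea (not the actual patch; the placeholders are the question's):

import os

path = '/scratch/<username>/<jobid>'  # placeholders as in the question

# os.mkdir creates only the last path component and raises
# FileNotFoundError when a parent such as /scratch/<username> is missing.
# os.makedirs creates every missing parent, and exist_ok=True makes it
# a no-op if the directory already exists.
os.makedirs(path, exist_ok=True)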

– MRocklin

I think this can be done with /scratch/$USER/$SLURM_JOB_ID. If that doesn't work, you could define the local directory through the configuration file instead: https://jobqueue.dask.org/en/latest/configuration-setup.html#local-storage
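
For illustration, a minimal sketch of that first suggestion, assuming the generated job script's shell expands $USER and $SLURM_JOB_ID on the worker node before the worker starts (the other parameters are copied from the question):

import dask_jobqueue
from distributed import Client

# $USER and $SLURM_JOB_ID are expanded on the worker node, so each
# worker ends up with a per-user, per-job scratch path that the client
# never needs to know in advance
cluster = dask_jobqueue.SLURMCluster(
    queue='high',
    cores=24,
    memory='60GB',
    walltime='10:00:00',
    local_directory='/scratch/$USER/$SLURM_JOB_ID',
)

cluster.scale(1)
client = Client(cluster)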

The example configurations might be useful for you as well: https://jobqueue.dask.org/en/latest/configurations.html
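
As a sketch of the configuration-file route, something like this in ~/.config/dask/jobqueue.yaml should set the same option declaratively (verify the exact keys against the linked docs):

jobqueue:
  slurm:
    local-directory: /scratch/$USER/$SLURM_JOB_ID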

– quasiben
  • The problem is that the directory I need to use as scratch space doesn't exist, but I can't create it ahead of time because I don't know the node the workers will be running on – lsterzinger Jun 27 '20 at 15:54
  • The confusion may have come because I replaced my actual user/jobid in the question with `<username>` and `<jobid>` – lsterzinger Jun 27 '20 at 17:01