I'm running Dask over SLURM via dask-jobqueue, and I have been getting three errors pretty consistently...

Basically my question is: what could be causing these failures? At first glance the problem looks like too many workers writing to disk at once, or my workers forking into many other processes, but that's pretty difficult to track. I can SSH into the node, but I'm not seeing an abnormal number of processes, and each node has a 500 GB SSD, so I shouldn't be writing excessively.

Everything below this point is just information about my configuration. My setup is as follows:

from dask_jobqueue import SLURMCluster

cluster = SLURMCluster(cores=1, memory=f"{args.gbmem}GB", queue='fast_q',
                       name=args.name, env_extra=["source ~/.zshrc"])
cluster.adapt(minimum=1, maximum=200)

client = await Client(cluster, processes=False, asynchronous=True)

I suppose I'm not even sure whether processes=False should be set.

I run this starter script via sbatch with 4 GB of memory, 2 cores (-c) (even though I expect to only need 1), and 1 task (-n). This sets off all of my jobs via the SLURMCluster config from above. I dumped my SLURM submission scripts to files and they look reasonable.
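
For reference, this is roughly how I dumped a submission script for inspection; job_script() is the dask-jobqueue method that renders it, and the "4GB" here just stands in for my actual f"{args.gbmem}GB" value:

from dask_jobqueue import SLURMCluster

# "4GB" is a placeholder for the real memory request.
cluster = SLURMCluster(cores=1, memory="4GB", queue='fast_q',
                       env_extra=["source ~/.zshrc"])
# job_script() renders the #SBATCH header and the worker command
# without actually submitting anything.
print(cluster.job_script())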

Each job is not complex: it is a subprocess.call() to a compiled executable that takes 1 core and 2-4 GB of memory. I require the client call and further calls to be asynchronous because I have a lot of conditional computations, so each worker, when loaded, should consist of 1 Python process, 1 running executable, and 1 shell (a sketch of this pattern follows the ulimit output below). Imposed by the scheduler we have:

>> ulimit -a
-t: cpu time (seconds)              unlimited
-f: file size (blocks)              unlimited
-d: data seg size (kbytes)          unlimited
-s: stack size (kbytes)             8192
-c: core file size (blocks)         0
-m: resident set size (kbytes)      unlimited
-u: processes                       512
-n: file descriptors                1024
-l: locked-in-memory size (kbytes)  64
-v: address space (kbytes)          unlimited
-x: file locks                      unlimited
-i: pending signals                 1031203
-q: bytes in POSIX msg queues       819200
-e: max nice                        0
-r: max rt priority                 0
-N 15:                              unlimited

And each node has 64 cores, so I don't really think I'm hitting any limits.
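
The per-task pattern looks roughly like this (a minimal sketch, not my real code; run_binary and /path/to/executable are placeholders):

import subprocess
from dask.distributed import Client

def run_binary(arg):
    # Each task shells out to the compiled executable (1 core, 2-4 GB).
    return subprocess.call(["/path/to/executable", str(arg)])

async def run_pipeline(cluster):
    client = await Client(cluster, asynchronous=True)
    # With an asynchronous client, submit() returns a future that can be
    # awaited directly, which is what makes the conditional logic workable.
    rc = await client.submit(run_binary, 1)
    if rc == 0:
        rc = await client.submit(run_binary, 2)
    return rc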

I'm using a jobqueue.yaml file that looks like this:

slurm:
  name: dask-worker
  cores: 1                 # Total number of cores per job
  memory: 2GB              # Total amount of memory per job (needs units, e.g. 2GB)
  processes: 1                # Number of Python processes per job
  local-directory: /scratch       # Location of fast local storage like /scratch or $TMPDIR
  queue: fast_q
  walltime: '24:00:00'
  log-directory: /home/dbun/slurm_logs
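
A quick way to sanity-check what dask actually loaded from this file (note that in the stock dask-jobqueue config the slurm: block sits under a top-level jobqueue: key, so the above is that excerpt):

import dask
import dask_jobqueue  # noqa: F401 -- importing registers the jobqueue config defaults

# Print the slurm section as dask sees it; values differing from the
# yaml file would point to a config-loading problem.
print(dask.config.get("jobqueue.slurm"))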

I would appreciate any advice at all! Full log is below.

FORK BLOCKING IO ERROR


distributed.nanny - INFO -         Start Nanny at: 'tcp://172.16.131.82:13687'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/dbun/.local/share/pyenv/versions/3.7.0/lib/python3.7/multiprocessing/forkserver.py", line 250, in main
    pid = os.fork()
BlockingIOError: [Errno 11] Resource temporarily unavailable
distributed.dask_worker - INFO - End worker

Aborted!

CAN'T START NEW THREAD ERROR

https://pastebin.com/ibYUNcqD

BLOCKING IO ERROR

https://pastebin.com/FGfxqZEk

EDIT: Another piece of the puzzle: it looks like dask_worker is running multiple multiprocessing.forkserver calls. Does that sound reasonable?

https://pastebin.com/r2pTQUS4


1 Answer


This problem was caused by having ulimit -u too low.

As it turns out, each worker has a few processes associated with it, and the Python ones have multiple threads. In the end you end up with approximately 14 threads per worker counting against your ulimit -u. Mine was set to 512, so with a 64-core system I was likely hitting ~896 (64 × 14). It looks like the maximum threads per process I could have had was about 8 (512 / 64).
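
A quick sketch to check this on a node (the 14-threads-per-worker figure is just the rough count from above):

import resource

# ulimit -u corresponds to RLIMIT_NPROC: the per-user cap on
# processes and threads combined.
soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
print("nproc limit:", soft)

threads_per_worker = 14   # approximate count per dask worker
workers_per_node = 64     # one single-core worker per core
print("estimated demand:", threads_per_worker * workers_per_node)  # ~896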

Solution: in .zshrc (or .bashrc) I added the line

ulimit -u unlimited
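
(This takes effect for the workers because the cluster is created with env_extra=["source ~/.zshrc"], so each SLURM job raises the limit before the dask worker starts.)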

Haven't had any problems since.
