Slurm Exit Code 9: too much time between signals elapsed (32 seconds) - job killed

Question

I am bootstrapping a panel of 2.7M using reghdfe and ppmlhdfe. I am using the Picotte cluster at Drexel, as this is computationally infeasible otherwise.

When I run this, my job is killed, apparently because one of the iterations to compute an estimator takes longer than 30 seconds. How do I change the KillWait parameter in Slurm? (Script below.)

My simple script is:

srun --nodes=1 --ntasks=1 --mem=174gb 
module load stata/mp48/17

stata-mp -b do "/ifs/groups/fichGrp/5c_inventor_level_bootstrap.do"

Are you sure `KillWait` is what you need to change? This affects "The interval, in seconds, given to a job's processes between the SIGTERM and SIGKILL signals upon reaching its time limit." (see https://slurm.schedmd.com/slurm.conf.html#OPT_KillWait). If your job has reached its timelimit, can't you just raise that? — ciaron, Jun 22 '23 at 08:05
Maybe? This is my error message: "srun: Job step aborted: Waiting up to 32 seconds for job step to finish." It occurs after about 30 minutes, though, so I'm not sure it's my allotted job time (e.g. --time=24:00:00). I'm really new to cluster computing. — Torin McFarland, Jun 22 '23 at 13:11
It sounds like your job is aborting for some reason other than the time limit. The "Waiting up to 32 seconds" message only tells you that any remaining processes will be killed. You'll need to look into your log files or application output to find out why your job is stopping early in the first place. — ciaron, Jun 22 '23 at 14:56
Okay, thank you, I'll try to figure this out. When the job is killed, it doesn't produce a log, so I guess I'll keep going by trial and error. — Torin McFarland, Jun 22 '23 at 18:18
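
When the application leaves no log, one way to follow the suggestion above is to ask Slurm's accounting records how the job ended. The sketch below assumes job accounting is enabled on the cluster and that `sacct` is on the path. `JOBID` is a placeholder for the ID reported at submission, and `explain_state` is a hypothetical helper, not part of Slurm. The hints it prints are starting points, not a diagnosis.

```shell
#!/bin/bash
# Sketch: map a Slurm job State (as printed by sacct) to a likely cause.
# explain_state is a hypothetical helper written for this example.
explain_state() {
    case "$1" in
        TIMEOUT*)       echo "time limit reached: raise --time" ;;
        OUT_OF_MEMORY*) echo "memory limit exceeded: raise --mem" ;;
        CANCELLED*)     echo "cancelled by a user or an administrator" ;;
        COMPLETED*)     echo "finished normally" ;;
        *)              echo "other failure: check the application log" ;;
    esac
}

# Query accounting only when a job ID is supplied and sacct is available.
# JOBID is a placeholder for the ID that srun or sbatch reported.
if [ -n "${JOBID:-}" ] && command -v sacct >/dev/null 2>&1; then
    # Full table: compare Elapsed with Timelimit and MaxRSS with ReqMem.
    sacct -j "$JOBID" --format=JobID,State,ExitCode,Elapsed,Timelimit,MaxRSS,ReqMem
    # -X: allocation line only, -n: no header; keep the first word of State.
    state=$(sacct -j "$JOBID" -X -n -o State | awk 'NR==1 {print $1}')
    explain_state "$state"
fi
```

Saved under a name of your choosing (say `check_job.sh`, a made-up filename), it could be run as `JOBID=123456 bash check_job.sh`. In the `ExitCode` column, `sacct` prints `exit:signal`, so a value ending in `:9` means a process was ended by SIGKILL.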
