I'm training several neural networks on a server at my university. Because resources are shared among all students, a job scheduler (Slurm) queues everyone's runs, and each job is subject to a 24-hour time limit. Once a job exceeds that limit, it is killed to free resources for other users.

Specifically, I'm training GANs and need more than 24 hours of training time. Right now I save checkpoints of my model so that I can resume from the point reached before the process was killed. However, I have to run the same command again manually every 24 hours.

For this reason, I would like to schedule this execution automatically every 24 hours.

Currently I'm using 'tmux' to run the command so that I can close the terminal.

Any suggestions on how to automate this kind of execution?

Thank you in advance!

mgrau

1 Answer

You can set up your job to resubmit itself automatically when it is close to the time limit.

Note that Slurm's time granularity is 1 minute, so don't set the signal timer to anything less than 60 seconds.

#!/bin/bash
#SBATCH --signal=B:USR1@300  # Ask Slurm to send USR1 to the batch shell 300 seconds before the time limit
#SBATCH -t 24:00:00

resubmit() {
  echo "It's time to resubmit"  # <----- Run whatever is necessary. Ideally, resubmit the job using the checkpoint files
  sbatch ...
}

trap "resubmit" USR1  # Register the signal handler

YOUR_TRAINING_COMMAND &  # Must run in the background; otherwise bash will not process the signal until this command finishes

wait  # Wait for all background processes to finish. If a signal is received, wait returns early, the handler runs, and the script ends.
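
The reason the training command has to run in the background can be checked without Slurm. The following is a minimal sketch, not part of the job script: `sleep 5` is a hypothetical stand-in for the training command, and a subshell that sends `USR1` after one second plays the role of Slurm's warning signal. The variable names are illustrative only.

```shell
#!/bin/bash
# Local demo of the trap / background / wait pattern (no Slurm needed).

resubmitted=0
resubmit() {
  # In the real job script this is where the sbatch call would go.
  resubmitted=1
  echo "handler ran"
}
trap resubmit USR1   # Register the signal handler

sleep 5 &            # Stand-in for the training command, run in the background
worker=$!

# Stand-in for Slurm: send USR1 to this script after one second.
( sleep 1; kill -USR1 $$ ) &

# wait is interrupted by the trapped signal and returns a status above 128.
status=0
wait "$worker" || status=$?
echo "wait returned $status"

# Clean up the stand-in worker, which is still running at this point.
kill "$worker" 2>/dev/null || true
wait "$worker" 2>/dev/null || true
```

If `sleep 5` were run in the foreground instead, bash would defer the handler until that command had finished, which in a real job could be too late to resubmit.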
Carles Fenoy