
I have a question for a very specific use case. I'll start by giving a bit of background:

I am trying to train a deep learning model in Keras and want to do 10-fold cross-validation to check the training stability of the model. Usually I create snakemake workflows and execute them on a Slurm cluster. Because GPU nodes are limited, I would like to checkpoint my model, stop the job, and resubmit it once in a while so that I don't block the GPUs. The goal is to train the model iteratively with short-running jobs.

Now to my questions:

  1. Is there a way to resubmit a job a certain number of times/until a condition is met?
  2. Is there another clever way to train a model iteratively without having to manually submit the job?

2 Answers


For this, you need to submit the job with the command

llsubmit job.sh

Submit the shell script or batch job file as many times as you need it to run. Once a job finishes and resources become available, the next copy of the same script (already submitted and waiting in the queue) starts automatically.

ML85

Here are a few suggestions:

  • Just train your network. It's up to the scheduler to keep the GPUs from being blocked, and running 10 short jobs instead of 1 long job will probably lead to the same priority.
  • You can specify --restart-times to rerun a failed job up to a given number of times. The catch is that snakemake also removes the outputs of failed jobs. The workaround is to checkpoint your model to a temp file (one not listed in the output directive of the rule) and exit your training with an error to signal to snakemake that it needs to run again. The inelegant part is that you have to set the restart count to a large value, or make sure your training code knows when it is running the final attempt and needs to save the actual output. You can get the attempt number as a resource; I'm not sure the parameter is available in other directives. Also, any job that fails will be resubmitted, which is not a great option during development.
  • You can make your checkpoint files outputs. This again assumes you want to run a set number of times. Your rule all will look for a file like final.checkpoint, which depends on 10.checkpoint, which depends on 9.checkpoint, and so on. With a fancy enough input function this can be implemented in one rule, where 1.checkpoint depends on nothing (or perhaps on your training data).
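
The last suggestion could be sketched roughly as follows. This is only an illustration, not a tested workflow: the step count N_STEPS, the rule names, the train.py script, and its --resume-from/--save-to flags are all made up for the example. The Python part is the input function; the commented part shows how it might be wired into a Snakefile.

```python
from types import SimpleNamespace

N_STEPS = 10  # assumed number of short training jobs in the chain


def previous_checkpoint(wildcards):
    """Input function: the checkpoint that step `wildcards.step` resumes from.

    Step 1 has no predecessor, so it returns an empty list (no input files).
    """
    step = int(wildcards.step)
    if step <= 1:
        return []
    return f"{step - 1}.checkpoint"


# Possible use in a Snakefile (train.py and its flags are hypothetical):
#
# rule all:
#     input: "final.checkpoint"
#
# rule train_step:
#     input: previous_checkpoint
#     output: "{step}.checkpoint"
#     wildcard_constraints: step=r"\d+"   # keeps "final" from matching {step}
#     shell: "python train.py --resume-from '{input}' --save-to {output}"
#
# rule finalize:
#     input: f"{N_STEPS}.checkpoint"
#     output: "final.checkpoint"
#     shell: "cp {input} {output}"

if __name__ == "__main__":
    # Print the dependency chain that the input function produces.
    for step in range(1, N_STEPS + 1):
        dep = previous_checkpoint(SimpleNamespace(step=str(step)))
        print(f"{step}.checkpoint <- {dep or 'nothing'}")
```

Each step then becomes its own short cluster job, and because every numbered checkpoint is a declared output, an interrupted workflow picks up from the last checkpoint that finished.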
Troy Comi