I have a question for a very specific use case. I'll start by giving a bit of background:
I am trying to train a deep learning model in keras and want to do 10 fold cross validation to check training stability of the model. Usually I create snakemake workflows and execute them on a slurm cluster. Due to limited GPU nodes, I would like to checkpoint my model, stop the job and resubmit once in a while to not block the GPUs. The goal of this would be to train the model iteratively with short running jobs.
Now to my questions:
- Is there a way to resubmit a job a certain number of times/until a condition is met?
- Is there another clever way to train a model iteratively without having to manually submit the job?