
I have a .slurm file that can be run on a Linux GPU cluster. The file looks like this:

#!/bin/bash
#SBATCH -o ./myrepo/output.log
#SBATCH -J jobname
#SBATCH --gres=gpu:V100:1
#SBATCH -c 5
source /home/LAB/anaconda3/etc/profile.d/conda.sh
conda activate cuda9.1
CUDA_VISIBLE_DEVICES=0 python train.py

Now I want to add a folder to the log path, so that it looks something like this:

#!/bin/bash
#SBATCH -o ./myrepo/**currenttime**/output.log
#SBATCH -J jobname
#SBATCH --gres=gpu:V100:1
#SBATCH -c 5
source /home/LAB/anaconda3/etc/profile.d/conda.sh
conda activate cuda9.1
CUDA_VISIBLE_DEVICES=0 python train.py

I have tried:

#!/bin/bash
time=`date +%Y%m%d-%H%M%S`
#SBATCH -o ./myrepo/${time}/output.log
#SBATCH -J jobname
#SBATCH --gres=gpu:V100:1
#SBATCH -c 5
source /home/LAB/anaconda3/etc/profile.d/conda.sh
conda activate cuda9.1
CUDA_VISIBLE_DEVICES=0 python train.py

But it failed. It seems that #SBATCH directives have to come right after #!/bin/bash: Slurm stops reading them at the first regular command, and shell variables like ${time} are not expanded inside #SBATCH lines anyway.
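A possible workaround, sketched here under the assumption that the batch script is saved as train.slurm (a name not in the original post), is to compute the timestamp in a small wrapper and pass the output path on the sbatch command line, where shell substitution does work and where command-line options override the in-file #SBATCH -o:

time=$(date +%Y%m%d-%H%M%S)
# Slurm does not create missing output directories, so create the folder first
mkdir -p ./myrepo/${time}
sbatch -o ./myrepo/${time}/output.log train.slurm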

And the following one succeeds, but with it I can't run more than one job at a time, because every job writes to the same ./myrepo/output.log.

#!/bin/bash
#SBATCH -o ./myrepo/output.log
#SBATCH -J jobname
#SBATCH --gres=gpu:V100:1
#SBATCH -c 5
source /home/LAB/anaconda3/etc/profile.d/conda.sh
conda activate cuda9.1
time=`date +%Y%m%d-%H%M%S`
CUDA_VISIBLE_DEVICES=0 python train.py
mkdir -p ./myrepo/${time}
cp ./myrepo/output.log ./myrepo/${time}/output.log

How can I solve this problem?

  • Use the JobID (`%j`) in your standard output instead of the current time, as in the default standard output. Or just use the default one. – Poshi Aug 28 '19 at 07:43

1 Answer


This works for me:

#!/bin/bash
#SBATCH -o ./myrepo/output_%j.log
#SBATCH -J jobname
#SBATCH --gres=gpu:V100:1
#SBATCH -c 5
# create a per-run folder named after the submission time
time=$(date +%Y%m%d-%H%M%S)
mkdir -p ./myrepo/${time}
source /home/LAB/anaconda3/etc/profile.d/conda.sh
conda activate cuda9.1
CUDA_VISIBLE_DEVICES=0 python train.py
# %j above expands to the job ID; in the script body the same value is $SLURM_JOB_ID
mv ./myrepo/output_$SLURM_JOB_ID.log ./myrepo/${time}/output.log

`#SBATCH -o ./myrepo/output_%j.log` means that the output file is named `output_<jobid>.log`: in #SBATCH directives you can use `%j` as a placeholder for the job ID, but in the bash part of the script you have to use `$SLURM_JOB_ID`. The last line then moves the log into the folder named after the current time. This way you can run more than one job at a time, and each job's results end up in a separate folder.
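For example (hypothetical job IDs, assuming the script above is saved as train.slurm), two submissions running at the same time no longer collide, because each one writes to its own log file until it is moved:

sbatch train.slurm    # e.g. job 1001 -> writes ./myrepo/output_1001.log
sbatch train.slurm    # e.g. job 1002 -> writes ./myrepo/output_1002.log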

  • I think it will be easier for you if you name all the files with the same nomenclature. Use the JobID everywhere instead of JobID in one place and timestamp in another. That way, it will be easier to link all related files together. – Poshi Aug 28 '19 at 08:46
  • The answer will be useful to others only if you explain your solution. You could add, for example something like " `%j` will provide the JobID which can be used to name the output file" etc. – j23 Aug 28 '19 at 08:47
  • @j23 You're right, more explanation would be better. `#SBATCH -o ./myrepo/output_%j.log` means that your output file is named `output_<jobid>.log`; in SBATCH you can use `%j` for the job ID, but in bash you have to use `$SLURM_JOB_ID`, and the last line moves the log into the current-time folder. This way you can run more than one job, with each result in a separate folder. – niuyuhang03 Aug 28 '19 at 09:14