I am running a job that requires a large amount of memory on a cluster managed by Slurm. I use the --output flag
to save the job's standard output, and this works as expected when the job finishes without error.
However, if the job hits an out-of-memory error on the node, any output produced before the error occurred does not appear in the output.log
file, so output.log
only contains output written after the point at which the error happened.
Is there a way to get Slurm to save all of the output to output.log when a job fails,
so that I can see at which point in the job the error occurred?
Here is the batch script I am using:
#!/bin/bash -l
#SBATCH --account=qmech
#SBATCH --job-name=job
#SBATCH --exclusive
#SBATCH -C mem768
#SBATCH --mem=750gb
#SBATCH -c 32 # CPUs per task
#SBATCH --time=01:00:00
#SBATCH --output=output.log # standard output file
#SBATCH --error=error.log # standard error file
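To make the behaviour concrete, here is a stripped-down stand-in for the job body (my real program is much larger; this loop is only an illustration). It prints progress lines and then keeps allocating memory until the step exceeds the memory limit and is killed:

# Hypothetical job body, appended after the #SBATCH headers above.
# This is only a stand-in for the real program, to illustrate the behaviour.
python <<'EOF'
chunks = []
for i in range(10000):
    # stdout is redirected to output.log, so Python block-buffers these lines
    print("iteration", i)
    # allocate ~1 GiB per iteration so the step eventually exceeds --mem and is killed
    chunks.append(bytearray(1024 ** 3))
EOF

With something along these lines, the last progress lines written before the step is killed are the ones that end up missing from output.log, which matches what I see with my real job.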
I have looked through the Slurm documentation but am not aware of any parameter that would achieve this.