
I have a Fortran code that I have to run on a cluster with Slurm. I compiled the code in the home directory (which is mounted on all the cluster nodes) and have always run it from there. However, the partition where the home is mounted has only about 250 GB. I have to run many different simulations that generate many output files, so they quickly become heavy, and my colleagues and I constantly face disk-space issues (we have to stop simulations, move the files manually, and restart them). We move them to a secondary disk with 5 TB of capacity.

I was wondering if there is a way to run the simulations with sbatch from the home directory and save all the output files on the secondary disk (which is not shared between the nodes). I tried the --output flag but it is not working.

The bash script I run with sbatch is simple:

#!/bin/bash
#SBATCH --partition=cpu
#SBATCH --job-name=k1_01
#SBATCH --mem=16G
#SBATCH --time=90-0:0
#SBATCH --output=output.log
#SBATCH --nodelist=node13
./program < input.in

FYI, the program generates many output files: some are updated on every iteration of the main loop inside the code, and others are created anew, one per step (I have 2000 steps).

Thanks for your help

riky_cv

1 Answer


If the program is coded so as to write its output files in the current working directory, you can simply change the directory with the cd command.

Let's assume the secondary disk is mounted on the compute nodes at /scratch. (Your cluster might define an environment variable pointing to the right location, e.g. $LOCALSCRATCH, $TMP, or $TMPDIR; in that case, replace /scratch with that variable in the script below.)
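If you are unsure whether such a variable exists, a small sketch like this picks a site-defined location when one is set and falls back to /scratch otherwise ($LOCALSCRATCH and $TMPDIR are assumed names; check `env | grep -i -e scratch -e tmp` on a compute node to see what your site actually defines):

```shell
#!/bin/bash
# Prefer a site-defined scratch variable if present; fall back to /scratch.
# LOCALSCRATCH and TMPDIR here are assumptions, not guaranteed to be set.
SCRATCH_BASE="${LOCALSCRATCH:-${TMPDIR:-/scratch}}"
echo "scratch base: $SCRATCH_BASE"
```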

Then your submission script could look like this:

#!/bin/bash
#SBATCH --partition=cpu
#SBATCH --job-name=k1_01
#SBATCH --mem=16G
#SBATCH --time=90-0:0
#SBATCH --output=output.log
#SBATCH --nodelist=node13

SCRATCH=/scratch/$USER/$SLURM_JOB_ID
mkdir -p "$SCRATCH" && cd "$SCRATCH" || exit 1

"$SLURM_SUBMIT_DIR/program" < "$SLURM_SUBMIT_DIR/input.in"

cp final.res "$SLURM_SUBMIT_DIR" && rm -rf "$SCRATCH"

The script first defines a $SCRATCH variable based on your username and the current Slurm job ID. The computation will take place in that directory, located on the secondary disk. (If the disk is mounted somewhere other than /scratch, replace that part with the correct location.)

It then creates the directory referred to by the variable and changes the working directory to it. This way the data are organised properly on the secondary disk.

As we changed the directory, program must be referred to by an absolute path; the same applies to the input file, which must either be copied into the scratch directory first or read via $SLURM_SUBMIT_DIR/input.in. The $SLURM_SUBMIT_DIR variable holds the path from which the sbatch command was run, so as long as you run sbatch in the directory where the submission script, the program, and input.in reside, $SLURM_SUBMIT_DIR/program will correctly point to the executable.

Then you will want to copy the results back to the home directory, again using the $SLURM_SUBMIT_DIR variable. (I invented a result file named final.res; make sure to replace it with the list of files you need to keep.) Note that output.log does not need to be copied: since --output was given a relative path, Slurm writes it in the submission directory from the start.
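If the per-step files must be kept as well, a glob can go in the same copy step. A sketch, run from inside the job script after the program finishes, where final.res and step_*.out are placeholder names for whatever your Fortran code actually writes:

```shell
#!/bin/bash
# Copy the named result file plus all per-step files back to the submit dir.
# final.res and step_*.out are hypothetical names; substitute your own.
cd "$SCRATCH" || exit 1
cp final.res step_*.out "$SLURM_SUBMIT_DIR"/
```

Keep in mind that with 2000 steps this copies 2000+ files back to the small home partition, so you may prefer to copy only a subset.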

Finally, note the rm -rf $SCRATCH part; it removes the files related to the current job from the secondary disk to keep it clean. You can remove it if you want the files to stay there. Because it is chained with &&, it only runs if the copy succeeds, so results are not deleted when the copy fails.
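One caveat of the `&&` chain: if the program itself crashes, nothing is cleaned up and dead files accumulate on the scratch disk. A variant of the same script using an EXIT trap (an addition of mine, not part of the original answer) cleans up in either case; it only runs meaningfully under Slurm, since it uses $SLURM_JOB_ID and $SLURM_SUBMIT_DIR, and final.res remains a placeholder name:

```shell
#!/bin/bash
# Same job body as above, but the EXIT trap removes the scratch directory
# whether the program succeeds or fails (as long as the shell handles the exit).
SCRATCH=/scratch/$USER/$SLURM_JOB_ID
mkdir -p "$SCRATCH" && cd "$SCRATCH" || exit 1
trap 'rm -rf "$SCRATCH"' EXIT   # fires on normal exit and on errors

"$SLURM_SUBMIT_DIR/program" < "$SLURM_SUBMIT_DIR/input.in"

cp final.res "$SLURM_SUBMIT_DIR"   # copy results before the trap cleans up
```

Drop the trap line if you want failed runs to leave their files behind for debugging.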

damienfrancois
  • This works perfectly, but only for node 1, where the secondary disk is mounted. If I submit the job on another node it doesn't work. I assume that happens because the storage disk is mounted only on node 1. My question is: is it possible to submit the job on another node and transfer the files back to node 1, directly onto the storage disk, without occupying space in the home? I already tried launching this script ```#!/bin/bash #SBATCH --partition=cpu ... #SBATCH --nodelist=node14 ./program > /scratch/output.out ``` from /scratch/ but it is not working (node 14 can't see /scratch) – riky_cv Sep 30 '22 at 09:04
  • That is a question better asked to the system administrator of that cluster; there are technical solutions, but I am not aware of which ones, if any, are in place in your infrastructure – damienfrancois Sep 30 '22 at 10:57
  • Yeah, I imagined that... I already talked with the IT guys, but they seem not to know that much. I think they have to ask another company to do that job, and it will take weeks probably. I need to start simulations asap, so I wondered if there was another way to do so... Nevermind. Thanks anyway! – riky_cv Oct 03 '22 at 07:07