
I am using SLURM_TMPDIR on Compute Canada for I/O-intensive work, such as cloning large repositories and analyzing their commit histories. The problem is that when the job runs out of its assigned time, I lose the output file that sits inside SLURM_TMPDIR. I read about signal trapping here, but since I am not very experienced with systems programming, my understanding may be inaccurate and I can't achieve what I intend. Here is my batch job script; it never traps the signal or copies the output to my desired location.

#!/bin/bash
#SBATCH --mem=128G
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
#SBATCH --time=0:10:0   
#SBATCH --signal=B:SIGUSR1@120

output_file_name=file_0000.jsonl
echo "Start"

function handle_signal() 
{
    echo 'Moving File'
    cp $SLURM_TMPDIR/<output_file_path> <my_compute_canada_directory>
    exit 2
}

trap 'handle_signal' SIGUSR1


cd $SLURM_TMPDIR
git clone ...

cd ...

module purge

module load java/17.0.2
module load python/3.10

export JAVA_TOOL_OPTIONS="-Xms256m -Xmx5g"

python -m venv res_venv
source res_venv/bin/activate
pip install -r requirements.txt

python data_collector.py ./data/file_0000.csv $output_file_name

wait

echo "Test"

exit 0

But it doesn't even print 'Moving File'. Can someone please guide me on how to use a signal trap together with SLURM_TMPDIR effectively? The script should copy the specified file when the job runs out of its assigned time, and it should also copy it when my Python script finishes executing. Thanks!

not-a-bot

1 Answer


It seems that an srun step needs to be running for the signal to be delivered to the batch script:

Outside srun:

#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --time=00:01:00   
#SBATCH --signal=B:SIGUSR1@50

trap 'echo SIGUSR1 1>&2' SIGUSR1

srun sleep 1
dd if=/dev/zero of=/dev/null 2>/dev/null

Result:

slurmstepd: error: *** JOB 25752715 ON node-2017 CANCELLED AT 2023-05-27T23:47:06 DUE TO TIME LIMIT ***

During srun:

#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --time=00:01:00   
#SBATCH --signal=B:SIGUSR1@50

trap 'echo SIGUSR1 1>&2' SIGUSR1

srun dd if=/dev/zero of=/dev/null 2>/dev/null

Result:

slurmstepd: error: *** JOB 25752755 ON node-2014 CANCELLED AT 2023-05-28T00:01:06 DUE TO TIME LIMIT ***
SIGUSR1
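
Applied to the question's use case, a minimal sketch could look like the following. The destination directory and the hard-coded file name are placeholders, not a tested setup:

#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --time=0:10:0
#SBATCH --signal=B:SIGUSR1@120

output_file_name=file_0000.jsonl
dest_dir=$HOME/results            # placeholder destination on the shared filesystem

copy_output() {
    echo 'Moving File'
    cp "$SLURM_TMPDIR/$output_file_name" "$dest_dir/"
}
trap 'copy_output; exit 2' SIGUSR1

cd "$SLURM_TMPDIR"
# ... git clone, module loads, venv setup as in the question ...

# Run the payload as a job step, in the background, so bash is free to run the trap
srun python data_collector.py ./data/file_0000.csv "$output_file_name" &
wait            # interrupted by the trapped SIGUSR1, 120 s before the time limit

copy_output     # normal completion: copy the result as well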
Fravadona
    Yes, this is also a way, but for my use case, simply running the Python script in the background with & at the end of the command solved the issue. The shell can trap the signal only if the long-running command is in the background or a separate process (like srun). – not-a-bot May 29 '23 at 18:56
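
For reference, a sketch of the pattern described in the comment, assuming the same placeholder file name and destination: the Python process is put in the background so that bash sits in wait (which a trapped signal interrupts) rather than being blocked on a foreground command.

trap 'cp "$SLURM_TMPDIR/file_0000.jsonl" "$HOME/results/"; exit 2' SIGUSR1   # destination is a placeholder

python data_collector.py ./data/file_0000.csv file_0000.jsonl &
wait    # returns early if SIGUSR1 is trapped, letting the copy run before the time limit

cp "$SLURM_TMPDIR/file_0000.jsonl" "$HOME/results/"   # also copy on normal completion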