I have an issue with graceful exiting my slurm jobs with saving data, etc.
I have a signal handler in my program which sets a flag, which is then queried in a main loop and a graceful exit with data saving follows. The general scheme is something like this:
#include <utility>
#include <atomic>
#include <fstream>
#include <unistd.h>
namespace {
std::atomic<bool> sigint_received = false;
}
void sigint_handler(int) {
sigint_received = true;
}
int main() {
std::signal(SIGTERM, sigint_handler);
while(true) {
usleep(10); // There are around 100 iterations per second
if (sigint_received)
break;
}
std::ofstream out("result.dat");
if (!out)
return 1;
out << "Here I save the data";
return 0;
}
Batch scripts are unfortunately complicated because:
- I want hundreds of parallel, low-thread-count independent tasks, but my cluster allows only 16 jobs per user
srun
in my cluster always claims a whole node, even if I don't want all cores, so in order to run multiple processes on a single node I have to usebash
Because of it, batch script is this mess (2 nodes for 4 processes):
#!/bin/bash -l
#SBATCH -N 2
#SBATCH more slurm stuff, such as --time, etc.
srun -N 1 -n 1 bash -c '
./my_program input1 &
./my_program input2 &
wait
' &
srun -N 1 -n 1 bash -c '
./my_program input3 &
./my_program input4 &
wait
' &
wait
Now, to propagate signals sent by slurm, I have even a bigger mess like this (following this answer, in particular double waits):
#!/bin/bash -l
#SBATCH -N 2
#SBATCH more slurm stuff, such as --time, etc.
trap 'kill $(jobs -p) && wait' TERM
srun -N 1 -n 1 bash -c '
trap '"'"'kill $(jobs -p) && wait'"'"' TERM
./my_program input1 &
./my_program input2 &
wait
' &
srun -N 1 -n 1 bash -c '
trap '"'"'kill $(jobs -p) && wait'"'"' TERM
./my_program input3 &
./my_program input4 &
wait
' &
wait
For the most part it is working. But, firstly, I am getting error messeges at the end of output:
run: error: nid00682: task 0: Exited with exit code 143
srun: Terminating job step 732774.7
srun: error: nid00541: task 0: Exited with exit code 143
srun: Terminating job step 732774.4
...
and, what is worse, like 4-6 out of over 300 processes actually fail on if (!out)
- errno
gives "Interrupted system call". Again, guided by this, I guess that my signal handler is called two times - the second one during some syscall under std::ofstream
constructor.
Now,
- How to get rid of slurm errors and have an actual graceful exit?
- Am I correct that signal is sent two times? If so, why, and how can I fix it?