
I have an issue with gracefully exiting my Slurm jobs while saving data, etc.

I have a signal handler in my program which sets a flag; the flag is then queried in the main loop and a graceful exit with data saving follows. The general scheme is something like this:

#include <atomic>
#include <csignal>
#include <fstream>
#include <unistd.h>

namespace {
    // Set asynchronously from the signal handler, polled in the main loop.
    std::atomic<bool> sigterm_received = false;
}

void sigterm_handler(int) {
    sigterm_received = true;
}

int main() {
    std::signal(SIGTERM, sigterm_handler);

    while(true) {
        usleep(10000);  // There are around 100 iterations per second
        if (sigterm_received)
            break;
    }

    // Graceful exit: save the data before terminating.
    std::ofstream out("result.dat");
    if (!out)
        return 1;
    out << "Here I save the data";

    return 0;
}

Batch scripts are unfortunately complicated because:

  • I want hundreds of parallel, low-thread-count independent tasks, but my cluster allows only 16 jobs per user
  • srun on my cluster always claims a whole node, even if I don't want all cores, so to run multiple processes on a single node I have to use bash

Because of this, my batch script is this mess (2 nodes for 4 processes):

#!/bin/bash -l
#SBATCH -N 2
#SBATCH more slurm stuff, such as --time, etc.

srun -N 1 -n 1 bash -c '
    ./my_program input1 &
    ./my_program input2 &
    wait
' &

srun -N 1 -n 1 bash -c '
    ./my_program input3 &
    ./my_program input4 &
    wait
' &

wait

Now, to propagate signals sent by Slurm, I have an even bigger mess like this (following this answer, in particular the double waits):

#!/bin/bash -l
#SBATCH -N 2
#SBATCH more slurm stuff, such as --time, etc.

trap 'kill $(jobs -p) && wait' TERM

srun -N 1 -n 1 bash -c '
    trap '"'"'kill $(jobs -p) && wait'"'"' TERM
    ./my_program input1 &
    ./my_program input2 &
    wait
' &

srun -N 1 -n 1 bash -c '
    trap '"'"'kill $(jobs -p) && wait'"'"' TERM
    ./my_program input3 &
    ./my_program input4 &
    wait
' &

wait

For the most part it is working. But, firstly, I am getting error messages at the end of the output (exit code 143 is 128 + 15, i.e. termination by SIGTERM):

srun: error: nid00682: task 0: Exited with exit code 143
srun: Terminating job step 732774.7
srun: error: nid00541: task 0: Exited with exit code 143
srun: Terminating job step 732774.4
...

and, what is worse, around 4-6 out of over 300 processes actually fail on if (!out): errno gives "Interrupted system call". Again, guided by this, I guess that my signal handler is called twice, the second time during some syscall inside the std::ofstream constructor.

Now,

  1. How do I get rid of the Slurm errors and achieve an actual graceful exit?
  2. Am I correct that the signal is sent twice? If so, why, and how can I fix it?
PKua

1 Answer


Suggestions:

  • trap EXIT, not a signal. EXIT happens once; TERM can be delivered multiple times (see the first sketch after this list)
  • use declare -f to transfer code and declare -p to transfer variables to an unrelated subshell (see the second sketch after this list)
  • kill can fail; I do not think you should && on it
  • use xargs (or parallel) instead of reinventing the wheel with kill $(jobs -p)
  • extract "data" (input1 input2 ...) from "code" (work to be done)
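
For example, a minimal sketch of the first and third points combined. It relies on bash running the EXIT trap even when the shell is killed by a signal (bash-specific behavior), and uses ; rather than && so that wait runs even if kill fails:

#!/bin/bash
# Clean up all background children exactly once, whenever this script
# exits - normally or because Slurm delivered SIGTERM.
trap 'kill $(jobs -p) 2>/dev/null; wait' EXIT

./my_program input1 &
./my_program input2 &
wait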
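
And a sketch of the second point: declare -p prints a source-able definition of a variable, just as declare -f does for a function, so both can be pasted into a completely unrelated shell. The greeting variable and the toy work function here are only for illustration:

greeting="hello"
work() { echo "$greeting world"; }
# The child shell shares nothing with this one; it receives source-able
# definitions of the variable and the function, with no hand escaping.
bash -c "$(declare -p greeting; declare -f work); work"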

Something along these lines:

# The input.
input="$(cat <<'EOF'
input1
input2
input3
input4
EOF
)"

work() {
   # Normally write the work to be done here.
   # For each argument, run `my_program` in parallel:
   # -n1 gives each invocation one argument, -P0 runs them all at once.
   printf "%s\n" "$@" | xargs -d'\n' -n1 -P0 ./my_program
}

# For every two arguments, run `srun ...` with a shell that runs `work` in parallel.
# Note - declare -f outputs a source-able definition of the function.
# "No more hand escaping!"
# Then the work function is called with the arguments passed by xargs inside the spawned shell.
xargs -P0 -n2 -d'\n' <<<"$input" \
      srun -N 1 -n 1 \
      bash -c "$(declare -f work)"'; work "$@"' --

The -P0 is specific to GNU xargs. GNU xargs specially handles exit status 255; you can write a wrapper like xargs ... bash -c './my_program "$@" || exit 255' -- || exit 255 if you want xargs to terminate when any of the programs fails.
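
For example, the wrapper could be written into the work function from above (a sketch; it assumes srun reports the failing task's exit status as its own, so the failure also stops the outer xargs):

work() {
   # Exit status 255 makes GNU xargs stop spawning further commands;
   # the trailing '|| exit 255' propagates the failure upwards.
   printf "%s\n" "$@" |
      xargs -d'\n' -n1 -P0 bash -c './my_program "$@" || exit 255' -- ||
      exit 255
}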

If srun preserves environment variables, then export the work function with export -f work and just call it within the child shell, like xargs ... srun ... bash -c 'work "$@"' --.
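
A sketch of that variant, assuming srun does pass the environment through to the spawned shell (Slurm's default --export=ALL behavior) and reusing $input from above:

work() {
   printf "%s\n" "$@" | xargs -d'\n' -n1 -P0 ./my_program
}
export -f work  # ship the function definition via the environment

xargs -P0 -n2 -d'\n' <<<"$input" \
      srun -N 1 -n 1 \
      bash -c 'work "$@"' --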

KamilCuk