I am trying to checkpoint jobs managed by the TORQUE job scheduler using Berkeley Lab Checkpoint/Restart (BLCR). Running cr_run 'my_exec' throws errors, and I believe the cause is that the executable was statically linked at compile time. A simplified pseudo-version of the submit script:
#!/bin/bash
#PBS -q workq
#PBS -l nodes=1:ppn=4
#PBS -l pmem=1gb,pvmem=2gb
#PBS -l walltime=30:00:00
#PBS -o out.log
#PBS -N jobname
#PBS -j oe
cd "$PBS_O_WORKDIR" || exit 1
## count distinct nodes and total processor slots
NNODES=$(sort -u "$PBS_NODEFILE" | wc -l)
NP=$(wc -l < "$PBS_NODEFILE")
echo "PBS_NODEFILE is $PBS_NODEFILE"
echo "NNODES is $NNODES"
cat "$PBS_NODEFILE"
cr_run 'executable' infile.inp > outfile.out &
## store process ID as variable and sleep 29 hours, then checkpoint
BGPID=$!
sleep 104400
cr_checkpoint -p $BGPID -f checkFile.checkpoint --term
I have successfully checkpointed jobs that use dynamically linked binaries (mainly executables built from code I wrote myself), so the basic workflow is not the issue. The problem is that the executable I am trying to run is pre-compiled and I do not have the source code, so I cannot simply rebuild it with dynamic linking.
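To confirm the linkage rather than guess, something like the following could be used. It is only a minimal sketch: classify_linkage is a hypothetical helper name, and it assumes the file utility is installed and that its description contains "statically linked" or "dynamically linked", as it typically does for ELF binaries on Linux.

```shell
#!/bin/bash
# Hypothetical helper: classify a binary from the description text
# printed by `file -b`. Anything unrecognized is reported as "unknown".
classify_linkage() {
    case "$1" in
        *"statically linked"*)  echo "static" ;;
        *"dynamically linked"*) echo "dynamic" ;;
        *)                      echo "unknown" ;;
    esac
}

# Usage: pass the path to the binary as the first argument.
if [ "$#" -gt 0 ]; then
    classify_linkage "$(file -b "$1")"
fi
```

Saved as, say, check_linkage.sh (an assumed name), it would be run as `bash check_linkage.sh ./my_exec`.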
I found documentation here (see 4.2) that seems to offer some advice, but before trying to decipher and test its suggestions I thought it would be worth asking whether anyone has experience checkpointing jobs that run from a statically linked executable.
As a side note, the code does not have internal checkpointing. Also, we actually use a more courteous way of checkpointing than sleeping for 29 hours; I only included the sleep to keep the script uncluttered and readable.
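For illustration only, one possible shape of a less wasteful wait is sketched below; it is not the mechanism we actually use. wait_or_timeout is a hypothetical helper that polls the background process once per second and gives up after a limit, so the checkpoint is only taken if the job is still running.

```shell
#!/bin/bash
# Hypothetical helper: wait until process $1 exits or roughly $2 seconds pass.
# Returns 0 if the process exited on its own, 1 if the limit was reached.
# The elapsed time is approximate because it counts one-second sleeps.
wait_or_timeout() {
    local pid=$1 limit=$2 elapsed=0
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$elapsed" -ge "$limit" ]; then
            return 1
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 0
}

# In the submit script this would stand in for the fixed `sleep 104400`:
#   cr_run 'executable' infile.inp > outfile.out &
#   BGPID=$!
#   if ! wait_or_timeout "$BGPID" 104400; then
#       cr_checkpoint -p $BGPID -f checkFile.checkpoint --term
#   fi
```

If the executable finishes early, the job script ends right away instead of holding the nodes until the 29-hour mark.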