I am trying to checkpoint jobs managed by the TORQUE job scheduler using Berkeley Lab Checkpoint/Restart (BLCR). Running cr_run 'my_exec' throws errors, and I believe the cause is that the executable was statically linked at compile time. The submit script looks like this (a simplified pseudo-version):

#!/bin/bash
#PBS -q workq
#PBS -l nodes=1:ppn=4
#PBS -l pmem=1gb,pvmem=2gb
#PBS -l walltime=30:00:00
#PBS -o out.log
#PBS -N jobname
#PBS -j oe

cd "$PBS_O_WORKDIR"

# count distinct nodes and total processor slots from the node file
NNODES=$(sort -u "$PBS_NODEFILE" | wc -l)
NP=$(wc -l < "$PBS_NODEFILE")
echo "PBS_NODEFILE is $PBS_NODEFILE"
echo "NNODES is $NNODES"
cat "$PBS_NODEFILE"

cr_run 'executable' infile.inp > outfile.out &

## store process ID as variable and sleep 29 hours, then checkpoint
BGPID=$!
sleep 104400

cr_checkpoint -p $BGPID -f checkFile.checkpoint --term

I have successfully checkpointed jobs that run dynamically linked binaries (mainly executables built from code I wrote myself), so I already know how to do that. The problem is that the executable I am trying to run is precompiled, and I do not have the source code; otherwise this would not be an issue.

I found documentation here (see 4.2) that seems to offer some advice, but before trying to decipher and test the suggestions there, I thought it would be worth asking whether anyone has experience checkpointing jobs that run a statically linked executable.

As a side note, the code does not have internal checkpointing. Also, we actually use a more courteous way of checkpointing than sleeping for 29 hours; I only included the sleep to keep the script uncluttered and readable.

tshepang
codeAndStuff
  • Have you found an error in doing this with statically linked code? I wouldn't think it would affect things. – dbeer Oct 04 '13 at 17:14
  • No errors occur with dynamically linked code. See the link above to see an explanation of what to do when the code is statically linked at compile time. The problem is that I do not have the source code so I cannot control how the executable is linked. This was the whole problem. – codeAndStuff Oct 04 '13 at 17:41
  • My mistake - I thought you were saying you had trouble with restarting, but you're talking about the initial build with BLCR. Was the code compiled locally or supplied by a vendor? If it's local, perhaps you can work with the site admin to get a statically linked copy that is BLCR compatible. If it's from a vendor, you probably need to push the vendor for the same thing, but it might be harder. – dbeer Oct 04 '13 at 19:40
  • yeah, unfortunately it isn't a local vendor but we know members of the research group who wrote the software. I was hoping someone had some experience in dealing with this type of thing since it seems like it should be a somewhat common thing to have to do with larger (generally commercial) software packages with no internal checkpointing. – codeAndStuff Oct 04 '13 at 20:03
  • I hope there's a solution for you but my gut says that the software must be re-built. – dbeer Oct 04 '13 at 21:40
  • yeah, same here. Thanks though. – codeAndStuff Oct 05 '13 at 18:55

1 Answer

The answer is in the BLCR FAQ: https://upc-bugs.lbl.gov/blcr/doc/html/FAQ.html#staticlink

If you can checkpoint and restart a dynamically linked application correctly, but 
cannot do so with the same application linked statically, this FAQ entry is for you.
There are multiple reasons why BLCR may have problems with statically linked executables.

** The cr_run utility only supports dynamic executables **
If you wish to checkpoint an unmodified executable, the typical recipe is

$ cr_run my_app my_args

However, the cr_run utility does its work using the "LD_PRELOAD" environment variable 
to force loading of BLCR's support code into the address space of the application. That 
mechanism is only functional for dynamically linked executables. There is no magic we 
can perform today that will resolve this (though in the future we'd like to replace 
our use of LD_PRELOAD with a kernel-side mechanism). So, you'll need to relink any 
statically linked executables to include BLCR support.
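
Since cr_run depends on LD_PRELOAD, it may help to check how a binary is linked before submitting the job. The following is only a minimal sketch: classify_linkage and run_with_blcr are hypothetical helper names, and it assumes the file utility is installed and describes ELF binaries with the usual "dynamically linked" or "statically linked" wording.

```shell
#!/bin/bash
# Classify an executable as dynamic, static, or unknown from the
# description text that `file -L` prints about it.
classify_linkage() {
    case "$1" in
        *"dynamically linked"*) echo dynamic ;;
        *"statically linked"*)  echo static ;;
        *)                      echo unknown ;;
    esac
}

# Hypothetical wrapper: start the program under cr_run only when the
# LD_PRELOAD mechanism has a chance of working.
run_with_blcr() {
    exe_path=$(command -v "$1") || { echo "not found: $1" >&2; return 127; }
    if [ "$(classify_linkage "$(file -L "$exe_path")")" = dynamic ]; then
        cr_run "$@"
    else
        echo "$1 is not dynamically linked; cr_run cannot preload BLCR" >&2
        return 1
    fi
}
```

In the submit script this could replace the bare cr_run line, for example: run_with_blcr executable infile.inp > outfile.out &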

** Linking BLCR's libraries statically takes special care **
OK, we've told you why cr_run doesn't work and told you to relink. You tried linking 
with -lcr_run and/or -lcr and still can't get a checkpoint to work. What went wrong?
You need a -u option in addition to the -l or the static linking will simply ignore 
BLCR's library.
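
For whoever holds the object files (in this case the software's authors), the relink described above might look roughly like the command this sketch prints. It is an illustration under assumptions: blcr_static_link_cmd is a hypothetical helper, gcc is assumed to be the link driver, and cr_run_link_me is the symbol name I believe the BLCR Users Guide gives for libcr_run, so check the guide for your BLCR version before relying on it.

```shell
#!/bin/bash
# Print (do not run) a static link command that forces BLCR's stub library
# into the executable. The -u option marks the symbol as undefined so the
# static linker has a reason to pull the member out of libcr_run.a.
# ASSUMPTION: cr_run_link_me is the symbol exported for this purpose.
blcr_static_link_cmd() {
    output=$1
    shift
    echo "gcc -static -o $output $* -u cr_run_link_me -lcr_run"
}

blcr_static_link_cmd my_app main.o solver.o
```

The printed line can be reviewed and then pasted into the build; my_app, main.o, and solver.o are placeholder names.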

** BLCR doesn't support LinuxThreads **
Ok, what else could go wrong? You've followed the guidance given in the "Cautionary
linker notes" section of the BLCR Users Guide when you linked your application. You 
even ran

$ nm my_app | grep link_me

to be sure the symbol you specified with -u is linked in. However, you are seeing 
weird crashes of your application when you try to checkpoint.

The culprit might be LinuxThreads. Why? Because at the time this FAQ entry is being 
written, there are many Linux distributions that install the static libs for 
LinuxThreads in the default library search path, and with the NPTL static libs 
elsewhere. The resolution could be as simple as linking your application with -L/usr
/lib/nptl or -L/usr/lib64/nptl, perhaps by setting an "LDFLAGS" variable (though it is 
possible that your distribution has picked some other location).
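
Because the NPTL location varies by distribution, one way to set LDFLAGS is to probe the candidate directories rather than hard-coding one. This is a small sketch, with nptl_ldflag as a hypothetical helper name; the two directories passed in the usage line are the ones the FAQ mentions, not a complete list.

```shell
#!/bin/bash
# Print a -L flag for the first candidate directory that exists.
# Returns non-zero (and prints nothing) if none of them do.
nptl_ldflag() {
    for dir in "$@"; do
        if [ -d "$dir" ]; then
            echo "-L$dir"
            return 0
        fi
    done
    return 1
}
```

A build script could then use: LDFLAGS=$(nptl_ldflag /usr/lib64/nptl /usr/lib/nptl) || echo "no NPTL static libs found" >&2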

While it is not strictly required due to binary compatibility between LinuxThreads and 
NPTL, we'd recommend that you at least consider a recompile with -I/usr/include/nptl 
in CFLAGS.

Note, of course, that if BLCR's utilities are statically linked to LinuxThreads, then 
they need to be rebuilt too. See the BLCR Admin Guide for more information on that.
Arjun J Rao