I'm trying to run a very large set of batch jobs on a RHEL5 cluster which uses a Lustre file system. I was getting a strange error with roughly 1% of the jobs: they couldn't find a text file that they all use for steering. A script that reproduces the error looks like this:

#!/usr/bin/env bash

#PBS -t 1-18792
#PBS -l mem=4gb,walltime=30:00
#PBS -l nodes=1:ppn=1
#PBS -q hep
#PBS -o output/fit/out.txt
#PBS -e output/fit/error.txt

cd $PBS_O_WORKDIR
mkdir -p output/fit
echo 'submitted from: ' $PBS_O_WORKDIR 

files=($(ls ./*.txt | sort)) # <-- NOTE THIS LINE

cat batch/fits/fit-paths.txt

For some small fraction of jobs, the error stream output would show:

cat: batch/fits/fit-paths.txt: No such file or directory

Weird enough, but it gets stranger.


When I change the files=($(ls ./*.txt | sort)) line to

files=($(ls batch/fits/*.txt | sort))

The jobs run without errors! Needless to say, this is far from satisfying: I'd rather not have my jobs depend on black magic (although black magic is better than no magic).

Any idea what's going on here?

Shep
  • Best bet is to add debugging, `ls -l batch/fits/*` or similar, to see what IS in that dir. Maybe also wrap that info with timestamps (are these dynamically created files, so that this could be a timing issue?). Add a `sleep x` and test whether that reduces or eliminates the problem. Good luck! – shellter Aug 15 '13 at 14:08
  • somehow adding `ls batch/fits/` seems to have eliminated the problem... very weird, not very satisfying. – Shep Aug 15 '13 at 15:50
  • Is that the first line of the script that is accessing the lustre filesystem or are those other locations also network mounted? – dbeer Aug 15 '13 at 20:25
  • the `cd`, `mkdir`, and `ls` lines all use the lustre filesystem. What do you mean "network mounted"? They are all on the same filesystem. – Shep Aug 18 '13 at 10:53
  • Sorry, I meant to ask if that was the first line to access the network's filesystem or not. You answered my question. – dbeer Aug 20 '13 at 20:56

1 Answer

Try replacing

files=($(ls ./*.txt | sort))

with

files=(./*.txt)

Normally, the shell automatically sorts glob results, and – in contrast to parsing ls(1) output, which should never be done in portable shell scripts – handles quoting of special characters correctly.

The quoting problem only bites if your file names ever contain certain shell metacharacters, though. Candidates here are space, tab, newline and possibly carriage return.
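To make the difference concrete, here is a minimal sketch that runs in a scratch directory. The file names and the temporary directory are made up for illustration and have nothing to do with the cluster setup in the question. It only shows the word-splitting difference between the two forms; it does not try to reproduce the missing-file error.

```bash
#!/usr/bin/env bash
# Hypothetical demo: glob expansion vs. parsing ls output.
# All file names below are invented for this example.

tmpdir=$(mktemp -d)
trap 'rm -rf "$tmpdir"' EXIT
cd "$tmpdir" || exit 1

# One ordinary name and one name containing a space.
touch 'a.txt' 'b file.txt'

# Glob straight into an array: one element per file,
# already sorted by the shell.
globbed=(./*.txt)

# Parsing ls: the unquoted command substitution is word-split,
# so 'b file.txt' ends up as two separate elements.
parsed=($(ls ./*.txt | sort))

echo "glob: ${#globbed[@]} entries"
echo "ls:   ${#parsed[@]} entries"
```

With these two files the glob array should hold 2 entries and the `ls` array 3. One caveat: if nothing matches, bash leaves the pattern `./*.txt` in the array as a literal string unless `shopt -s nullglob` is set.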

mirabilos