
I'm running batch jobs on a RHEL5 Lustre filesystem. Many jobs (13k) read the same text file, which is used to direct each job to a different dataset. The code looks like this:

with open('dataset-paths.txt') as txt_file: 
    dataset_location = txt_file.readlines()[job_number].strip()

But for some fraction of my jobs, I get

IOError: [Errno 2] No such file or directory: 'dataset-paths.txt'

Is it not possible to open the same text file from multiple processes at the same time? What else could cause this?

Shep

3 Answers


Just a "random guess", maybe to error message is just misleading?

Remember there is a limit on the number of open files -- or, to be precise, on the number of file descriptors. Given the high number of processes involved, it is quite possible that at some point during execution that limit is reached...
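
For what it's worth, a small sketch that checks the per-process limit from within Python, using the standard `resource` module:

import resource

# per-process file descriptor limits (soft and hard),
# the same values `ulimit -Sn` / `ulimit -Hn` report
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("soft limit: %d, hard limit: %d" % (soft, hard))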

Sylvain Leroux
  • `cat /proc/sys/fs/file-max` gives me 829173, which is far more than the number of jobs I'm running (about 18k) – Shep Aug 15 '13 at 13:43
  • @Shep I'm not a specialist in that kind of stuff, but as far as I remember, there is a "system" limit, as well as per-user/per-group/per-process-group limit(s). On my system, `ulimit -Hn -Sn` reports only 1024 max open files, whereas `/proc/sys/fs/file-max` is way beyond that. – Sylvain Leroux Aug 15 '13 at 15:23

I have no clue why that is happening; maybe there are locks on the file or too many open file handles. But apply this when you open/interact with your file. It basically keeps trying until there are no errors.

result = None
while result is None:
    try:
        # connect / perform I/O
        result = get_data(...)
    except IOError:
        # the read failed; loop around and try again
        pass
  • tried something like this... no luck, it's like the file is missing for the duration of the job – Shep Aug 15 '13 at 13:40
  • @Shep Have you tried to add a timeout (`time.sleep(s)`)? Maybe the file handles need time to be removed or something. I would say about 2 seconds should do the job, give it a try. –  Aug 15 '13 at 13:46
  • yeah, I tried having it check 100 times with a 5 second timeout – Shep Aug 15 '13 at 13:47
  • no, same problem. I'm looking into limiting the number of simultaneous jobs (although I don't really know why that would help). – Shep Aug 15 '13 at 13:55
  • Note I can reproduce this error [without even using python](http://stackoverflow.com/q/18253796/915501) – Shep Aug 15 '13 at 13:56
  • @Shep Can you add a small chunk of code to determine if the directory exists? –  Aug 15 '13 at 14:23
  • @Shep also try to add this condition `os.path.exists()` –  Aug 15 '13 at 14:26
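
Putting the suggestions from these comments together -- waiting between attempts and checking with `os.path.exists()` first -- a sketch along those lines (reusing `job_number` and the file name from the question) might look like:

import os
import time

dataset_location = None
for attempt in range(100):
    # only try to read once the file is actually visible on this node
    if os.path.exists('dataset-paths.txt'):
        try:
            with open('dataset-paths.txt') as txt_file:
                dataset_location = txt_file.readlines()[job_number].strip()
            break
        except IOError:
            pass
    time.sleep(5)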

There is no reason why you should need 13K jobs all reading the same file just to pick out one line:

dataset_location = txt_file.readlines()[job_number].strip()

It would be more efficient to read the file once, and pass dataset_location to each of the 13k jobs as an argument.
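
For example, a minimal sketch of that idea in Python, assuming a hypothetical per-job command `run_job` that accepts the dataset path as an argument:

import subprocess

# read the path list exactly once, in the launcher
with open('dataset-paths.txt') as txt_file:
    dataset_locations = [line.strip() for line in txt_file]

for dataset_location in dataset_locations:
    # 'run_job' is a placeholder; substitute the real job submission command
    subprocess.Popen(['run_job', dataset_location])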

unutbu
  • sure, this would work if the jobs were spawned in python, but they are spawned from a bash script that can only vary an integer between jobs... it's ugly, but this is the only working solution I could come up with – Shep Aug 14 '13 at 19:12