
I'm running batch jobs on a RHEL5 Lustre filesystem. Many jobs (13k) read the same text file, which is used to direct each job to a different dataset. The code looks like this:

with open('dataset-paths.txt') as txt_file: 
    dataset_location = txt_file.readlines()[job_number].strip()

But for some fraction of my jobs, I get

IOError: [Errno 2] No such file or directory: 'dataset-paths.txt'

Is it not possible to open the same text file from multiple processes at the same time? What else could cause this?

Shep

3 Answers


Just a "random guess", maybe to error message is just misleading?

Remember there is a limit on the number of open files -- or, to be precise, on the number of file descriptors. Given the high number of processes involved, it is quite possible that at some point during execution that limit is reached...
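
For what it's worth, a small sketch that checks the per-process limit from within Python, using the standard `resource` module:

import resource

# per-process file descriptor limits (soft and hard),
# the same values `ulimit -Sn` / `ulimit -Hn` report
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("soft limit: %d, hard limit: %d" % (soft, hard))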

Sylvain Leroux
  • `cat /proc/sys/fs/file-max` gives me 829173, which is far more than the number of jobs I'm running (about 18k) – Shep Aug 15 '13 at 13:43
  • @Shep I'm not a specialist in that kind of stuff, but as far as I remember, there is a "system" limit, as well as per-user/per-group/per-process-group limit(s). On my system, `ulimit -Hn -Sn` reports only 1024 max open files, whereas `/proc/sys/fs/file-max` is way beyond that. – Sylvain Leroux Aug 15 '13 at 15:23

I have no clue why that is happening; maybe there are locks on the file or too many open file handles. But apply this when you open/interact with your file. It basically keeps trying until there are no errors.

result = None
while result is None:
    try:
        # connect / perform I/O
        result = get_data(...)
    except IOError:
        # the read failed; loop around and try again
        pass
  • tried something like this... no luck, it's like the file is missing for the duration of the job – Shep Aug 15 '13 at 13:40
  • @Shep Have you tried to add a timeout (`time.sleep(s)`)? Maybe the file handles need time to be removed or something. I would say about 2 seconds should do the job, give it a try. –  Aug 15 '13 at 13:46
  • yeah, I tried having it check 100 times with a 5 second timeout – Shep Aug 15 '13 at 13:47
  • no, same problem. I'm looking into limiting the number of simultaneous jobs (although I don't really know why that would help). – Shep Aug 15 '13 at 13:55
  • Note I can reproduce this error [without even using python](http://stackoverflow.com/q/18253796/915501) – Shep Aug 15 '13 at 13:56
  • @Shep Can you add a small chunk of code to determine if the directory exists? –  Aug 15 '13 at 14:23
  • @Shep also try to add this condition `os.path.exists()` –  Aug 15 '13 at 14:26
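
Putting the suggestions from these comments together -- waiting between attempts and checking with `os.path.exists()` first -- a sketch along those lines (reusing `job_number` and the file name from the question) might look like:

import os
import time

dataset_location = None
for attempt in range(100):
    # only try to read once the file is actually visible on this node
    if os.path.exists('dataset-paths.txt'):
        try:
            with open('dataset-paths.txt') as txt_file:
                dataset_location = txt_file.readlines()[job_number].strip()
            break
        except IOError:
            pass
    time.sleep(5)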

There is no reason why you should need 13K jobs all reading the same file just to pick out one line:

dataset_location = txt_file.readlines()[job_number].strip()

It would be more efficient to read the file once, and pass dataset_location to each of the 13k jobs as an argument.
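
For example, a minimal sketch of that idea in Python, assuming a hypothetical per-job command `run_job` that accepts the dataset path as an argument:

import subprocess

# read the path list exactly once, in the launcher
with open('dataset-paths.txt') as txt_file:
    dataset_locations = [line.strip() for line in txt_file]

for dataset_location in dataset_locations:
    # 'run_job' is a placeholder; substitute the real job submission command
    subprocess.Popen(['run_job', dataset_location])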

unutbu
  • sure, this would work if the jobs were spawned in python, but they are spawned from a bash script that can only vary an integer between jobs... it's ugly, but this is the only working solution I could come up with – Shep Aug 14 '13 at 19:12