I need to run a large number of simulations with a tool called ngspice. Since I want to run a million simulations, I am distributing them across a cluster of machines (a master plus one slave to start with, each with 12 cores).
The commands look like this:
ngspice deck_1.sp
ngspice deck_2.sp
and so on.
Step 1: A Python script generates the .sp files.
Step 2: Python invokes GNU parallel to distribute the .sp files across the master/slave and run the simulations with ngspice.
Step 3: A Python script post-processes the results.
I generate and process only 1000 files at a time to save disk space, so Steps 1 to 3 are repeated in a loop until a million files have been simulated.
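The batched workflow above can be sketched roughly like this (the exact GNU parallel flags and the hostname "slave" are my assumptions, not the literal command I run):

```python
# Sketch of the three-step batch loop. Generation and post-processing
# are stubbed out; only the batching arithmetic and the shape of the
# GNU parallel command line are shown.
TOTAL_DECKS = 1_000_000   # simulations overall
BATCH_SIZE = 1_000        # decks kept on disk at any one time

def deck_names(start, count):
    # Step 1 produces files named deck_<n>.sp
    return [f"deck_{i}.sp" for i in range(start, start + count)]

def parallel_cmd(decks):
    # Step 2: ":" is GNU parallel's notation for the local machine, so
    # ":,slave" spreads one ngspice job per deck over master and slave.
    return ["parallel", "--sshlogin", ":,slave", "ngspice", ":::"] + decks

# Steps 1-3 repeat once per batch: 1,000,000 / 1,000 = 1000 iterations.
batch_starts = list(range(1, TOTAL_DECKS + 1, BATCH_SIZE))
```

The .sp files live on an NFS share visible to both machines, which is why the parallel command does not need to transfer them.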
Now, my problem is:
The first time through the loop, there is no problem: the files are distributed across the master/slave until all 1000 simulations are complete. When the loop starts the second time, I delete the existing .sp files and regenerate them (Step 1). But when I execute Step 2, some files are, for some strange reason, not detected. After some debugging, the errors I get are "Stale NFS file handle" and "No such file or directory: deck_21.sp", etc., for some of the .sp files created in Step 1.
I paused my Python script and ran 'ls' in the directory, and the files do exist; but, as the error points out, the problem is the stale NFS file handle. This link recommends remounting the client, but I am logged into a machine on which I have no admin privileges to remount anything.
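One user-level idea I have considered (it needs no remount) is to retry the file check until the NFS client's cached attributes expire; ESTALE is the errno behind "Stale NFS file handle". A rough sketch, assuming the cache times out within a couple of minutes:

```python
import errno
import os
import time

def wait_until_visible(path, timeout=120.0, interval=2.0):
    """Poll stat() until `path` stops raising ESTALE/ENOENT, or give up."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            os.stat(path)
            return True
        except OSError as e:
            # ESTALE = stale NFS file handle; ENOENT = not visible yet.
            if e.errno not in (errno.ESTALE, errno.ENOENT):
                raise
            time.sleep(interval)
    return False
```

I could call this on each regenerated deck before invoking GNU parallel, but I don't know whether polling is the right fix or just papers over the problem.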
Is there a way I can resolve this?
Thanks!