I need to run a large number of simulations with a tool called ngspice. Since I want to run a million simulations, I am distributing them across a cluster of machines (a master plus one slave to start with, each with 12 cores).
The commands look like this:
ngspice deck_1.sp
ngspice deck_2.sp
and so on.
Step 1: A Python script generates the .sp files.
Step 2: Python invokes GNU parallel to distribute the .sp files across the master/slave and run the simulations with ngspice.
Step 3: A Python script post-processes the results.
I generate and process only 1000 files at a time to save disk space, so Steps 1 to 3 are repeated in a loop until a million files have been simulated.
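The batched workflow above can be sketched roughly like this (the exact GNU parallel flags and the hostname "slave" are my assumptions, not the literal command I run):

```python
# Sketch of the three-step batch loop. Generation and post-processing
# are stubbed out; only the batching arithmetic and the shape of the
# GNU parallel command line are shown.
TOTAL_DECKS = 1_000_000   # simulations overall
BATCH_SIZE = 1_000        # decks kept on disk at any one time

def deck_names(start, count):
    # Step 1 produces files named deck_<n>.sp
    return [f"deck_{i}.sp" for i in range(start, start + count)]

def parallel_cmd(decks):
    # Step 2: ":" is GNU parallel's notation for the local machine, so
    # ":,slave" spreads one ngspice job per deck over master and slave.
    return ["parallel", "--sshlogin", ":,slave", "ngspice", ":::"] + decks

# Steps 1-3 repeat once per batch: 1,000,000 / 1,000 = 1000 iterations.
batch_starts = list(range(1, TOTAL_DECKS + 1, BATCH_SIZE))
```

The .sp files live on an NFS share visible to both machines, which is why the parallel command does not need to transfer them.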
Now, my problem is:
The first time through the loop, there is no problem: the files are distributed across the master/slave until all 1000 simulations are complete. When the loop starts the second time, I delete the existing .sp files and regenerate them (Step 1). But when I execute Step 2, some files are, for some strange reason, not detected. After some debugging, the errors I get are "Stale NFS file handle" and "No such file or directory: deck_21.sp", etc., for some of the .sp files created in Step 1.
I paused my Python script and ran 'ls' in the directory, and the files do exist; but, as the error points out, the problem is the stale NFS file handle. This link recommends remounting the client, but I am logged into a machine on which I have no admin privileges to remount anything.
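One user-level idea I have considered (it needs no remount) is to retry the file check until the NFS client's cached attributes expire; ESTALE is the errno behind "Stale NFS file handle". A rough sketch, assuming the cache times out within a couple of minutes:

```python
import errno
import os
import time

def wait_until_visible(path, timeout=120.0, interval=2.0):
    """Poll stat() until `path` stops raising ESTALE/ENOENT, or give up."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            os.stat(path)
            return True
        except OSError as e:
            # ESTALE = stale NFS file handle; ENOENT = not visible yet.
            if e.errno not in (errno.ESTALE, errno.ENOENT):
                raise
            time.sleep(interval)
    return False
```

I could call this on each regenerated deck before invoking GNU parallel, but I don't know whether polling is the right fix or just papers over the problem.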
Is there a way I can resolve this?
Thanks!