I'm a physics student trying to run a research simulation with stochastic elements. The simulation splits into several non-interacting parts, each evolving randomly, so no communication between runs is required.
A separate script later analyzes the results returned by all the jobs; it isn't related to the problem and is mentioned only to give a clear picture of the workflow.
I use the institute's HPC (which I'll refer to as the "cluster") to run multiple copies of my code, which is a single .py file that doesn't read from any other file (but does create output files). Each copy/realization of the code is supposed to create its own unique working directory using os.makedirs(path, exist_ok=True) and then switch into it with os.chdir(path).
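In outline, each realization does something like the following (a stripped-down sketch; the full snippet, with the error handling I added, is attached further down):

import os
import datetime

date = datetime.datetime.now().date()    # run date, used as the top-level folder name
txt_file = 'N800_L1600_T10_DT0.5'        # parameter string (built from N, L, T, DT in the real code)
worker_num = os.environ['LSB_JOBINDEX']  # index of this job within the array
path = '/home/labs/{}/{}/worker{}'.format(date, txt_file, worker_num)

os.makedirs(path, exist_ok=True)         # each realization gets its own directory
os.chdir(path)                           # output files are then written relative to it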
I've already made numerous attempts at running this, and they end up showing the following types of behavior:
- Some of the array jobs run and behave well.
- Others overwrite each other (i.e. job1 and job2 both write to a .txt file in job1's directory).
- Others simply don't create a directory at all, but don't die either, so I assume they keep running and write my data somewhere I don't know about and can't access.
These behaviors seem completely random to me: I can't tell ahead of time which array job will work flawlessly and which will contain jobs showing behavior 2, behavior 3, or both (in a large job array I may get some jobs running well, some showing behavior 2, some showing behavior 3, and some showing both).
I've pretty much tried everything I could find online. For example, I read somewhere that a common problem with os.makedirs is a umask issue and that calling os.umask(0) before it is good practice, so I added that. I've also read that a cluster can sometimes get hung up, and that calling time.sleep for a few seconds and trying again might help, so I did that as well. Nothing has solved the problem yet...
I'm attaching the part of the code that might be the culprit for inspection. N, L, T and DT are numbers I set earlier in the code, where I also import the libraries and such. (Note that my office computer runs Windows while the cluster runs Linux, so I use os.name just to pick the directory according to the OS I'm running on, letting the code run without modification on both systems):
when = datetime.datetime.now()
date = when.date()
hour = when.time()                       # timestamp used in the error log below
worker_num = os.environ['LSB_JOBINDEX']  # index of this job within the array
pid = os.environ['LSB_JOBID']            # LSF job ID (not a process ID)
work = 'worker' + worker_num
txt_file = 'N{}_L{}_T{}_DT{}'.format(N, L, T, DT)
if os.name == 'nt':
    # office computer (Windows)
    path = 'D:/My files/Python Scripts/Cluster/{}/{}/{}'.format(date, txt_file, work)
else:
    # cluster (Linux)
    path = '/home/labs/{}/{}/{}'.format(date, txt_file, work)
os.umask(0)
try:
    os.makedirs(path, exist_ok=True)
    os.chdir(path)
except OSError:
    # if directory creation fails, wait, log the failure, and try once more
    time.sleep(10)
    with open('/home/labs/error_{}_{}.txt'.format(txt_file, work), 'a+') as f:
        f.write('In {}, at time {}, job ID: {}, which was sent to queue: {}, '
                'working on host: {}, failed to create path: {} '.format(
                    date, hour, pid, os.environ['LSB_QUEUE'],
                    os.environ['LSB_HOSTS'], path))
    os.makedirs(path, exist_ok=True)
    os.chdir(path)
The cluster runs an LSF environment. To run multiple realizations of my code, I submit a job array, i.e. I use LSF to send multiple instances of the same code (100 in this case) to several CPUs on different (or the same) hosts in the cluster.
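For concreteness, each element of the array identifies itself on the Python side through the LSF environment variables shown below; the submission command in the comment is only illustrative (the script name is a placeholder, not my actual file):

# submitted as an array of 100 elements, roughly:
#   bsub -J "sim[1-100]" -o sim.%J.%I python simulation.py
import os

worker_num = os.environ['LSB_JOBINDEX']  # element index within the array ('1'..'100')
pid = os.environ['LSB_JOBID']            # LSF job ID
queue = os.environ['LSB_QUEUE']          # queue the job was dispatched to
host = os.environ['LSB_HOSTS']           # execution host(s)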
I'm also attaching examples showing the errors described above. An example of behavior 2 is the following output file:
Stst progress = 10.0% after 37 seconds
Stst progress = 10.0% after 42 seconds
Stst progress = 20.0% after 64 seconds
Stst progress = 20.0% after 75 seconds
Stst progress = 30.0% after 109 seconds
Stst progress = 40.0% after 139 seconds
worker99 is 5.00% finished after 0.586 hours and will finish in approx 11.137 hours
worker99 is 5.00% finished after 0.691 hours and will finish in approx 13.130 hours
worker99 is 10.00% finished after 1.154 hours and will finish in approx 10.382 hours
worker99 is 10.00% finished after 1.340 hours and will finish in approx 12.062 hours
worker99 is 15.00% finished after 1.721 hours and will finish in approx 9.753 hours
worker99 is 15.00% finished after 1.990 hours and will finish in approx 11.275 hours
worker99 is 20.00% finished after 2.287 hours and will finish in approx 9.148 hours
worker99 is 20.00% finished after 2.633 hours and will finish in approx 10.532 hours
worker99 is 25.00% finished after 2.878 hours and will finish in approx 8.633 hours
worker99 is 25.00% finished after 3.275 hours and will finish in approx 9.826 hours
worker99 is 30.00% finished after 3.443 hours and will finish in approx 8.033 hours
worker99 is 30.00% finished after 3.921 hours and will finish in approx 9.149 hours
worker99 is 35.00% finished after 4.015 hours and will finish in approx 7.456 hours
worker99 is 35.00% finished after 4.566 hours and will finish in approx 8.480 hours
worker99 is 40.00% finished after 4.616 hours and will finish in approx 6.924 hours
worker99 is 45.00% finished after 5.182 hours and will finish in approx 6.334 hours
worker99 is 40.00% finished after 5.209 hours and will finish in approx 7.814 hours
worker99 is 50.00% finished after 5.750 hours and will finish in approx 5.750 hours
worker99 is 45.00% finished after 5.981 hours and will finish in approx 7.310 hours
worker99 is 55.00% finished after 6.322 hours and will finish in approx 5.173 hours
worker99 is 50.00% finished after 6.623 hours and will finish in approx 6.623 hours
worker99 is 60.00% finished after 6.927 hours and will finish in approx 4.618 hours
worker99 is 55.00% finished after 7.266 hours and will finish in approx 5.945 hours
worker99 is 65.00% finished after 7.513 hours and will finish in approx 4.046 hours
worker99 is 60.00% finished after 7.928 hours and will finish in approx 5.285 hours
worker99 is 70.00% finished after 8.079 hours and will finish in approx 3.463 hours
worker99 is 65.00% finished after 8.580 hours and will finish in approx 4.620 hours
worker99 is 75.00% finished after 8.644 hours and will finish in approx 2.881 hours
worker99 is 80.00% finished after 9.212 hours and will finish in approx 2.303 hours
worker99 is 70.00% finished after 9.227 hours and will finish in approx 3.954 hours
worker99 is 85.00% finished after 9.778 hours and will finish in approx 1.726 hours
worker99 is 75.00% finished after 9.882 hours and will finish in approx 3.294 hours
worker99 is 90.00% finished after 10.344 hours and will finish in approx 1.149 hours
worker99 is 80.00% finished after 10.532 hours and will finish in approx 2.633 hours
A .txt file like this, made to keep track of the code's progress, is normally created by each job individually and stored in its own directory. In this case, for some reason, two different jobs are writing to the same file. This is confirmed by a different .txt file that is created right after the directory is created and the working directory is set:
In 2016-04-01, at time 02:11:51.851948, job ID: 373244, which was sent to
queue: new-short, working on host: cn129.wexac.weizmann.ac.il, has created
path: /home/labs/2016-04-02/N800_L1600_T10_DT0.5/worker99
In 2016-04-01, at time 02:12:09.968549, job ID: 373245, which was sent to
queue: new-medium, working on host: cn293.wexac.weizmann.ac.il, has created
path: /home/labs/2016-04-02/N800_L1600_T10_DT0.5/worker99
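For completeness, the progress lines shown above are written with a plain append from inside the working directory, roughly like this (a simplified sketch; the literal values and variable names are illustrative, not my exact code):

# this runs after the os.chdir(path) shown earlier, so the relative filename
# resolves inside whatever directory this job switched into
txt_file = 'N800_L1600_T10_DT0.5'
worker_num = '99'                        # in the real code this comes from LSB_JOBINDEX
with open(txt_file + '.txt', 'a+') as f:
    f.write('worker{} is {:.2f}% finished after {:.3f} hours and will finish '
            'in approx {:.3f} hours\n'.format(worker_num, 5.00, 0.586, 11.137))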
I'd very much appreciate any help I can get solving this problem as it is holding us back from advancing our research. If any additional details are required for figuring this one out, I'd be happy to supply them.
Thanks!