
I'm a physics student trying to run a research simulation with stochastic elements. The simulation splits into several non-interacting parts, each evolving randomly, so no communication between runs is required.

I later use a separate script to analyze the results returned by all the jobs (this isn't related to the problem; I mention it only to give a clear picture of the workflow).

I use the institute's HPC (which I'll refer to as "the cluster") to run multiple copies of my code, a single .py file that doesn't read from any other file (but does create output files). Each realization of the code is supposed to create its own unique working directory using os.makedirs(path, exist_ok=True) and then switch into it with os.chdir(path). I've already made numerous attempts at running this, and they end with the following types of behavior:

  1. Some of the array jobs run and behave well.
  2. Others overwrite each other (i.e. job1 and job2 both write to a .txt file in job1's directory).
  3. Others simply don't create a directory at all, but don't die either, so I assume they keep running and write my data somewhere I don't know of and can't access.

These behaviors seem completely random to me, in the sense that I can't predict which array job will work flawlessly and which will contain jobs showing behavior 2, behavior 3, or both (in a large job array I may get some jobs running well, some showing behavior 2, some showing behavior 3, and some showing both).

I've tried pretty much everything I could find online. For example, I read that a common problem with os.makedirs is a umask issue and that calling os.umask(0) beforehand is good practice, so I added that. I've also read that a cluster can sometimes get hung up, and that sleeping for a few seconds with time.sleep and trying again might help, so I did that too. Nothing has solved the problem yet.

I'm attaching the part of the code that might be the culprit. N, L, T and DT are numbers I set earlier in the code, where I also import the libraries. (Note that my office computer runs Windows while the cluster runs Linux, so I use os.name to set the directories according to the OS I'm running on, letting the code run on both systems without modification.)

when = datetime.datetime.now()
date = when.date()
hour = when.time()                       # time stamp used in the error log below
worker_num = os.environ['LSB_JOBINDEX']  # index of this element within the job array
pid = os.environ['LSB_JOBID']            # LSF job ID (shared by the whole array)
work = 'worker' + worker_num
txt_file = 'N{}_L{}_T{}_DT{}'.format(N, L, T, DT)
if os.name == 'nt':   # office computer (Windows)
    path = 'D:/My files/Python Scripts/Cluster/{}/{}/{}'.format(date, txt_file, work)
else:                 # cluster (Linux)
    path = '/home/labs/{}/{}/{}'.format(date, txt_file, work)
    os.umask(0)
try:
    os.makedirs(path, exist_ok=True)
    os.chdir(path)
except OSError:
    # If directory creation fails, wait, log the failure, then try once more.
    time.sleep(10)
    with open('/home/labs/error_{}_{}.txt'.format(txt_file, work), 'a+') as f:
        f.write('In {}, at time {}, job ID: {}, which was sent to queue: {}, '
                'working on host: {}, failed to create path: {}'.format(
                    date, hour, pid, os.environ['LSB_QUEUE'],
                    os.environ['LSB_HOSTS'], path))
    os.makedirs(path, exist_ok=True)
    os.chdir(path)

The cluster runs LSF. To run multiple realizations of my code, I submit a job array, i.e. I use LSF to send multiple instances (100 in this case) of the same code to different CPUs on different (or the same) hosts in the cluster.
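
For completeness, here is a minimal diagnostic sketch (not part of my actual script; the output format is just illustrative) that could sit at the top of the .py file so that each array element records which LSF environment it sees in its own stdout log:

import os
import socket

# Diagnostic only: record the LSF environment this array element sees.
job_id = os.environ.get('LSB_JOBID', 'unknown')
job_index = os.environ.get('LSB_JOBINDEX', 'unknown')
queue = os.environ.get('LSB_QUEUE', 'unknown')
print('Starting: LSB_JOBID={}, LSB_JOBINDEX={}, LSB_QUEUE={}, host={}'.format(
    job_id, job_index, queue, socket.gethostname()))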

I'm also attaching examples showing the errors I describe above. An example of behavior 2 is the following output file:

Stst progress = 10.0% after 37 seconds

Stst progress = 10.0% after 42 seconds

Stst progress = 20.0% after 64 seconds

Stst progress = 20.0% after 75 seconds

Stst progress = 30.0% after 109 seconds

Stst progress = 40.0% after 139 seconds

worker99 is 5.00% finished after 0.586 hours and will finish in approx 11.137 hours
worker99 is 5.00% finished after 0.691 hours and will finish in approx 13.130 hours
worker99 is 10.00% finished after 1.154 hours and will finish in approx 10.382 hours
worker99 is 10.00% finished after 1.340 hours and will finish in approx 12.062 hours
worker99 is 15.00% finished after 1.721 hours and will finish in approx 9.753 hours
worker99 is 15.00% finished after 1.990 hours and will finish in approx 11.275 hours
worker99 is 20.00% finished after 2.287 hours and will finish in approx 9.148 hours
worker99 is 20.00% finished after 2.633 hours and will finish in approx 10.532 hours
worker99 is 25.00% finished after 2.878 hours and will finish in approx 8.633 hours
worker99 is 25.00% finished after 3.275 hours and will finish in approx 9.826 hours
worker99 is 30.00% finished after 3.443 hours and will finish in approx 8.033 hours
worker99 is 30.00% finished after 3.921 hours and will finish in approx 9.149 hours
worker99 is 35.00% finished after 4.015 hours and will finish in approx 7.456 hours
worker99 is 35.00% finished after 4.566 hours and will finish in approx 8.480 hours
worker99 is 40.00% finished after 4.616 hours and will finish in approx 6.924 hours
worker99 is 45.00% finished after 5.182 hours and will finish in approx 6.334 hours
worker99 is 40.00% finished after 5.209 hours and will finish in approx 7.814 hours
worker99 is 50.00% finished after 5.750 hours and will finish in approx 5.750 hours
worker99 is 45.00% finished after 5.981 hours and will finish in approx 7.310 hours
worker99 is 55.00% finished after 6.322 hours and will finish in approx 5.173 hours
worker99 is 50.00% finished after 6.623 hours and will finish in approx 6.623 hours
worker99 is 60.00% finished after 6.927 hours and will finish in approx 4.618 hours
worker99 is 55.00% finished after 7.266 hours and will finish in approx 5.945 hours
worker99 is 65.00% finished after 7.513 hours and will finish in approx 4.046 hours
worker99 is 60.00% finished after 7.928 hours and will finish in approx 5.285 hours
worker99 is 70.00% finished after 8.079 hours and will finish in approx 3.463 hours
worker99 is 65.00% finished after 8.580 hours and will finish in approx 4.620 hours
worker99 is 75.00% finished after 8.644 hours and will finish in approx 2.881 hours
worker99 is 80.00% finished after 9.212 hours and will finish in approx 2.303 hours
worker99 is 70.00% finished after 9.227 hours and will finish in approx 3.954 hours
worker99 is 85.00% finished after 9.778 hours and will finish in approx 1.726 hours
worker99 is 75.00% finished after 9.882 hours and will finish in approx 3.294 hours
worker99 is 90.00% finished after 10.344 hours and will finish in approx 1.149 hours
worker99 is 80.00% finished after 10.532 hours and will finish in approx 2.633 hours

A .txt file like this, meant for keeping track of the code's progress, is normally created by each job individually and stored in its own directory. Here, for some reason, two different jobs are writing to the same file. This is confirmed by a different .txt file that each job creates right after its directory is created and the working directory is set:

In 2016-04-01, at time 02:11:51.851948, job ID: 373244, which was sent to
queue: new-short, working on host: cn129.wexac.weizmann.ac.il, has created 
path: /home/labs/2016-04-02/N800_L1600_T10_DT0.5/worker99 

In 2016-04-01, at time 02:12:09.968549, job ID: 373245, which was sent to 
queue: new-medium, working on host: cn293.wexac.weizmann.ac.il, has created 
path: /home/labs/2016-04-02/N800_L1600_T10_DT0.5/worker99   
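
(The part of the script that writes this confirmation file isn't included in the excerpt above; roughly, it looks like the sketch below. The file name is illustrative, and the variables are the ones defined in the excerpt.)

# Rough reconstruction of the confirmation-log write; not the exact code.
with open('/home/labs/created_{}_{}.txt'.format(txt_file, work), 'a+') as f:
    f.write('In {}, at time {}, job ID: {}, which was sent to queue: {}, '
            'working on host: {}, has created path: {}\n'.format(
                date, hour, pid, os.environ['LSB_QUEUE'],
                os.environ['LSB_HOSTS'], path))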

I'd very much appreciate any help with this problem, as it is holding back our research. If any additional details are needed to figure this out, I'd be happy to supply them.
Thanks!

Asaf M
  • Is your problem essentially "In this case, for some reason, two different jobs are writing to the same file"? –  Apr 01 '16 at 10:32
  • Is it guaranteed that all the `LSB_JOBINDEX` envvars are unique inside each process? Have you examined (printed) `os.environ['LSB_JOBINDEX']`? –  Apr 01 '16 at 10:56
  • @Evert: The problem is essentially that in some cases, for some reason, two different jobs write to the same files (not only the text file but, I guess, the other data files as well, which is very bad), and also that in some cases not all directories are created. I don't know whether these are connected or separate problems. – Asaf M Apr 01 '16 at 11:43
  • Regarding `LSB_JOBINDEX`: there are a lot of cases where there is no problem at all with a specific array job, i.e. all of the unique directories are created and the jobs write to their own directories. If there were such a problem, shouldn't it always happen? – Asaf M Apr 01 '16 at 11:49
  • "Shouldn't it always happen?" No: there is some amount of time between the job being started and your script reading the envvar. That amount of time is not fixed, as it depends on the current load (i.e., other background processes and such). Not saying this is the issue, but it could lead to a form of race condition. –  Apr 01 '16 at 11:55

1 Answer


The log excerpt you supplied shows that the two jobs (373244 and 373245) were sent to two different queues:

In 2016-04-01, at time 02:11:51.851948, job ID: 373244, which was sent to queue: new-short, ...

In 2016-04-01, at time 02:12:09.968549, job ID: 373245, which was sent to queue: new-medium, ...

This suggests that the array job is being submitted twice, to two separate queues. Look at the code that submits the array job and make sure it runs only once and sends the job to a single queue.

I think submitting the array job more than once would cause exactly the behavior you are seeing: two independent submissions produce elements with the same LSB_JOBINDEX, so two different jobs end up claiming the same worker directory.
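
If that turns out to be the cause, one defensive tweak (just a sketch, with an illustrative base path modelled on the question) is to make such a collision fail loudly: create the worker directory without exist_ok=True, so a second job landing on the same path raises instead of silently sharing files.

import errno
import os

# Sketch only: refuse to reuse an existing worker directory, so a duplicate
# submission fails immediately instead of overwriting another job's output.
base = '/home/labs/my_run'   # hypothetical base directory
worker_dir = os.path.join(base, 'worker' + os.environ['LSB_JOBINDEX'])

try:
    os.makedirs(worker_dir)          # no exist_ok=True: raises if it already exists
except OSError as exc:
    if exc.errno == errno.EEXIST:
        raise RuntimeError('{} already exists; another job (possibly from a '
                           'duplicate submission) is using it'.format(worker_dir))
    raise
os.chdir(worker_dir)

Alternatively, folding the unique LSB_JOBID into the directory name would keep even duplicate submissions from colliding, at the cost of a messier directory layout.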

RichTBreak