
I have a complex Python pipeline (whose code I can't change) that calls multiple other scripts and executables. The point is that it takes ages to run over 8000 directories, doing some scientific analyses. So I wrote a simple wrapper (which might not be the most efficient, but it seems to work) using the multiprocessing module.

from os import path, listdir, mkdir, system
from os.path import join as osjoin, exists, isfile
from GffTools import Gene, Element, Transcript
from GffTools import read as gread, write as gwrite, sort as gsort
from re import match
from multiprocessing import JoinableQueue, Process
from sys import argv, exit

# some absolute paths
inbase = "/.../abfgp_in"
outbase = "/.../abfgp_out"
abfgp_cmd = "python /.../abfgp-2.rev/abfgp.py"
refGff = "/.../B0510_manual_reindexed_noSeq.gff"

# the Queue
Q = JoinableQueue()
i = 0

# define number of processes
try: num_p = int(argv[1])
except (IndexError, ValueError): exit("Wrong CPU argument")

# This is the function calling the abfgp.py script, which in turn calls a lot of third-party software
def abfgp(id_, pid):
    out = osjoin(outbase, id_)
    if not exists(out): mkdir(out)

    # logfile
    log = osjoin(outbase, "log_process_%s" %(pid))
    try:
        # call the script
        system("%s --dna %s --multifasta %s --target %s -o %s -q >>%s" %(abfgp_cmd, osjoin(inbase, id_, id_ +".dna.fa"), osjoin(inbase, id_, "informants.mfa"), id_, out, log))
    except:
        print "ABFGP FAILED"
        return

# parse the output
def extractGff(id_):
    # code not relevant
    pass


# function called by multiple processes, using the Queue
def run(Q, pid):
    while not Q.empty():
        try:
            d = Q.get()             
            print "%s\t=>>\t%s" %(str(i-Q.qsize()), d)          
            abfgp(d, pid)
            Q.task_done()
        except KeyboardInterrupt:
            exit("Interrupted Child")

# list of directories
genedirs = [d for d in listdir(inbase)]
genes = gread(refGff)
for d in genedirs:
    i += 1
    indir = osjoin(inbase, d)
    outdir = osjoin(outbase, d)
    Q.put(d)

# this loop creates the multiple processes
procs = []
for pid in range(num_p):
    try:
        p = Process(target=run, args=(Q, pid+1))
        p.daemon = True
        procs.append(p) 
        p.start()
    except KeyboardInterrupt:
        print "Aborting start of child processes"
        for x in procs:
            x.terminate()
        exit("Interrupted")     

try:
    for p in procs:
        p.join()
except:
    print "Terminating child processes"
    for x in procs:
        x.terminate()
    exit("Interrupted")

print "Parsing output..."
for d in genedirs: extractGff(d)

Now the problem is that abfgp.py uses the os.chdir function, which seems to disrupt the parallel processing. I get a lot of errors stating that some (input/output) files or directories cannot be found for reading or writing, even though I call the script through os.system(), which I thought would spawn a separate process and prevent this interference.

How can I work around this chdir interference?

Edit: I might change os.system() to subprocess.Popen(cwd="...") with the right directory. I hope this makes a difference.
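For reference, a minimal sketch of what that change could look like in abfgp() above. Passing the per-job output directory as cwd is an assumption on my part, not something abfgp.py documents:

from subprocess import Popen

def abfgp(id_, pid):
    out = osjoin(outbase, id_)
    if not exists(out): mkdir(out)
    log = open(osjoin(outbase, "log_process_%s" % pid), "a")
    # build the same command as before, but as an argument list;
    # abfgp_cmd ("python .../abfgp.py") must be split into two elements
    cmd = abfgp_cmd.split() + [
        "--dna", osjoin(inbase, id_, id_ + ".dna.fa"),
        "--multifasta", osjoin(inbase, id_, "informants.mfa"),
        "--target", id_, "-o", out, "-q"]
    # cwd gives each child process its own starting directory, so a
    # chdir inside abfgp.py no longer depends on the wrapper's cwd
    p = Popen(cmd, stdout=log, stderr=log, cwd=out)
    p.wait()
    log.close()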

Thanks.

Sander
    Why are you using `os.system` rather than `subprocess.call`? It would be far less messy without that string interpolation. – Dan D. Feb 26 '14 at 10:16
  • Good tip, and you are right :), but as I stated, I thought os.system would solve the chdir interference – Sander Feb 26 '14 at 10:28

2 Answers


Edit 2

Do not use os.system(); use subprocess.call().

system("%s --dna %s --multifasta %s --target %s -o %s -q >>%s" %(abfgp_cmd, osjoin(inbase, id_, id_ +".dna.fa"), osjoin(inbase, id_, "informants.mfa"), id_, out, log))

would translate to

subprocess.call(abfgp_cmd.split() + ['--dna', osjoin(inbase, id_, id_ + ".dna.fa"), '--multifasta', osjoin(inbase, id_, "informants.mfa"), '--target', id_, '-o', out, '-q']) # without the log; note abfgp_cmd ("python .../abfgp.py") must be split into separate list elements
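If you also want to keep the >>log redirection from the original command, subprocess can append to an open file instead of going through the shell. A sketch, where opening the log in append mode to mirror the shell's >> is the only assumption:

import subprocess

logfile = open(log, "a")  # "a" mimics the shell's >> append redirection
subprocess.call(abfgp_cmd.split() +
                ['--dna', osjoin(inbase, id_, id_ + ".dna.fa"),
                 '--multifasta', osjoin(inbase, id_, "informants.mfa"),
                 '--target', id_, '-o', out, '-q'],
                stdout=logfile, stderr=subprocess.STDOUT)
logfile.close()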

Edit 1

I think the problem is that multiprocessing is using the module names to serialize functions and classes.

This means that if you do import module, where module is in ./module.py, and you then do something like os.chdir('./dir'), you would now need from .. import module.

The child processes inherit the working directory of the parent process. This may be a problem.
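A quick, self-contained check of those semantics: a child inherits the parent's directory when it starts, but an os.chdir in the child afterwards does not change the parent's cwd (this snippet assumes a /tmp directory exists):

import os
from multiprocessing import Process

def child():
    os.chdir('/tmp')  # change the directory in the child only
    print('child cwd: %s' % os.getcwd())

if __name__ == '__main__':
    p = Process(target=child)
    p.start()
    p.join()
    # the parent's directory is unaffected by the child's chdir
    print('parent cwd: %s' % os.getcwd())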

Solutions

  1. Make sure that all modules are imported (in the child processes) and only change the directory after that.
  2. Insert the original os.getcwd() into sys.path to enable imports from the original directory. This must be done before any functions are called from the local directory.
  3. Put all functions that you use inside a directory that can always be imported. site-packages could be such a directory. Then you can do something like import module; module.main() to start what you do.
  4. This is a hack that I do because I know how pickle works. Only use this if other attempts fail. The script below prints:

    serialized # the function runD is serialized
    string executed # before the function is loaded the code is executed
    loaded # now the function run is deserialized
    run # run is called
    

    In your case you would do something like this:

    runD = evalBeforeDeserialize('__import__("sys").path.append({})'.format(repr(os.getcwd())), run)
    p = Process(target=runD, args=(Q, pid+1))
    

    This is the script:

    # functions that you need
    
    class R(object):
        def __init__(self, call, *args):
            self.ret = (call, args)
        def __reduce__(self):
            return self.ret
        def __call__(self, *args, **kw):
            raise NotImplementedError('this should never be called')
    
    class evalBeforeDeserialize(object):
        def __init__(self, string, function):
            self.function = function
            self.string = string
        def __reduce__(self):
            return R(getattr, tuple, '__getitem__'), \
                     ((R(eval, self.string), self.function), -1)
    
    # code to show how it works        
    
    def printing():
        print('string executed')
    
    def run():
        print('run')
    
    runD = evalBeforeDeserialize('__import__("__main__").printing()', run)
    
    import pickle
    
    s = pickle.dumps(runD)
    print('serialized')
    run2 = pickle.loads(s)
    print('loaded')
    run2()
    

Please report back if these do not work.

User
  • I appreciate your effort, but I think you get me wrong. I cannot change the code in "abfgp.py", which is the one using chdir. So if I spawn multiple processes of abfgp.py, they will chdir per process. These different processes are interfering with each other, changing input and output directory for each other. So I can't change the imports. – Sander Feb 26 '14 at 11:13
  • Does this really mean that if you do `os.chdir` in one process it changes `os.getcwd()` in the other process? – User Feb 26 '14 at 11:21
  • I find it pretty strange too, but that is what I am experiencing. I think the cwd is stored in sys.path (didn't check), which is global for all Python processes, right? I might try this one: http://stackoverflow.com/questions/13757734/working-in-different-directories-os-chdir-in-the-same-time-parallel-threading – Sander Feb 26 '14 at 11:32
  • Each process has its own sys.path. In the beginning they should all look the same but they can be changed. Does subprocess work? Which operating system do you use? Maybe it is also the right way to change the code in a copy and use the copy. – User Feb 26 '14 at 12:25
  • I'm trying to run it on some kind of supercomputer running Ubuntu. I changed the code to subprocess.call. It takes a while before I'll see whether it succeeds or I get some errors. – Sander Feb 26 '14 at 13:20
  • I changed os.system() to subprocess.call(). Theoretically it shouldn't make any difference, because call() is a more general function than system(), but still I haven't gotten any errors. – Sander Feb 27 '14 at 14:27

You could determine which instance of the os library the unalterable program is using, then create a tailored version of chdir in that library that does what you need: prevent the directory change, log it, whatever. If the tailored behavior needs to apply only to the single program, you can use the inspect module to identify the caller and tailor the behavior in a specific way for just that caller.
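For example, a sketch of such a shim; the logging behavior and the abfgp.py filename check are illustrative assumptions:

import inspect
import os

_real_chdir = os.chdir

def _tailored_chdir(path):
    # inspect the calling frame to single out one specific caller
    caller = inspect.stack()[1][1]  # filename of the calling module
    if caller.endswith('abfgp.py'):
        # tailor the behavior: log the attempt and skip the change
        print('blocked chdir(%r) from %s' % (path, caller))
        return
    _real_chdir(path)

os.chdir = _tailored_chdir  # install before the unalterable program runs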

Your options are limited if you truly can't alter the existing program; but if you have the option of altering libraries it imports, something like this could be a least-invasive way to skirt the undesired behavior.

Usual caveats apply when altering a standard library.

Chris Johnson