
We have a folder with 50 data files (next-gen DNA sequences) that need to be converted by running a Python script on each one. The script takes 5 hours per file, is single-threaded, and is largely CPU-bound (the CPU core runs at 99% with minimal disk I/O).

Since I have a 4-core machine, I'd like to run 4 instances of this script at once to vastly speed up the process.

I guess I could split the data into 4 folders and run the following bash script on each folder at the same time:

files=`ls -1 *`
for file in $files;
do
   out="$file.out"
   python fastq_groom.py "$file" "$out"
done

But there must be a better way of running it on the one folder. We can use Bash/Python/Perl/Windows to do this.
(Sadly, making the script multi-threaded is beyond what we can do.)


Using @phs's xargs solution was the easiest way for us to solve the problem. We are, however, requesting that the original developer implement @Björn's answer. Once again, thanks!

Jon Rhoades
    The use of `ls` in backticks, and assigning the value to a variable to boot, is a frequent antipattern. It will break on file names with spaces, and it will break if you have subdirectories. The correct idiom is `for file in *` - note also the absence of a dollar sign when naming a variable; you use a dollar sign when interpolating a variable. See also http://partmaps.org/era/unix/award.html#ls – tripleee Jan 23 '12 at 08:38

4 Answers


You can use the multiprocessing module. I suppose you have a list of files to process and a function to call for each file. Then you could simply use a worker pool like this:

from multiprocessing import Pool, cpu_count

pool = Pool(processes=cpu_count())
pool.map(process_function, file_list, chunksize=1)

If your process_function doesn't return a value, you can simply ignore the return value of pool.map.
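For the specific case in the question, a minimal sketch might look like the following, assuming fastq_groom.py is invoked as an external command via subprocess; the glob pattern and the process_function name here are illustrative, not part of the original script:

import glob
import subprocess
from multiprocessing import Pool, cpu_count

def process_function(filename):
    # Run the existing single-threaded script on one file;
    # the ".out" suffix matches the naming used in the question.
    subprocess.check_call(["python", "fastq_groom.py", filename, filename + ".out"])

if __name__ == "__main__":
    file_list = glob.glob("*.fastq")    # assumed pattern; adjust to match your data files
    pool = Pool(processes=cpu_count())  # one worker per CPU core
    pool.map(process_function, file_list, chunksize=1)

The __main__ guard matters if you ever run this on Windows, where multiprocessing re-imports the module in each worker process.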

Björn Pollex

If you have GNU Parallel you can do:

parallel python fastq_groom.py {} {}.out ::: files*

It will do The Right Thing by spawning one job per core, even if the names of your files contain spaces, ', or " characters. It also makes sure the output from different jobs is not mixed together, so if you use the output you are guaranteed that you will not get half a line from two different jobs.

GNU Parallel is a general parallelizer and makes it easy to run jobs in parallel on the same machine or on multiple machines you have ssh access to.

If you have 32 different jobs you want to run on 4 CPUs, a straightforward way to parallelize is to run 8 jobs on each CPU:

Simple scheduling

GNU Parallel instead spawns a new process when one finishes, keeping the CPUs active and thus saving time:

GNU Parallel scheduling

Installation

If GNU Parallel is not packaged for your distribution, you can do a personal installation, which does not require root access. It can be done in 10 seconds by doing this:

(wget -O - pi.dk/3 || curl pi.dk/3/ || fetch -o - http://pi.dk/3) | bash

For other installation options see http://git.savannah.gnu.org/cgit/parallel.git/tree/README

Learn more

See more examples: http://www.gnu.org/software/parallel/man.html

Watch the intro videos: https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1

Walk through the tutorial: http://www.gnu.org/software/parallel/parallel_tutorial.html

Sign up for the email list to get support: https://lists.gnu.org/mailman/listinfo/parallel

Ole Tange

Take a look at xargs. Its -P option offers a configurable degree of parallelism. Specifically, something like this should work for you:

ls files* | awk '{print $1,$1".out"}' | xargs -P 4 -n 2 python fastq_groom.py
phs
    this will break with files that have spaces in their names and probably other nasties like newlines and the like – SiegeX Jan 23 '12 at 07:53

Give this a shot:

#!/bin/bash

files=( * )
for ((i=0;i<${#files[@]};i+=4)); do
  {
     python fastq_groom.py "${files[$i]}" "${files[$i]}".out &
     python fastq_groom.py "${files[$i+1]}" "${files[$i+1]}".out &
     python fastq_groom.py "${files[$i+2]}" "${files[$i+2]}".out &
     python fastq_groom.py "${files[$i+3]}" "${files[$i+3]}".out &
     wait   # block until this batch of four jobs has finished
  }
done

This puts all the files into an array named files. It then executes and backgrounds four python processes on the first four files. As soon as all four of those processes are complete, it executes the next four. It's not as efficient as always keeping a queue of four jobs running, but if all processes take around the same amount of time, it should be pretty close to that.

Also, please, please, please don't use the output of ls like that. Just use standard globbing, as in: for file in *.txt; do ...; done

SiegeX