
I am running a look-up-and-edit job. I have a file with a unique identifier in the first column and data in the 10th and 11th columns that needs to be corrected. This file has about 40-100M lines. The file with the correct information has 4x as many lines and repeats every 4 lines: line 1 of each group holds the identifier, line 2 the correct data for column 10, and line 4 the correct data for column 11. I have two programs: one splits file 1 into 250,000-line fragments and runs the following program on each fragment in parallel across multiple cores.

#! /bin/bash
#$ -l h_rt=01:00:00,vf=1G
#$ -pe smp 1
#$ -cwd
#$ -j y
#$ -N unmasked
#$ -V

for line in `cut -f 1 $1`; do
        seq=`awk -v a="$line" '$1~a{getline;print;exit}' ../406.fastq`
        qual=`awk -v a="$line" '$1~a{getline;getline;getline;print;exit}' ../406.fastq`
        awk -v s="$seq" -v q="$qual" -v l="$line" 'FS="\t" {if ($1~l) {$10=s;$11=q; print $0}}' $1 >> $1\D
done
rm $1

Unfortunately, each pass through this loop takes about 4-6 seconds, and at 250,000 lines that comes to about 5 days while occupying a large part of the computer cluster I am using.

Any tips on doing this quicker and more efficiently? I am open to pretty much anything...

jeffpkamp
  • You could start by not running `awk` over the same file twice to pull out different values. That's just wasteful. – Mark Reed Dec 05 '13 at 16:35
  • Look at [GNU Parallel](http://www.gnu.org/software/parallel/). – Yohann Dec 05 '13 at 16:48
  • This is not a CPU-bound problem. Parallelism isn't going to help unless you intend to spread the data over multiple disks. – slim Dec 05 '13 at 17:09

1 Answer


Shell scripting isn't a great fit for this kind of job. This program spawns three short-lived awk processes per line of input, and while UNIX process creation is cheaper than on Windows, you still don't want to pay the process creation overhead 300M times!

(Correction: process creation is the least of your worries. It's reading through a 400M line file twice on each iteration!)

Use your preferred "real" scripting language -- I'd be tempted to use Perl, but Python is a fine choice too. It can probably be done in a self-contained awk script too, but if you were that good at awk, you wouldn't be asking this question -- and Perl exists so you don't have to become an awk guru!

Write a script along the lines of this pseudocode, which holds both files open, and assumes that both of them have the information in the same order.

 open file1 and file2
 read 1 line from file1 and 4 lines from file2 into string variables
 while(reads didn't fail) {
     parse desired information from lines
     output in the format you want
     read 1 line from file1 and 4 lines from file2 into string variables
 }
 close both files
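
For instance, a minimal Python sketch of that loop might look like the following. The filenames are placeholders, and I'm assuming your first file is tab-separated with at least 11 columns and that the second file is the usual 4-line FASTQ layout described in the question:

# Minimal sketch, assuming both files list their records in the same
# order, file1 is tab-separated, and every row has at least 11 columns.
# "file1.txt" and "406.fastq" are placeholders for your actual files.
import sys

with open("file1.txt") as table, open("406.fastq") as fastq:
    for row in table:
        ident = fastq.readline()        # @identifier line
        seq = fastq.readline()          # replacement for column 10
        fastq.readline()                # '+' separator line, ignored
        qual = fastq.readline()         # replacement for column 11
        if not qual:                    # ran out of FASTQ records
            break
        fields = row.rstrip("\n").split("\t")
        fields[9] = seq.rstrip("\n")    # column 10 (index 9)
        fields[10] = qual.rstrip("\n")  # column 11 (index 10)
        sys.stdout.write("\t".join(fields) + "\n")

It is worth keeping the identifier line around so you can check that it really matches column 1 of the current row; if that check ever fails, the files are not in the same order after all.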

You'll probably find this is fast enough that there's no need to try and parallelise it. I would expect it to be constrained by disk access, not CPU.


If the two files are not in the same order, you have more of a problem. Sorting 100M items is not cheap. Your easiest option here is to first iterate through the longer file, putting the values you need into a map data structure such as a Perl hash or a Python dictionary (or even a database like Redis), then iterate through the shorter file, pulling the values you need to rewrite each line out of the map.
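
Here is a rough sketch of that approach in Python. Again the filenames are placeholders, and I'm assuming the FASTQ identifier line, with the leading `@` stripped, matches column 1 of the shorter file; verify that against your data before relying on it:

# Sketch for the out-of-order case: build a lookup table from the big
# FASTQ-style file, then stream the smaller file through it.
# Filenames and the exact key format are assumptions about your data.
import sys

lookup = {}
with open("406.fastq") as fastq:
    while True:
        header = fastq.readline()
        seq = fastq.readline()
        fastq.readline()                 # '+' separator line
        qual = fastq.readline()
        if not qual:                     # end of file
            break
        key = header.rstrip("\n").lstrip("@").split()[0]
        lookup[key] = (seq.rstrip("\n"), qual.rstrip("\n"))

with open("file1.txt") as table:
    for row in table:
        fields = row.rstrip("\n").split("\t")
        if fields[0] in lookup:
            fields[9], fields[10] = lookup[fields[0]]
        sys.stdout.write("\t".join(fields) + "\n")

Bear in mind that with ~100M records the dictionary itself can easily run to tens of gigabytes of RAM, which is exactly the point where pushing the key/value pairs into something like Redis (or another on-disk key-value store) starts to make sense.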

slim