I am running a look-up-and-edit task. I have a file with a unique identifier in the first column and data in the 10th and 11th columns that needs to be corrected. This file has about 40-100 million lines. The file with the correct information has 4x as many lines and repeats every 4 lines: line 1 of each record has the identifier, line 2 has the correct data for column 10, and line 4 has the correct data for column 11. I have two programs: the first splits file 1 into 250,000-line fragments, and the second (below) is then run on the fragments in parallel on multiple cores.
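To make the layout concrete, here is a made-up record from each file (the identifiers and values are invented; in my case the correction file is a FASTQ):

file 1 (tab-delimited; only the relevant columns shown):
read0001    ...    OLDSEQ(col 10)    OLDQUAL(col 11)

correction file, 4 lines per record:
@read0001       (line 1: identifier)
ACGTACGTACGT    (line 2: correct value for column 10)
+               (line 3: not used)
FFFFIIIIFFFF    (line 4: correct value for column 11)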
#! /bin/bash
#$ -l h_rt=01:00:00,vf=1G
#$ -pe smp 1
#$ -cwd
#$ -j y
#$ -N unmasked
#$ -V
for line in $(cut -f 1 "$1"); do
    # pull the corrected sequence (the line after the matching header) and the
    # corrected quality (three lines after it) out of the FASTQ
    seq=$(awk -v a="$line" '$1~a{getline;print;exit}' ../406.fastq)
    qual=$(awk -v a="$line" '$1~a{getline;getline;getline;print;exit}' ../406.fastq)
    # rewrite the matching row; FS/OFS are set in BEGIN so the tabs are preserved
    awk -v s="$seq" -v q="$qual" -v l="$line" 'BEGIN{FS=OFS="\t"} $1~l {$10=s; $11=q; print}' "$1" >> "${1}D"
done
rm $1
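For reference, the splitting/submission step is essentially the following (the fragment prefix and script name are placeholders for whatever I actually use):

split -l 250000 file1 frag_            # cut file 1 into 250,000-line pieces
for f in frag_*; do
    qsub correct_fragment.sh "$f"      # submit the per-fragment script above for each piece
done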
Unfortunately each pass through the loop takes about 4-6 seconds, and at 250,000 lines per fragment that will take about 5 days and occupy a large part of the computer cluster I am using.
Any tips on doing this faster and more efficiently? I am open to pretty much anything...
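For what it is worth, one idea I have been toying with is to avoid scanning ../406.fastq twice per identifier by loading it into awk arrays once and then fixing the table in a single pass. A rough, untested sketch (file1 and file1.corrected are placeholders, and it assumes the column-1 identifier matches the FASTQ header up to the first space with the leading @ removed):

awk 'BEGIN { FS = OFS = "\t" }
     NR == FNR {                                    # first input: the FASTQ
         if      (FNR % 4 == 1) { split($0, h, " "); id = substr(h[1], 2) }   # header -> identifier
         else if (FNR % 4 == 2) { seq[id]  = $0 }   # correct value for column 10
         else if (FNR % 4 == 0) { qual[id] = $0 }   # correct value for column 11
         next
     }
     $1 in seq { $10 = seq[$1]; $11 = qual[$1] }    # second input: the table to fix
     { print }' ../406.fastq file1 > file1.corrected

I have no idea whether the arrays would fit in memory with 40-100 million records, though, so it might have to be run per fragment (or the files sorted by identifier and joined instead).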