I am trying to create a script that removes read groups from the header of a sam file. The code, run from the command line is below.
samtools view -H e2_20.indel.recal.dedup.bam | awk ' BEGIN {FS = "\t"} {split($2,a,":")} {if ($1 != "@RG" || ($1 =="@RG" && a[2] == "e2_20")) print}' | samtools reheader - e2_20.indel.recal.dedup.bam | samtools view -H -
Here is what a sample of the input file looks like
@HD VN:1.4 GO:none SO:coordinate
@SQ SN:chr10 LN:129993255
@SQ SN:chr11 LN:121843856
@RG ID:e2_20 PL:illumina
@RG ID:e2_9 PL:illumina
@PG ID:GATK IndelRealigner
After running the command above, the output is
@HD VN:1.4 GO:none SO:coordinate
@SQ SN:chr10 LN:129993255
@SQ SN:chr11 LN:121843856
@RG ID:e2_20 PL:illumina
@PG ID:GATK IndelRealigner.
Basically, I am just removing lines that start with "@RG" and are not of ID e2_20.
The problem is that if I run this in a bash script the command does not work.
The script is below.
#!/bin/bash
#$ -j y
#$ -S /bin/bash
#$ -V
#$ -cwd
source /apps1/modules/init/bash
module load samtools/gnu/1.1
input=$1
output=$2
samp=$3
samtools view -H ${input} | awk ' BEGIN {FS = "\t"} {split($2,a,":")} {if ($1 != "@RG" || ($1 =="@RG" && a[2] == "${samp}")) print}' | samtools reheader - ${input} > ${output}
echo "samtools view -H ${input} | awk ' BEGIN {FS = "\t"} {split($2,a,":")} {if ($1 != "@RG" || ($1 =="@RG" && a[2] == "${samp}")) print}' | samtools reheader - ${input} > ${output}"
This script is run in an SGE cluster, which is the reason for the weird syntax at the top of the script. The name of the script is reHeadBams.bash. I run the script on the shell by entering:
qsub reHeadBams.bash e2_20.indel.recal.dedup.bam e2_20.prac.indel.recal.dedup.bam e2_20
The arguments of the command are the input file, followed by the output file, and lastly the sample or read group I am trying to find.
The output of the script looks something like this:
@HD VN:1.4 GO:none SO:coordinate @SQ SN:chr10 LN:129993255 @SQ SN:chr11 LN:121843856
@PG ID:GATK IndelRealigner.
So the script removed all the read groups, instead of the one with ID e2_9.
I echoed the command from the script and the output is
samtools view -H e2_20.indel.recal.dedup.bam | awk ' BEGIN {FS = '\t'} {split(e2_20.prac.indel.recal.dedup.bam,a,:)} {if (e2_20.indel.recal.dedup.bam != '@RG' || (e2_20.indel.recal.dedup.bam =='@RG' && a[2] == 'e2_20')) print}' | samtools reheader - e2_20.indel.recal.dedup.bam > e2_20.prac.indel.recal.dedup.bam
While I may be mistaken, the problem seems to be that awk is using my command line arguments instead of the columns of the input file as $1 and $2. Does anyone know why this is happening?
Sorry for the long winded description of my problem. If you need any clarification let me know.