0

I am trying to create a script that removes read groups from the header of a sam file. The code, run from the command line is below.

samtools view -H e2_20.indel.recal.dedup.bam | awk ' BEGIN {FS = "\t"} {split($2,a,":")} {if ($1 != "@RG" || ($1 =="@RG" && a[2] == "e2_20")) print}' | samtools reheader - e2_20.indel.recal.dedup.bam | samtools view -H -

Here is what a sample of the input file looks like

@HD VN:1.4  GO:none SO:coordinate
@SQ SN:chr10 LN:129993255
@SQ SN:chr11 LN:121843856
@RG ID:e2_20 PL:illumina
@RG ID:e2_9 PL:illumina
@PG ID:GATK IndelRealigner

After running the command above, the output is

@HD VN:1.4  GO:none SO:coordinate
@SQ SN:chr10 LN:129993255
@SQ SN:chr11 LN:121843856
@RG ID:e2_20 PL:illumina
@PG ID:GATK IndelRealigner.

Basically, I am just removing lines that start with "@RG" and are not of ID e2_20.

The problem is that if I run this in a bash script the command does not work.

The script is below.

#!/bin/bash
#$ -j y
#$ -S /bin/bash
#$ -V
#$ -cwd
source /apps1/modules/init/bash
module load samtools/gnu/1.1
input=$1
output=$2
samp=$3
samtools view -H ${input} | awk ' BEGIN {FS = "\t"} {split($2,a,":")} {if ($1 != "@RG" || ($1 =="@RG" && a[2] == "${samp}")) print}' | samtools reheader - ${input} > ${output}
echo "samtools view -H ${input} | awk ' BEGIN {FS = "\t"} {split($2,a,":")} {if ($1 != "@RG" || ($1 =="@RG" && a[2] == "${samp}")) print}' | samtools reheader - ${input} > ${output}"

This script is run in an SGE cluster, which is the reason for the weird syntax at the top of the script. The name of the script is reHeadBams.bash. I run the script on the shell by entering:

qsub reHeadBams.bash e2_20.indel.recal.dedup.bam e2_20.prac.indel.recal.dedup.bam e2_20 

The arguments of the command are the input file, followed by the output file, and lastly the sample or read group I am trying to find.

The output of the script looks something like this:

@HD VN:1.4  GO:none SO:coordinate
@SQ SN:chr10    LN:129993255
@SQ SN:chr11    LN:121843856
@PG ID:GATK IndelRealigner.

So the script removed all the read groups, instead of the one with ID e2_9.

I echoed the command from the script and the output is

samtools view -H e2_20.indel.recal.dedup.bam | awk ' BEGIN {FS = '\t'} {split(e2_20.prac.indel.recal.dedup.bam,a,:)} {if (e2_20.indel.recal.dedup.bam != '@RG' || (e2_20.indel.recal.dedup.bam =='@RG' && a[2] == 'e2_20')) print}' | samtools reheader - e2_20.indel.recal.dedup.bam > e2_20.prac.indel.recal.dedup.bam

While I may be mistaken, the problem seems to be that awk is using my command line arguments instead of the columns of the input file as $1 and $2. Does anyone know why this is happening?

Sorry for the long winded description of my problem. If you need any clarification let me know.

Cyrus
  • 84,225
  • 14
  • 89
  • 153
szimmerman
  • 21
  • 1
  • 2
  • The double quotes on the argument to `echo` are enabling variable expansion for the entire string (even for internal parts that are single quoted). Don't use `echo` to debug this. Use `set +x` if you can. Conversely the problem in your command is that the single quotes around the `awk` script are **preventing** variable expansion so `${samp}` is staying a literal string. – Etan Reisner Apr 27 '15 at 17:34
  • just curious, why would you need to remove a read-group from the header ? – Pierre Apr 27 '15 at 19:16
  • Thanks a lot Etan for your help. I just needed to use the -v parameter in awk to use the samp variable. – szimmerman Apr 27 '15 at 19:52
  • How do you say a question has been answered? @Pierre the reason why I need to remove the read-group is because I originally merged several bamfiles together so I could do local realignment around indels between my samples using tools from GATK. When I merged the files, the read groups for each sample were written to the header. Then when I separated the file back into separate samples each file kept the merged header. When I wanted to identify variants with haplotype caller or mutect, the header did not match correctly with my sample. – szimmerman Apr 27 '15 at 20:06
  • To say the question has been answered you can provide an answer yourself, then mark it as the answer. Also, I'm curious about your reason for doing pooled indel realignment with GATK? I'd suggest looking at their Best Practice Workflows. Per-sample indel realignment is less computationally expensive and it avoids the issue you're currently dealing with. – afinit Apr 28 '15 at 22:04

1 Answers1

0

The problem was that I needed the import the variable to "samp" variable to awk using the -v parameter as shown below.

Before correction

samtools view -H ${input} | awk ' BEGIN {FS = "\t"} {split($2,a,":")} {if ($1 != "@RG" || ($1 =="@RG" && a[2] == "${samp}")) print}' | samtools reheader - ${input} > ${output}

After Correction

samtools view -H ${input} | awk -v samp="$3" ' BEGIN {FS = "\t"} {split($2,a,":")} {if ($1 != "@RG" || ($1 =="@RG" && a[2] == samp)) print}' | samtools reheader - ${input} > ${output}
szimmerman
  • 21
  • 1
  • 2