0

I am trying to run GATK DepthOfCoverage on some BAM files that I have merged from two original files (the same sample was sequenced on two lanes to maximize the number of reads). I realized after the fact that my merged file has reads with different read groups (as reflected by the RG field of each read), and that the header of my two original files differed in their @RG fields.

I have tried running samtools reheader adding a new @RG field in the header, but when I merge the two files each read group is based on the name of the two BAM files, not on the name of the @RG in the headers of the two BAM files.

For example, my two starting samples are:

27163.pe.markdup.bam
27091.pe.markdup.bam

but when I merge them using samtools merge

samtools merge merged.bam 27163.pe.markdup.bam 27091.pe.markdup.bam 

The resulting merged.bam has the same @RG field in the header as only one of the two, and each of the reads has a read name based on the name of the file it came from as such:

Read 1

RG:Z:27091.pe.markdup

Read 2

RG:Z:27163.pe.markdup

etc. for the rest of the reads in the BAM

Am I doing something wrong? Should I rehead each of the original files before merging? Or simply rehead after merging to something that is compatible with GATK? It seems like no matter what the @RG field in the header is before merging, the merged file will always have reads with different RGs based on the name of the two input files.

I am also not sure what does GATK DepthOfCoverage want as input in terms of read groups. Does it want a single RG for all reads? In that case, should I use something different than samtools merge?

Thanks in advance for any help you can give me.

user3245575
  • 83
  • 2
  • 9

1 Answers1

1

For future reference, please see the worked out solution here:

https://www.biostars.org/p/105787/#107970

Basically the correct procedure is to merge using Picard instead of samtools which gives output compatible with GATK in terms of the bam file read group vocabulary.

user3245575
  • 83
  • 2
  • 9