-1

This is a two-part question:

  1. help interpreting an error;
  2. help with coding.

  1. I'm trying to run bwa-mem and sambamba to aling raw reads to a reference genome and to sort by position. These are the commands I'm using:
bwa mem  \
  -K 100000000 -v 3 -t 6 -Y \
  -R '\@S200031047L1C001R002\S*[1-2]' \
  /path/to/reference/GCF_009858895.2_ASM985889v3_genomic.fna \
  /path/to/raw-fastq/'S\S[^_]*_L01_[0-9]+-[0-9]+'_1.fq.gz \
  /path/to/raw-fastq/'S\S[^_]*_L01_[0-9]+-[0-9]+'_2.fq.gz | \
/path/to/genomics/sambamba-0.8.2 view -S -f bam \
  /dev/stdin | \
/path/to/genomics/sambamba-0.8.2 sort  \
  /dev/stdin \
  --out host_removal/${SAMPLE}/${SAMPLE}.hybrid.sorted.bam

This is the error message I'm getting: [E::bwa_set_rg] the read group line is not started with @RG.

My sequences were generated with an MGI sequencer and the readgroups are identified like this @S200031047L1C001R0020000243/1, i.e., they don't beging with an @RG. How can I specify to sambamba that my readgroups start with @S and not @RG?


  1. The commands written above are a published pipeline I'm modifying for my own research. However, among several changes, I'm not confident on how to define sample id as such stated in the last line of the code: --out host_removal/${SAMPLE}/${SAMPLE}.hybrid.sorted.bam (I'm referring to ${SAMPLE}). Any insights?

Thank you very much!

  • I think you'll get answers if you delete this post and repost here https://bioinformatics.stackexchange.com – user438383 Jun 20 '22 at 16:37

1 Answers1

0

1. Specifying read groups

Your read group string is not correctly formatted. It should be like '@RG\tID:$ID\tSM:$SM\tLB:$LB\tPU:$PU\tPL:$PL' where the parts beginning with a $ sign should be replaced with the information specific to your sequencing run and sample. Not all of them are required for all purposes. See this read group documentation by GATK team for an example.

Read group specification always begins with @RG. That's part of SAM format. Sequencers do not produce read groups. I think you may be confusing them with fastq header lines. Entries in the read group string are separated by tabs, denoted with \t. Tags and their values are separated by :.

The difference between $ID (read group id) and $SM (sample id) is that sample is the individual or biological sample which may have been sequenced several times in different libraries ($LB). In the GATK documentation they combine flowcell and library into the read group id. Sample and library could make an intuitive read group id in small projects. If you are working on your own project that is not part of a larger sequencing effort, you can define the read groups as you like. If several people work in the same project, you should be consistent to avoid problems later.

2. Variable substitution

I'm not sure if I understood you correctly, but if you are wondering what ${SAMPLE} means in the command, it's a variable called SAMPLE that will be replaced by its value when the command is run. The curly brackets protect the name so that the shell does not confuse the variable name with characters coming after it. See here for examples.

Cloudberry
  • 240
  • 2
  • 8