0

i'm trying to use BWA MEM to align some WGS files, but i notice something strange. When I used samtools flagstat to check these .bam files, I notice that most reads were unmapped.

76124692 + 0 in total (QC-passed reads + QC-failed reads)
308 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
708109 + 0 mapped (0.93% : N/A)
76124384 + 0 paired in sequencing
38062192 + 0 read1
38062192 + 0 read2
0 + 0 properly paired (0.00% : N/A)
12806 + 0 with itself and mate mapped
694995 + 0 singletons (0.91% : N/A)
11012 + 0 with mate mapped to a different chr
1682 + 0 with mate mapped to a different chr (mapQ>=5)

Previously, I used Samtofastq to convert my .bam file to .fastq. When I head this file, this is shown:

@SRR1513845.100000000/1
AACGAAACGAAAAGAAAAGAAAAGAAAGAAAAAGAAAGGAACAGAAAAG
+
AAA?=>'2&)&)&&))2(-'(,.%)&31%%'6/6,(1,501046124&6
@SRR1513845.100000000/2
AATTAATTAAGCCCCGAAGGAAGCGAGAAACACTG
+
AAA?B=AB@A@A=?A>AA@?.@?8<.1;><*17?<
@SRR1513845.100000001/1
TATAACCATATAACAAATCCAAGCCCAACAGAGAAGAGAAACAAAAAGA
+
>27<@>&856;.'.&9.%>%::-5194&:+'5);;%1&'/%%999%5(8
@SRR1513845.100000001/2
TCCAACTGATATCGTAATT
+
@3<@A>:8;?:383>=3:=
@SRR1513845.100000003/1
TATCGGTCTTGTTTAG
+
=1;=6?(4>4A13?0A
@SRR1513845.100000003/2
TTCAGGTGCCTCGAAGTTGGATAAGG
+
==>>9@;?3<A5>7);)<9-<25<9?
@SRR1513845.100000004/1
GTCATTTAGCCCAAGAGAATGGC
+
BB@ABA@@A?</A>>25A;@4:5
@SRR1513845.100000004/2
GGAGATCGAGTCAAATTTTATGCTAGGTAT
+
%A:<@7A@@=4AA?7<A5>@;3&?>>:;:>
@SRR1513845.100000012/1
GCGTCGTTATCCAAAA
+
>A:9:?88=<=0&>>9
@SRR1513845.100000012/2
TGGAAATATTTATTACCCCCCCCCCCCCCCCCCCCCCCCCC
+
A;>@A;4;=??8=:@;-4<?632;=:67;>=):9>9%88=9
@SRR1513845.100000016/1
CGTGGAATGGGGTGTGATTTAATTATCGAATGGCGTCCGATCCAGATT

These characters (<.@;:) are normal and influence in bwa's alignment?

Here is my bwa code:

bwa mem -M -t 38 -p hsa_GRCh38.fa SRR1513_fastqtosam.fq -o SRRR1513_aligned.bam

and my samtofastq code

java -Xmx8G -jar picard.jar SamToFastq \
I= SRR1513_fastqtosam.bam \
FASTQ= SRR1513_fastqtosam.fq \
CLIPPING_ATTRIBUTE=XT \
CLIPPING_ACTION=2 \
INTERLEAVE=true \
NON_PF=true TMP_DIR=./temp

I'm stuck in this from a few hours. Thanks in advance!

UPDATE:

I just notice a flag during bwa mem alignment

[M::mem_pestat] skip orientation FF as there are not enough pairs
[M::mem_pestat] skip orientation RF as there are not enough pairs
[M::mem_pestat] skip orientation FR as there are not enough pairs
[M::mem_pestat] skip orientation RR as there are not enough pairs
  • 1
    Yes, those qualities seems to be OK. They are not high, but technically, they are correct. Which version of Picard are you using? The NON_PF flag is not documented in the last version. Anyways, I would not recover the non passing filter reads, but it is up to you and your analysis. In the Picard line, there should not be spaces around the equal signs. I assume this is not your real command, otherwise you would had got an error. Regarding BWA, there are some weird quotes that should make your line fail. Also, the output is a sam file, not a bam. Which organism are you trying to align? – Poshi Mar 16 '20 at 20:31
  • @Poshi I'm using version 2.20.1 and I'm trying to align a Homo sapiens sample. And, sorry, these quotes are not part of the code, I'm going to edit the publication – sciencenmagic Mar 16 '20 at 20:37
  • Also, I forgot to mention, these are colorspace reads – sciencenmagic Mar 16 '20 at 20:46
  • Quite a new version. I'm surprised that parameter is not in the docs. Anyways, it is not much relevant. Human sample against hs38, seems good. The eleven sequences in the sample provided does not match any organism in the `nr` DDBB in a BLAST. It looks like some kind of contaminant or problem in the read, hence the kind of low qualities. Besides that, all the procedure is OK. – Poshi Mar 16 '20 at 20:46
  • No, these are not color space. Can you see the colors? I only see the bases. – Poshi Mar 16 '20 at 20:47
  • I used cutadapt to convert "numbers to nucleotides", to not cause trouble in bwa. – sciencenmagic Mar 16 '20 at 20:50
  • 1
    See? The data you provided here is in base-space ;-) The general advise is to align color-space data in color-space. Given the way the colors represents dinucleotides, it is your best bet to get good alignments in presence of divergence (sequencing errors or mutations). Try it. – Poshi Mar 16 '20 at 20:52
  • Okay! I only did that because bwa don't support colorspace reads, so the file needs to be double-encoded. Do you know any aligner that supports it? And do you think that this replacement should be one of the causes of unmapped reads? – sciencenmagic Mar 16 '20 at 20:59
  • I updated the question – sciencenmagic Mar 16 '20 at 21:11
  • Update irrelevant: if there's nothing aligning, obviously there cannot be enough properly aligned pairs to assess the orientations and insert sizes. Those messages are expected given the percentage of non-mapping reads. – Poshi Mar 16 '20 at 21:21
  • I never worked with color space reads. Just know a little bit of its theory. But I think BWA can cope with color space (http://seqanswers.com/forums/showthread.php?t=16621). And yes, I think this replacement can be the cause. Don't forget that, after a color read error, the rest of the sequence in base space is completely different from what it should be. – Poshi Mar 16 '20 at 21:23
  • 1
    In particular, if this data happens to be a capture, and at the beginning of the region captured there's a mutation, you will always have a large portion of sequence in base-space that will not match for that region. – Poshi Mar 16 '20 at 21:24

0 Answers0