unmapped reads using bwa

Question

i'm trying to use BWA MEM to align some WGS files, but i notice something strange. When I used samtools flagstat to check these .bam files, I notice that most reads were unmapped.

76124692 + 0 in total (QC-passed reads + QC-failed reads)
308 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
708109 + 0 mapped (0.93% : N/A)
76124384 + 0 paired in sequencing
38062192 + 0 read1
38062192 + 0 read2
0 + 0 properly paired (0.00% : N/A)
12806 + 0 with itself and mate mapped
694995 + 0 singletons (0.91% : N/A)
11012 + 0 with mate mapped to a different chr
1682 + 0 with mate mapped to a different chr (mapQ>=5)

Previously, I used Samtofastq to convert my .bam file to .fastq. When I head this file, this is shown:

@SRR1513845.100000000/1
AACGAAACGAAAAGAAAAGAAAAGAAAGAAAAAGAAAGGAACAGAAAAG
+
AAA?=>'2&)&)&&))2(-'(,.%)&31%%'6/6,(1,501046124&6
@SRR1513845.100000000/2
AATTAATTAAGCCCCGAAGGAAGCGAGAAACACTG
+
AAA?B=AB@A@A=?A>AA@?.@?8<.1;><*17?<
@SRR1513845.100000001/1
TATAACCATATAACAAATCCAAGCCCAACAGAGAAGAGAAACAAAAAGA
+
>27<@>&856;.'.&9.%>%::-5194&:+'5);;%1&'/%%999%5(8
@SRR1513845.100000001/2
TCCAACTGATATCGTAATT
+
@3<@A>:8;?:383>=3:=
@SRR1513845.100000003/1
TATCGGTCTTGTTTAG
+
=1;=6?(4>4A13?0A
@SRR1513845.100000003/2
TTCAGGTGCCTCGAAGTTGGATAAGG
+
==>>9@;?3<A5>7);)<9-<25<9?
@SRR1513845.100000004/1
GTCATTTAGCCCAAGAGAATGGC
+
BB@ABA@@A?</A>>25A;@4:5
@SRR1513845.100000004/2
GGAGATCGAGTCAAATTTTATGCTAGGTAT
+
%A:<@7A@@=4AA?7<A5>@;3&?>>:;:>
@SRR1513845.100000012/1
GCGTCGTTATCCAAAA
+
>A:9:?88=<=0&>>9
@SRR1513845.100000012/2
TGGAAATATTTATTACCCCCCCCCCCCCCCCCCCCCCCCCC
+
A;>@A;4;=??8=:@;-4<?632;=:67;>=):9>9%88=9
@SRR1513845.100000016/1
CGTGGAATGGGGTGTGATTTAATTATCGAATGGCGTCCGATCCAGATT

These characters (<.@;:) are normal and influence in bwa's alignment?

Here is my bwa code:

bwa mem -M -t 38 -p hsa_GRCh38.fa SRR1513_fastqtosam.fq -o SRRR1513_aligned.bam

and my samtofastq code

java -Xmx8G -jar picard.jar SamToFastq \
I= SRR1513_fastqtosam.bam \
FASTQ= SRR1513_fastqtosam.fq \
CLIPPING_ATTRIBUTE=XT \
CLIPPING_ACTION=2 \
INTERLEAVE=true \
NON_PF=true TMP_DIR=./temp

I'm stuck in this from a few hours. Thanks in advance!

UPDATE:

I just notice a flag during bwa mem alignment

[M::mem_pestat] skip orientation FF as there are not enough pairs
[M::mem_pestat] skip orientation RF as there are not enough pairs
[M::mem_pestat] skip orientation FR as there are not enough pairs
[M::mem_pestat] skip orientation RR as there are not enough pairs

Yes, those qualities seems to be OK. They are not high, but technically, they are correct. Which version of Picard are you using? The NON_PF flag is not documented in the last version. Anyways, I would not recover the non passing filter reads, but it is up to you and your analysis. In the Picard line, there should not be spaces around the equal signs. I assume this is not your real command, otherwise you would had got an error. Regarding BWA, there are some weird quotes that should make your line fail. Also, the output is a sam file, not a bam. Which organism are you trying to align? — Poshi, Mar 16 '20 at 20:31
@Poshi I'm using version 2.20.1 and I'm trying to align a Homo sapiens sample. And, sorry, these quotes are not part of the code, I'm going to edit the publication — sciencenmagic, Mar 16 '20 at 20:37
Quite a new version. I'm surprised that parameter is not in the docs. Anyways, it is not much relevant. Human sample against hs38, seems good. The eleven sequences in the sample provided does not match any organism in the `nr` DDBB in a BLAST. It looks like some kind of contaminant or problem in the read, hence the kind of low qualities. Besides that, all the procedure is OK. — Poshi, Mar 16 '20 at 20:46
No, these are not color space. Can you see the colors? I only see the bases. — Poshi, Mar 16 '20 at 20:47
I used cutadapt to convert "numbers to nucleotides", to not cause trouble in bwa. — sciencenmagic, Mar 16 '20 at 20:50
See? The data you provided here is in base-space ;-) The general advise is to align color-space data in color-space. Given the way the colors represents dinucleotides, it is your best bet to get good alignments in presence of divergence (sequencing errors or mutations). Try it. — Poshi, Mar 16 '20 at 20:52
Okay! I only did that because bwa don't support colorspace reads, so the file needs to be double-encoded. Do you know any aligner that supports it? And do you think that this replacement should be one of the causes of unmapped reads? — sciencenmagic, Mar 16 '20 at 20:59
Update irrelevant: if there's nothing aligning, obviously there cannot be enough properly aligned pairs to assess the orientations and insert sizes. Those messages are expected given the percentage of non-mapping reads. — Poshi, Mar 16 '20 at 21:21
I never worked with color space reads. Just know a little bit of its theory. But I think BWA can cope with color space (http://seqanswers.com/forums/showthread.php?t=16621). And yes, I think this replacement can be the cause. Don't forget that, after a color read error, the rest of the sequence in base space is completely different from what it should be. — Poshi, Mar 16 '20 at 21:23
In particular, if this data happens to be a capture, and at the beginning of the region captured there's a mutation, you will always have a large portion of sequence in base-space that will not match for that region. — Poshi, Mar 16 '20 at 21:24

unmapped reads using bwa

0 Answers0