I want to extract mapped reads from a SAM file (from a resistome analysis using AMR++) to taxonomically classify them.
I was searching from samtools manual and stackoverflow mainly and found these steps samtools view -@ 20 -S -b SRR4454621_1.alignment.sam > SRR4454621_1.bam ## to convert SAM to BAM samtools view -@ 20 -c SRR4454621_1.bam ### to count reads: 10000126 samtools view -@ 20 -c -F 260 SRR4454621_1.bam ### to count mapped reads: 53189 samtools view -@ 20 -b -F 4 SRR4454621_1.bam > SRR4454621_1_mapped.bam ### to get mapped reads samtools view -@ 20 -c SRR4454621_1_mapped.bam ### new check to count mapped reads: 53189 samtools bam2fq SRR4454621_1_mapped.bam | seqtk seq -A > SRR4454621_1_mapped.fa ## to extract sequences grep ">" SRR4454621_1_mapped.fa | wc -l ### to check whether everything is going rigth: 53063 (lost 126 sequences)
Then I run centrifuge to classify them. centrifuge -f -x centridb/hpvc testing/SRR4454621_1_mapped.fa -S testing/SRR4454621_1_mapped.tsv -p 24 --report-file testing/SRR4454621_1_mapped_report.tsv
And my problems is that the sum of column "numReads" from SRR4454621_1_mapped_report.tsv is 107682, and I would expect to be the same that the sum of equivalent column from resistome analysis which is 51961. Where can it be the problem? Are the main steps I described above well done?
Thank you very much for your help,
Manuel