I want to show how well conserved a primer is across a set of genomic sequences. I have a primer of about 23 bp and want to compare it against about 5,000 genomic sequences of roughly 10 kb each. Since that is too much for my computer to handle, I wanted to do the following:
> 1. Cut out the region where my primer is located, plus about 20 bp on each side.
> 2. Show only the bases that differ from my primer in my analysis.
> Example:
> Primer: -----------ATGTGGAAGCAAATATCAAATGA---------
> Gene 1: ATGACCATACG----C--------------T---ATCGTAGGG
> Gene 2: ATGAGCATACC-----A----T--------T---TTCGAACGC
The data are dengue virus sequences (all serotypes), and the primer sequence is ATGTGGAAGCAAATATCAAATGA.
I tried to use the msa() function and look only at the region of interest. However, this was difficult, because to accurately locate that region the sequences would already need to be aligned.
I also considered cutting out an arbitrary window around that part of the gene and aligning only that, but I could not figure out how to present the result properly, and I thought others might have suggestions for a better way to do it.
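
To make the goal concrete, here is a rough base-R sketch of the output I'm after, run on two made-up toy sequences (not real dengue data). The helper names `best_primer_site` and `diff_view` are placeholders I invented, the flank is shortened to 5 bp so the output fits on a line, and it assumes the primer site contains substitutions only (no indels) and sits on the forward strand, which may not hold for the real data.

```r
# Toy data: made-up sequences, NOT real dengue genomes.
primer <- "ATGTGGAAGCAAATATCAAATGA"
flank  <- 5  # would be about 20 for the real data

seqs <- c(
  seq1 = "CCGTAATGTGGAAGCAAATATCAAATGAGGCTA",  # exact primer site
  seq2 = "TTGACATGTGGCAGCAAATATCAAGTGACCATG"   # two substitutions
)

# Slide the primer along the sequence and return the start position
# with the fewest mismatches (ungapped comparison only).
# Assumes the sequence is at least as long as the primer.
best_primer_site <- function(x, primer) {
  s <- strsplit(x, "")[[1]]
  p <- strsplit(primer, "")[[1]]
  starts <- seq_len(length(s) - length(p) + 1)
  mism <- vapply(
    starts,
    function(i) sum(s[i:(i + length(p) - 1)] != p),
    integer(1)
  )
  c(start = which.min(mism), mismatches = min(mism))
}

# Cut out the primer site plus flanks, and replace every base that
# matches the primer with "-" so only the differences remain visible.
diff_view <- function(x, primer, flank) {
  s <- strsplit(x, "")[[1]]
  p <- strsplit(primer, "")[[1]]
  from <- best_primer_site(x, primer)[["start"]]
  to   <- from + length(p) - 1

  region <- s[from:to]
  region[region == p] <- "-"

  left  <- if (from > 1) s[max(1, from - flank):(from - 1)] else character(0)
  right <- if (to < length(s)) s[(to + 1):min(length(s), to + flank)] else character(0)

  # Pad a short left flank with spaces so the columns stay lined up.
  paste0(
    strrep(" ", flank - length(left)),
    paste(left, collapse = ""),
    paste(region, collapse = ""),
    paste(right, collapse = "")
  )
}

primer_line <- paste0(strrep("-", flank), primer, strrep("-", flank))
views <- vapply(seqs, diff_view, character(1), primer = primer, flank = flank)

cat(sprintf("%-8s%s\n", "primer", primer_line), sep = "")
cat(sprintf("%-8s%s\n", names(views), views), sep = "")
# primer  -----ATGTGGAAGCAAATATCAAATGA-----
# seq1    CCGTA-----------------------GGCTA
# seq2    TTGAC------C------------G---CCATG
```

This checks every possible start position in every sequence, so I expect it to be slow on 5,000 sequences of 10 kb. I suspect something like `matchPattern()` from Biostrings with its `max.mismatch` argument could do the search step faster, but I have not worked that out.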
I am using the Biostrings, msa, and seqinr packages. I download the sequences from NCBI as FASTA files.
Thanks!