-1

Is there a way I can find number of variable sites for a pair of nucleotide sequences using R and also which sites are variable ?

For eg.

Seq1=ATGCCCCGTAATGC
Seq2=AGGCAACGTAAAGC

I want the number of variable sites as 5 and the positions 2,3,5,6,12 as variable. Is there a way to do this in R ?

David Arenburg
  • 91,361
  • 17
  • 137
  • 196
Anurag Mishra
  • 1,007
  • 6
  • 16
  • 23

2 Answers2

1

Try Biostrings

library(Biostrings)
s1 <- as.integer(DNAString(Seq1))
s2 <- as.integer(DNAString(Seq2))
which(s1!=s2)
#[1]  2  5  6 12

Benchmarks

 Seq1 <- paste(rep(Seq1,1e6), collapse='')
 Seq2 <- paste(rep(Seq2,1e6), collapse='')

 f1 <- function(){a <- unlist(strsplit(Seq1, ""))
             b <- unlist(strsplit(Seq2, ""))
             which(a!=b)}

 f2 <- function(){s1 <- as.integer(DNAString(Seq1))
             s2 <- as.integer(DNAString(Seq2))
             which(s1!=s2)}

 library(microbenchmark)
 microbenchmark(f1(), f2(), unit='relative', times=20L)
 #Unit: relative
 #expr     min       lq     mean   median       uq      max neval cld
 #f1() 5.06205 5.180931 4.480188 4.633281 4.749703 2.365516    20   b
 #f2() 1.00000 1.000000 1.000000 1.000000 1.000000 1.000000    20  a 

data

Seq1 <- 'ATGCCCCGTAATGC'
Seq2 <- 'AGGCAACGTAAAGC'
akrun
  • 874,273
  • 37
  • 540
  • 662
  • Hi, s1=as.integer(DNAString(Seq1)) gives me the following error : cannot coerce type 'S4' to vector of type 'integer' – Anurag Mishra Nov 14 '14 at 11:19
  • @Anurag Mishra I am using `Biostrings_2.32.1` on `R version 3.1.2`. Please check if it is related to the version. – akrun Nov 14 '14 at 14:15
0

Not sure whether Biostrings have any benefit on memory optimization. You simply use strsplit for short sequences as:

Seq1="ATGCCCCGTAATGC"
Seq2="AGGCAACGTAAAGC"

a <- unlist(strsplit(Seq1, ""))
b <- unlist(strsplit(Seq2, ""))
which(a!=b)
#[1]  2  5  6 12
Beisi Xu
  • 56
  • 5