0

I'm new to R programming and I'm trying to write a program for Reverse and Complementary Base. The objective is to design a DNA primer. So I have a DNA sequence with bases A T C G and A complement to T; T=A;C=G;G=C. I figured out how to reverse it already, but for the Complement I can only make it answer for just 1 base but can't be all of the sequence, and I don't know how to combine the reverse and complement functions. Here is my code and I'm totally confused with it. Can someone help me with this problem? You will be my life savior!

strReverse <- function(x) 

sapply(lapply(strsplit(x, NULL), rev), paste, collapse="")
strReverse(c("ATCGGTCAATCGA"))

complement.base = function(base){ 
  if(base == 'A' | base ==  'a')   print("T") 
  if(base == 'T' | base == 't')         print("A")
  if(base == 'G' | base == 'g')     print("C")
  if(base == 'C' | base == 'c')     print("G")}
complement.base(base="A")
Avery
  • 2,270
  • 4
  • 33
  • 35
  • use `ifelse` with `%in%`. – Metrics Mar 09 '15 at 02:04
  • 1
    There's an example of a reverse complement function [here](https://fabiomarroni.wordpress.com/2008/11/13/r-function-to-reverse-and-complement-a-dna-sequence/) and [seqinr](http://cran.r-project.org/web/packages/seqinr/index.html) might be useful for you as well. – ping Mar 09 '15 at 02:06
  • 1
    The [biostrings package](http://www.bioconductor.org/packages/release/bioc/manuals/Biostrings/man/Biostrings.pdf) [PDF] has `reverseComplement` – hrbrmstr Mar 09 '15 at 02:09

2 Answers2

2

You could use Rcpp to do the operation efficiently:

library(Rcpp)
revComp.rcpp <- cppFunction(
"std::string comp(std::string x) {
  const int n = x.length();
  for (int i=0; i < n; ++i) {
    if (x[i] == 'A' || x[i] == 'a')  x[i] = 'T';
    else if (x[i] == 'T' || x[i] == 't')  x[i] = 'A';
    else if (x[i] == 'G' || x[i] == 'g')  x[i] = 'C';
    else  x[i] = 'G';
  }
  std::reverse(x.begin(), x.end());
  return x;
}")
revComp.rcpp("ATCGGTCAATCGA")
# [1] "TCGATTGACCGAT"

This seems to be somewhat faster than the relevant code from the Biostrings package (testing on a string with 13 million bases):

library(Biostrings)
x <- "ATCGGTCAATCGA"
big.x <- paste(rep(x, 1000000), collapse="")
big.x2 <- DNAString(big.x)
rev.biostr <- function(x) as.character(reverseComplement(x))
all.equal(revComp.rcpp(big.x), as.character(reverseComplement(big.x2)))
# [1] TRUE

library(microbenchmark)
microbenchmark(revComp.rcpp(big.x), as.character(reverseComplement(big.x2)))
# Unit: milliseconds
#                                     expr       min        lq      mean    median        uq      max neval
#                      revComp.rcpp(big.x)  77.21618  78.44534  84.54397  82.21002  87.49367 123.8166   100
#  as.character(reverseComplement(big.x2)) 144.13900 151.12869 170.73765 156.44300 164.41374 399.2948   100
josliber
  • 43,891
  • 12
  • 98
  • 133
1

I would actually consider using chartr from base R, with a little help from stringi to reverse the result (or the input).

myFun <- function(invec) {
  require(stringi)
  invec <- stri_reverse(invec)
  chartr(old = "AaTtGgCc", new = "TTAACCGG", invec)
}

x <- "ATCGGTCAATCGA"
myFun(x)
# [1] "TCGATTGACCGAT"

Using @josilber's sample data, it's pretty much comparable to his Rcpp approach:

all.equal(myFun(big.x), revComp.rcpp(big.x))
# [1] TRUE

library(microbenchmark)
microbenchmark(myFun(big.x), revComp.rcpp(big.x))
# Unit: milliseconds
#                 expr      min       lq     mean   median       uq      max neval
#         myFun(big.x) 349.5797 352.8197 362.3009 356.4484 362.7197 437.9556   100
#  revComp.rcpp(big.x) 359.5485 363.8615 378.3465 368.3360 386.3734 444.2901   100
A5C1D2H2I1M1N2O1R2T1
  • 190,393
  • 28
  • 405
  • 485