Reverse complementary Base

Question

I'm new to R programming and I'm trying to write a program for Reverse and Complementary Base. The objective is to design a DNA primer. So I have a DNA sequence with bases A T C G and A complement to T; T=A;C=G;G=C. I figured out how to reverse it already, but for the Complement I can only make it answer for just 1 base but can't be all of the sequence, and I don't know how to combine the reverse and complement functions. Here is my code and I'm totally confused with it. Can someone help me with this problem? You will be my life savior!

strReverse <- function(x) 

sapply(lapply(strsplit(x, NULL), rev), paste, collapse="")
strReverse(c("ATCGGTCAATCGA"))

complement.base = function(base){ 
  if(base == 'A' | base ==  'a')   print("T") 
  if(base == 'T' | base == 't')         print("A")
  if(base == 'G' | base == 'g')     print("C")
  if(base == 'C' | base == 'c')     print("G")}
complement.base(base="A")

There's an example of a reverse complement function [here](https://fabiomarroni.wordpress.com/2008/11/13/r-function-to-reverse-and-complement-a-dna-sequence/) and [seqinr](http://cran.r-project.org/web/packages/seqinr/index.html) might be useful for you as well. — ping, Mar 09 '15 at 02:06
The [biostrings package](http://www.bioconductor.org/packages/release/bioc/manuals/Biostrings/man/Biostrings.pdf) [PDF] has `reverseComplement` — hrbrmstr, Mar 09 '15 at 02:09

josliber · Answer 1 · 2015-03-09T03:45:20.360

You could use Rcpp to do the operation efficiently:

library(Rcpp)
revComp.rcpp <- cppFunction(
"std::string comp(std::string x) {
  const int n = x.length();
  for (int i=0; i < n; ++i) {
    if (x[i] == 'A' || x[i] == 'a')  x[i] = 'T';
    else if (x[i] == 'T' || x[i] == 't')  x[i] = 'A';
    else if (x[i] == 'G' || x[i] == 'g')  x[i] = 'C';
    else  x[i] = 'G';
  }
  std::reverse(x.begin(), x.end());
  return x;
}")
revComp.rcpp("ATCGGTCAATCGA")
# [1] "TCGATTGACCGAT"

This seems to be somewhat faster than the relevant code from the Biostrings package (testing on a string with 13 million bases):

library(Biostrings)
x <- "ATCGGTCAATCGA"
big.x <- paste(rep(x, 1000000), collapse="")
big.x2 <- DNAString(big.x)
rev.biostr <- function(x) as.character(reverseComplement(x))
all.equal(revComp.rcpp(big.x), as.character(reverseComplement(big.x2)))
# [1] TRUE

library(microbenchmark)
microbenchmark(revComp.rcpp(big.x), as.character(reverseComplement(big.x2)))
# Unit: milliseconds
#                                     expr       min        lq      mean    median        uq      max neval
#                      revComp.rcpp(big.x)  77.21618  78.44534  84.54397  82.21002  87.49367 123.8166   100
#  as.character(reverseComplement(big.x2)) 144.13900 151.12869 170.73765 156.44300 164.41374 399.2948   100

Do you mind adding the `stringi` + `chartr` into the benchmarks? — A5C1D2H2I1M1N2O1R2T1, Mar 09 '15 at 03:48
They are quite comparable. I didn't test the Biostrings version on my system though. — A5C1D2H2I1M1N2O1R2T1, Mar 09 '15 at 04:40
Also adding an "N" option would make it less error prone in case an input has an uknown — Ahdee, Mar 08 '18 at 03:45

A5C1D2H2I1M1N2O1R2T1 · Answer 2 · 2015-03-09T06:02:27.837

I would actually consider using chartr from base R, with a little help from stringi to reverse the result (or the input).

myFun <- function(invec) {
  require(stringi)
  invec <- stri_reverse(invec)
  chartr(old = "AaTtGgCc", new = "TTAACCGG", invec)
}

x <- "ATCGGTCAATCGA"
myFun(x)
# [1] "TCGATTGACCGAT"

Using @josilber's sample data, it's pretty much comparable to his Rcpp approach:

all.equal(myFun(big.x), revComp.rcpp(big.x))
# [1] TRUE

library(microbenchmark)
microbenchmark(myFun(big.x), revComp.rcpp(big.x))
# Unit: milliseconds
#                 expr      min       lq     mean   median       uq      max neval
#         myFun(big.x) 349.5797 352.8197 362.3009 356.4484 362.7197 437.9556   100
#  revComp.rcpp(big.x) 359.5485 363.8615 378.3465 368.3360 386.3734 444.2901   100

Reverse complementary Base

2 Answers2

Related