I'm trying to implement an algorithm in R that loops through the rows and columns of a matrix and, for every cell, computes a value based on the values of previously computed cells.

Here is the code that does this; it is part of the Needleman-Wunsch algorithm:

    globalSequenceAlignment <- function(seq1, seq2, match, mismatch, gap) {

      # splitting the sequences in order to use them as row and column names
      seq1_split <- unlist(strsplit(toString(seq1), ""))
      seq2_split <- unlist(strsplit(toString(seq2), ""))

      len1 <- length(seq1_split)
      len2 <- length(seq2_split)

      # creating the alignment matrix
      alignment_matrix <- matrix(0, nrow = len2 + 1, ncol = len1 + 1)
      colnames(alignment_matrix) <- c("-", seq1_split)
      rownames(alignment_matrix) <- c("-", seq2_split)

      # filling first row and column of the alignment matrix
      for (i in 2:ncol(alignment_matrix)) {
        alignment_matrix[1, i] <- (alignment_matrix[1, i] + (i - 1)) * (gap)
      }

      for (j in 2:nrow(alignment_matrix)) {
        alignment_matrix[j, 1] <- (alignment_matrix[j, 1] + (j - 1)) * (gap)
      }

      # filling the remaining cells; each one depends on its left, upper
      # and upper-left neighbours
      for (i in 2:ncol(alignment_matrix)) {
        for (j in 2:nrow(alignment_matrix)) {

          horizontal_score <- alignment_matrix[j, i - 1] + gap
          vertical_score <- alignment_matrix[j - 1, i] + gap

          if (colnames(alignment_matrix)[i] == rownames(alignment_matrix)[j]) {
            diagonal_score <- alignment_matrix[j - 1, i - 1] + match
          } else {
            diagonal_score <- alignment_matrix[j - 1, i - 1] + mismatch
          }

          scores <- c(horizontal_score, vertical_score, diagonal_score)

          alignment_matrix[j, i] <- max(scores)

        }
      }

      return(alignment_matrix)
    }

    a <- 'GAATC'
    b <- 'CATACG'

    globalSequenceAlignment(a, b, 10, -5, -4)

Using this code I get the result that I want. The problem is that for matrices larger than 500x500 the nested loops become far too slow (running this code on a 500x500 matrix takes about 2 minutes).

I know that *apply functions could improve this, but I couldn't manage to use them, because computing each cell requires that the previous ones have already been computed.

I was wondering if there is a way to achieve the same result using *apply functions, or a way to vectorize this type of code so that it runs faster in R.

Marco
  • Sorry for the slow response. I edited the question to include a function that produces the result I need; the crucial, slow part is only the nested loop. – Marco Jul 23 '22 at 18:14
  • Hi. FYI, the Bioconductor Biostrings package implements Needleman-Wunsch in its `pairwiseAlignment` function; see the PairwiseAlignments vignette: https://bioconductor.org/packages/release/bioc/vignettes/Biostrings/inst/doc/PairwiseAlignments.pdf – M.Viking Jul 23 '22 at 20:55
  • And the CRAN ftrCOOL package has a `needleman` function: https://cran.r-project.org/web/packages/ftrCOOL/ftrCOOL.pdf It might give you what you want, or its source code might give you ideas. – M.Viking Jul 23 '22 at 20:55
  • Yes, I know there are already functions for the Needleman-Wunsch algorithm; I am trying to reimplement it for a project. I'll look at the source code of ftrCOOL's `needleman` function, thank you! EDIT: I looked at it, but the source code is almost the same as mine, so I think the performance issue would be the same. – Marco Jul 23 '22 at 22:34
  • I've tried a few ideas, and none panned out: `compiler::cmpfun()` was no faster; a simplified loop was no faster; `ifelse` was slower; and an attempt with `outer()` and an anonymous function gave incorrect results. – M.Viking Jul 25 '22 at 01:46
  • Thanks for trying. I'm starting to think that the only way to speed this up "in R" is to use Rcpp and loop through the matrix with C++ for loops. I may try this as a last resort, since I don't know C++. – Marco Jul 25 '22 at 07:06

1 Answer

In case anyone ever needs this, I wrote my own solution to this problem using the Rcpp package. The runtime for sequences of 500 characters went from about 3 minutes to about 0.3 s.

Below is the code for the two nested loops shown in the question; I hope it will be useful to someone.

    library(Rcpp)

    # cppFunction() compiles the C++ source and exposes it as an R function
    cppFunction('IntegerMatrix rcpp_compute_matrices(IntegerMatrix Am, StringMatrix Dm,
                                                     StringVector seq1, StringVector seq2,
                                                     int gap, int miss, int match) {

        // Am is the alignment matrix with its first row and column already filled.
        // Dm is not used in this excerpt.
        int nrow = Am.nrow(), ncol = Am.ncol();

        // Indices are 0-based here, so the loops start at the second row and column.
        for (int i = 1; i < nrow; i++) {
          for (int j = 1; j < ncol; j++) {
            int vertical_score = Am(i-1, j) + gap;
            int horizontal_score = Am(i, j-1) + gap;
            int diagonal_score = 0;
            // seq1 runs along the columns, seq2 along the rows
            if (seq1[j-1] == seq2[i-1]) {
              diagonal_score = Am(i-1, j-1) + match;
            }
            else {
              diagonal_score = Am(i-1, j-1) + miss;
            }

            IntegerVector score = {vertical_score, horizontal_score, diagonal_score};

            int max_score = max(score);

            Am(i, j) = max_score;
          }
        }
        return Am;
    }')
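
For readers who would rather stay in pure R, one possible alternative is to fill the matrix one anti-diagonal at a time. This is not part of the Rcpp solution above and has not been benchmarked here. The idea is that every cell with the same row + column sum depends only on cells from the two previous anti-diagonals, so a whole anti-diagonal can be computed with one vectorized `pmax()` call. The names `nw_antidiagonal`, `H` and `sub` are chosen for illustration only, and unlike the function in the question the returned matrix has no row or column names.

```r
# Sketch: Needleman-Wunsch scoring matrix filled anti-diagonal by anti-diagonal.
# Assumes seq1 and seq2 are single character strings.
nw_antidiagonal <- function(seq1, seq2, match, mismatch, gap) {
  s1 <- strsplit(seq1, "")[[1]]  # runs along the columns
  s2 <- strsplit(seq2, "")[[1]]  # runs along the rows
  n1 <- length(s1)
  n2 <- length(s2)

  # first row and column hold cumulative gap penalties
  H <- matrix(0, nrow = n2 + 1, ncol = n1 + 1)
  H[1, ] <- (0:n1) * gap
  H[, 1] <- (0:n2) * gap
  if (n1 == 0 || n2 == 0) return(H)

  # sub[j, i] is the match/mismatch score for s2[j] versus s1[i]
  sub <- ifelse(outer(s2, s1, "=="), match, mismatch)

  # d = row index + column index identifies one anti-diagonal of interior cells
  for (d in 4:(n1 + n2 + 2)) {
    j <- max(2, d - n1 - 1):min(n2 + 1, d - 2)  # rows on this anti-diagonal
    i <- d - j                                  # matching columns

    diagonal_score   <- H[cbind(j - 1, i - 1)] + sub[cbind(j - 1, i - 1)]
    vertical_score   <- H[cbind(j - 1, i)] + gap
    horizontal_score <- H[cbind(j, i - 1)] + gap

    # all cells of the anti-diagonal are updated in one vectorized step
    H[cbind(j, i)] <- pmax(diagonal_score, vertical_score, horizontal_score)
  }
  H
}

H <- nw_antidiagonal("GAATC", "CATACG", match = 10, mismatch = -5, gap = -4)
H[nrow(H), ncol(H)]  # score of the optimal global alignment
```

The outer loop still runs in R, but it executes only about `n1 + n2` times instead of `n1 * n2` times, which is where any speed-up would come from.
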
Marco