Smith-Waterman algorithm traceback part in R

Question

Goodmorning everyone,

I am trying to implement my own Smith-Waterman algorithm using R programming, but I am having some trouble in the traceback part, that is, given a matrix m storing the scores, I have to select the cell with the maximum value and then going left, up or diagonal in order to select the cells with the highest values, until we met zero.

Given two strings a and b, a match value, a mismatch value and a gap value:

a_splitted <- unlist(strsplit(a, "")) 
b_splitted <- unlist(strsplit(b, "")) 
   
cols <- nchar(a) + 1 
rows <- nchar(b) + 1
   
m <- matrix(0, rows, cols) 

coln <- c("", a_splitted) 
rown<- c("", b_splitted)
colnames(m) <- coln
rownames(m) <- rown

m1 <- m

for (i in 2:nrow(m)){
  for (j in 2:ncol(m)){
    score1 <- m[i,j-1] + gap
    score2 <- m[i-1,j] + gap
    if(colnames(m)[j] == rownames(m)[i]){
      score3 <- m[i-1,j-1] + match 
    } else {
      score3 <- m[i-1,j-1] + mismatch
    } 
       
    scores <- c(score1, score2, score3)
    m[i,j] <- max(scores)
    m1[i,j] <-which(scores == m[i,j])[1]
    if (m[i,j]<0){
      m[i,j] <- 0
    }
    }
  }

' This is what I have done for the traceback part, but it takes hours to run and I still don't know if it really works.

Traceback:

i <- length(a_splitted2)
  j <- length(b_splitted2)
  ax <- character()
  ay <- character()
  while (m1[i -1,j -1] != 0){
    ## case 1: best was seq.x[i] aligned to seq.y[j]
    s <- m1[i-1,j-1]
    if (a_splitted2[i] == b_splitted2[j]) {
      s <- s + match
    } else {
      s <- s + mismatch
    }
    if (s == m1[i,j]) {
      ax <- c(a_splitted2[i], ax)
      ay <- c(b_splitted2[j], ay)
      i <- i - 1
      j <- j - 1
      next
    }
    ## case 2: best was seq.x[i] aligned to -
    if ((m1[i-1,j] + gap) == m1[i,j]) {
      ax <- c(a_splitted2[i], ax)
      ay <- c("-", ay)
      i <- i-1
      next
    }
    ## case 3: best was seq.y[j] aligned to -
    if ((m1[i,j-1] + gap) == m1[i,j]) {
      ax <- c("-", ax)
      ay <- c(b_splitted2[j], ay)
      j <- j-1
      next
    }
  }

Any hint is appreciated. Thank You a lot,

Davide

score 0 · Answer 1 · answered Sep 12 '22 at 19:33

I will give you an idea for the backtracking algorithm here. First I will clarify some things to make it easier to follow for other readers.

Your matrix m is the score-matrix, m1 is the pointer-matrix. The pointer-matrix, as you implemented it, points always to exactly one direction using the following schema for backtracking:
   1 means "go left"
   2 means "go upwards"
   3 means "go upwards left"
Note, that in general it is possible, that going into two different directions (e.g. left and upwards) leads to the same score and therefore both ways are possible, but would lead to different alignments. As a result your implementation will not track all possible ways.

I also wrote the backtracking code just for one of multiple possible optimal alignments. But this can easily be improved by iterating over all maxima in the score-matrix.

Now to the code: In the initialization we get the row and column index of the first field in the cost-matrix with maximum value. This is where we need to start backtracking. Then it starts the actual backtracking in the while-loop as long as we stay in fields that have a score greater zero in the score-matrix, which means some kind of alignment we want.
In every iteration there are the mentioned three possible ways where to go next: left, upwards or upwards left. In cases 1 (left) and 2 (upwards) we need to insert gaps in the alignment-sequences b and a respectively. I used "-" to mark a gap here. In case 3 we have a match or mismatch of both symbols at the respective position. And that's it.

# get row and column of the field with max value
max_idx <- which(m==max(m), arr.ind = TRUE)[1,]

# indices of the current field in backtracking
i <- max_idx["row"]
j <- max_idx["col"]

# alignment sequences
a_align <- c()
b_align <- c()

# actual backtracking
while (m[i,j] > 0) {
  direction <- m1[i,j]
  switch (direction,
    # case 1 - gap in b
    {
      j <- j-1
      a_align <- c(a_splitted[j], a_align)
      b_align <- c("-", b_align)
    },
    # case 2 - gap in a
    {
      i <- i-1
      a_align <- c("-", a_align)
      b_align <- c(b_splitted[i], b_align)
    },
    # case 3 - match / mismatch
    {
      i <- i-1; j <- j-1;
      a_align <- c(a_splitted[j], a_align)
      b_align <- c(b_splitted[i], b_align)
    }
  )
}

# result: an optimal local alignment of the input sequences
a_align
b_align

It looks like your code shares basically the same idea but lecks in some problems in the initialization at the beginning and the condition in the while-loop.

Finally, I add an example result using the following sequences and costs:

a <- "hihelo"
b <- "yellow"

gap <- 0
match <- 1
mismatch <- -1

This leads to the output

> a_align
[1] "e" "l" "-" "o"
> b_align
[1] "e" "l" "l" "o"

And one can very that the score of this alignment is 1+1+0+1=3 as the maximum value in the score-matrix.

Smith-Waterman algorithm traceback part in R

1 Answers1