3

How can I implement sequence alignment with minimum subsequence length constraint. For example let for these inputs minimum sub-sequence length be 3. Using Smith-Waterman gives output like below.

ATACGGACC
||    |||
ATCATAACC

But instead I need like below.

---ATACGGACC
   |||   |||
ATCATA---ACC

Is there a know algorithm for this, or do you know an approach?

Thanks in advance.

Donato Szilagyi
  • 4,279
  • 4
  • 36
  • 53
denizeren
  • 934
  • 8
  • 20

2 Answers2

3

Look at how Smith-Waterman works. You have two sequences (S1 of length N, S2 of length M) and you create an NxM matrix (let's call it A) where A(i,j) is "the best score alignment of S1[1..i] and S2[1..j]." To do that, you write a recurrance about how to get A(i,j) based on the last element - A(i,j) is the best of

A(i-1, j-1) + match/mismatch score
A(i,j-1) + indel score
A(i-1,j) + indel score

Here's the basic idea; you may need to tweak it a little.

To do what you're asking, you need two matrices. Let A(i,j) be "the best score alignment of S1[1..i] and S2[1..j] ending with a match" and B(i,j) be "the best score alignment of S1[1..i] and S2[1..j] ending with an indel."

Let's start with B, because it's easier. B(i,j) is the best of

A(i,j-1) + indel score
A(i-1,j) + indel score
B(i,j-1) + indel score
B(i-1,j) + indel score

And A(i,j) is the best of

A(i-1, j-1) + match/mismatch score
B(i-3, j-3) + match/mismatch score for the three
Michael
  • 617
  • 4
  • 3
1

The algorithm effectively returns a 2D table of all possible subsequences. You must do the additional work of extracking the actual subsequences, which apparently you are doing. During alignment (back-)tracking, you can have an admission check:

if(subsequence.length() > 2)
     results.add(subsequence);

If you don't want to go as I mention, then do you mind showing the code you are using for getting the actual subsequences then maybe we can show you where to edit?

kasavbere
  • 5,873
  • 14
  • 49
  • 72