I am using an algorithm equivalent to the Needleman-Wunsch algorithm to do fuzzy sequence matching with a similarity matrix.

Some of the results are near optimal:

SIL d   e:  n   SIL A+  r   t   i:  k   E+  l   SIL SIL A+  f   t   @   SIL b   u:  @   n   @   SIL aU  s
-   d   e:  n   -   -   @   t   e:  k   9   l   SIL "   A+  f   d   @   -   b   9   A+  n   @   SIL aU  s

But some are not:

SIL d   E+  r   SIL I+  n   h   A+  l   t   SIL S+  t   e:  t   SIL u:
-   -   -   -   -   -   -   z   I+  -   k   -   -   -   -   f   -   - 

The problem occurs around deletions and insertions: the algorithm takes single symbols from near the deletion and aligns them with the missing parts, even though they hardly match.

I have already tried to penalize the beginning of gaps, so that the algorithm favors large gaps over small ones. The results were horrible, because as you can see above, gaps of length 1 and 2 are very common in the correctly aligned parts.

How can I modify the algorithm to avoid these wrong alignments consisting of spread-out letters with bad scores (such as the f in - - - - f - -, which should obviously be just another -)?

Edit: For those of you who are not familiar with the algorithm: when the scores are calculated, the path that will be taken is not yet known, because the path itself depends on the scores.

This means that when calculating the scores I cannot take the neighboring alignments into account, because they are unknown. But whether an alignment is good enough depends on its neighbors: if a pair is a bad fit (remember: I use a similarity matrix filled with probabilities) and is surrounded by gaps, it should get a very bad score (see the second example). If it is surrounded by other, better-fitting pairs, it should get a good score (see the first example).

So I am having a bit of a chicken and egg problem when calculating the scores.
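
For reference, the recurrence under discussion can be sketched roughly as follows. This is a minimal illustration, not the code that produced the alignments above: the names `align` and `toy_sim`, the linear gap penalty, and the toy scores (2 for a match, -3 for a mismatch, -1 per gap) are all assumptions made for the example. It shows why a cell's score cannot see its neighbors on the final path: each cell is computed only from three already-filled predecessor cells.

```python
def toy_sim(x, y):
    # Hypothetical similarity function: stands in for a real similarity matrix.
    return 2.0 if x == y else -3.0


def align(a, b, sim, gap=-1.0):
    """Globally align sequences a and b; return (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning the prefixes a[:i] and b[:j]
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = score[i - 1][0] + gap
    for j in range(1, m + 1):
        score[0][j] = score[0][j - 1] + gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sim(a[i - 1], b[j - 1]),  # pair the two symbols
                score[i - 1][j] + gap,                          # a[i-1] opposite a gap
                score[i][j - 1] + gap,                          # b[j-1] opposite a gap
            )
    # Traceback: the path only becomes known after the whole table is filled.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sim(a[i - 1], b[j - 1]):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and (j == 0 or score[i][j] == score[i - 1][j] + gap):
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], out_a[::-1], out_b[::-1]


result = align(["d", "e:", "n"], ["d", "n"], toy_sim)
print(result)  # (3.0, ['d', 'e:', 'n'], ['d', '-', 'n'])
```

In this sketch the pair score `sim(a[i-1], b[j-1])` is a fixed number per symbol pair, so making it depend on whether the pair ends up surrounded by gaps would need extra state in the table, which the plain recurrence does not carry.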

Zotta
  • It's unclear what you want to do. For example, the algorithm cannot simply change the `f` in your last example to a `-`: that `f` must be in the second string, so the algorithm has to put it *somewhere*. – j_random_hacker May 31 '15 at 17:34
  • @j_random_hacker: The examples are both part of the same alignment of ~2200 items in each sequence. Also, it is not required that everything is aligned to something. If no good alignment is found, the algorithm simply inserts a `-` in the upper sequence and puts the symbol underneath. The actual problem is that the algorithm assigns the pair of f and t a better score than a gap in each sequence. – Zotta May 31 '15 at 18:28
  • If you want to put the `f` opposite a `-` in that example, then the solution is very simple: ensure that the gap penalty is lower than every mismatch penalty. (For simplicity this assumes no gap-begin penalties -- if they are present, it suffices to ensure that the total cost of opening a gap is less than every mismatch penalty.) – j_random_hacker May 31 '15 at 18:32
  • @j_random_hacker: I tried that already. The results are terrible. The f will indeed be paired up with a -. However, lots of other bad fits which are in between good fits will be converted to gaps, too. That is bad. I basically need the scores to depend on context. Read the edit I made to the original post. – Zotta May 31 '15 at 18:43
  • I'm very familiar with the algorithm, but I don't understand what you mean by "calculating the scores" in your edit. By "scores", do you mean the score of an optimal alignment, which is calculated by the DP from the matrix of pairwise similarity scores? Or do you mean the calculation of the pairwise similarity score matrix itself? – j_random_hacker May 31 '15 at 18:48
  • When I am doing this: `match_score = score_matrix[x-1][y-1] + similarity[list1[x]][list2[y]]` ... I want to take into account both the preceding and the following path. This is obviously not possible, so I need another way around that. – Zotta May 31 '15 at 19:16
  • I think that's a red herring -- there's really no inherent directionality in the NW algorithm (e.g., if you reverse both strings, you will get an alignment of exactly equal score -- and if it happens there is only one alignment with this optimal score, you will get the exact same alignment, but reversed). But I think I see the general problem you have. If you want to automatically classify stretches of "good matches", alignment is not a powerful enough tool. In that case, I'd suggest looking into hidden Markov models, which explicitly model this idea. – j_random_hacker May 31 '15 at 20:28
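
Two points from the comments above can be checked with a small score-only sketch. The names `nw_score` and `toy_sim`, the linear gap penalty, and the toy scores are assumptions for illustration, not the poster's code. With scores where higher is better, replacing a mismatched pair costs two gaps (one in each sequence), so in this sketch the pair disappears once twice the gap score exceeds the mismatch score. The last line checks the reversal property: reversing both sequences leaves the optimal score unchanged.

```python
def toy_sim(x, y):
    # Hypothetical similarity function: 2 for a match, -3 for any mismatch.
    return 2 if x == y else -3


def nw_score(a, b, sim, gap):
    """Optimal global alignment score with a linear gap penalty, keeping two rows."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i, x in enumerate(a, start=1):
        cur = [i * gap]
        for j, y in enumerate(b, start=1):
            cur.append(max(prev[j - 1] + sim(x, y),  # pair x with y
                           prev[j] + gap,            # x opposite a gap
                           cur[j - 1] + gap))        # y opposite a gap
        prev = cur
    return prev[-1]


# Two gaps (2 * -1 = -2) beat the mismatch (-3): "t" and "f" are not paired.
print(nw_score(["t"], ["f"], toy_sim, gap=-1))  # -2
# Two gaps (2 * -2 = -4) lose to the mismatch (-3): "t" and "f" are paired.
print(nw_score(["t"], ["f"], toy_sim, gap=-2))  # -3

# Reversing both sequences gives the same optimal score.
top = "SIL d E+ r SIL I+ n h A+ l t".split()
bottom = "z I+ k f".split()
print(nw_score(top, bottom, toy_sim, -1) == nw_score(top[::-1], bottom[::-1], toy_sim, -1))  # True
```

As the thread notes, a single global threshold like this also turns mismatches inside otherwise good stretches into gaps, which is the behavior the question is trying to avoid.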

0 Answers