I am using an algorithm equivalent to the Needleman-Wunsch algorithm to do fuzzy sequence matching with a similarity matrix.
Some of the results are near optimal:
SIL d e: n SIL A+ r t i: k E+ l SIL SIL A+ f t @ SIL b u: @ n @ SIL aU s
- d e: n - - @ t e: k 9 l SIL " A+ f d @ - b 9 A+ n @ SIL aU s
But some are not:
SIL d E+ r SIL I+ n h A+ l t SIL S+ t e: t SIL u:
- - - - - - - z I+ - k - - - - f - -
The problem occurs around deletions and insertions: the algorithm picks single symbols from near the deletion and aligns them with parts of the missing segment, even though they hardly match.
I have already tried to penalize the beginning of gaps, so that the algorithm favors a few large gaps over many small ones. The results were horrible because, as you can see above, gaps of length 1 and 2 are very common in the correctly aligned parts.
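To make clear what I mean by penalizing the beginning of gaps, here is a minimal sketch of affine gap scoring (in the style of Gotoh's three-matrix recurrence). It is not my actual code: the function name, the `sim` callback and the `gap_open` / `gap_extend` values are placeholders chosen for illustration, and it only computes the score, not the alignment.

```python
NEG_INF = float("-inf")

def affine_gap_score(a, b, sim, gap_open=-2.0, gap_extend=-0.5):
    """Best global alignment score with affine gaps (sketch).

    sim(x, y) is a hypothetical similarity lookup. The first symbol of a
    gap costs gap_open, every further symbol of the same gap gap_extend.
    """
    n, m = len(a), len(b)
    # M: a[i-1] paired with b[j-1]
    # X: a[i-1] paired with a gap, Y: b[j-1] paired with a gap
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    X = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0.0
    for i in range(1, n + 1):
        X[i][0] = gap_open + (i - 1) * gap_extend
    for j in range(1, m + 1):
        Y[0][j] = gap_open + (j - 1) * gap_extend

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = sim(a[i - 1], b[j - 1])
            M[i][j] = max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1]) + pair
            # Opening a new gap is expensive, extending an open one is cheap.
            X[i][j] = max(M[i - 1][j] + gap_open,
                          X[i - 1][j] + gap_extend,
                          Y[i - 1][j] + gap_open)
            Y[i][j] = max(M[i][j - 1] + gap_open,
                          Y[i][j - 1] + gap_extend,
                          X[i][j - 1] + gap_open)
    return max(M[n][m], X[n][m], Y[n][m])
```

With a large `gap_open` this prefers one long gap over several short ones, which is exactly what hurts in my data, where short gaps are normal.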
How can I modify the algorithm to avoid these wrong alignments consisting of spread-out letters with bad scores (such as the "f" in "- - - - f - -", which should obviously be just another "-")?
Edit: For those of you who are not familiar with the algorithm: when the scores are calculated, the path that will be taken through the matrix is not yet known, because the path depends on, guess what, the scores.
This means that when calculating the scores I cannot take the neighboring alignments into account, because they are unknown. But whether an alignment is good enough depends on its neighbors: if a pair is a bad fit (remember: I use a similarity matrix filled with probabilities) and is surrounded by gaps, it should get a very bad score (see the second example). If it is surrounded by other, better-fitting pairs, it should get a good score (see the first example).
So I am having a bit of a chicken and egg problem when calculating the scores.
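To illustrate the two phases, here is a minimal sketch of the plain algorithm with a linear gap score. It is a simplified stand-in for what I am running, not my actual code: `sim` is a hypothetical lookup into the similarity matrix and the `gap` value is made up. Every cell is scored in the fill phase, and only afterwards does the traceback decide which cells are actually on the path.

```python
def needleman_wunsch(a, b, sim, gap=-1.0):
    """Global alignment of two symbol lists (sketch, linear gap score)."""
    n, m = len(a), len(b)

    # Fill phase: each cell only sees its three predecessors.
    # At this point nobody knows which cells will end up on the path.
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = score[i - 1][0] + gap
    for j in range(1, m + 1):
        score[0][j] = score[0][j - 1] + gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sim(a[i - 1], b[j - 1]),  # pair
                score[i - 1][j] + gap,                          # gap in b
                score[i][j - 1] + gap,                          # gap in a
            )

    # Traceback phase: only now is the path (and thus each pair's
    # neighborhood) determined, by re-checking which move produced the cell.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0
                and score[i][j] == score[i - 1][j - 1] + sim(a[i - 1], b[j - 1])):
            top.append(a[i - 1]); bottom.append(b[j - 1])
            i -= 1; j -= 1
        elif i > 0 and (j == 0 or score[i][j] == score[i - 1][j] + gap):
            top.append(a[i - 1]); bottom.append("-")
            i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1])
            j -= 1
    return score[n][m], top[::-1], bottom[::-1]
```

The score of a pair such as the lone "f" is added in the fill phase, where it looks locally acceptable; that it ends up isolated between gaps only becomes visible during the traceback, when it is too late to change the score.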