sequence alignment

Question

I have the following question about sequence alignment:

We know that global alignment algorithms are useful when you want to force two sequences to align over their entire length, whereas local alignment finds the region or regions of highest similarity between two sequences and builds the alignment outward from there.

Given one very long sequence and a library of small sequences, what is the best algorithm to find the concatenation of library sequences that minimizes the cost of aligning it with the long sequence?

score 1 · Answer 1 · answered Dec 10 '11 at 15:06

Let ∑ be the alphabet (e.g., {A, C, G, T}). Let L ⊆ ∑* be the set of short library sequences, so that L* is the set of all concatenations of library sequences (including the empty one). Compute a minimum-state DFA (Q, ∑, ∂, q₀, F) for L*.
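One way to obtain a DFA for L* is sketched below: build a trie of the library, let every word end jump back to the root, and determinize by subset construction. This is only an illustration under assumptions. The names `build_star_dfa` and `accepts` are made up for this sketch, the result is not minimized (the answer asks for a minimum-state DFA, which would need a further minimization pass), and the dead state is left implicit as a missing transition.

```python
from collections import deque


def build_star_dfa(library, alphabet):
    """Return (transitions, start, accepting) for a DFA accepting library*.

    Hypothetical helper: states are integers, transitions[q] maps a letter
    to the next state, and a missing letter means "dead state".
    """
    # Trie over the library words; node 0 is the root.
    children = [{}]
    is_word_end = [False]
    for word in library:
        node = 0
        for letter in word:
            if letter not in children[node]:
                children.append({})
                is_word_end.append(False)
                children[node][letter] = len(children) - 1
            node = children[node][letter]
        is_word_end[node] = True

    def close(nodes):
        # Finishing a library word allows the next word to start at the root.
        nodes = set(nodes)
        if any(is_word_end[n] for n in nodes):
            nodes.add(0)
        return frozenset(nodes)

    # Subset construction: each DFA state is a set of trie nodes.
    start = frozenset({0})
    index = {start: 0}
    transitions = [{}]
    accepting = {0}  # the empty concatenation belongs to L*
    queue = deque([start])
    while queue:
        subset = queue.popleft()
        for letter in alphabet:
            target = close(
                children[n][letter] for n in subset if letter in children[n]
            )
            if not target:
                continue  # dead state, left implicit
            if target not in index:
                index[target] = len(transitions)
                transitions.append({})
                if 0 in target:  # at a word boundary, so the input is in L*
                    accepting.add(index[target])
                queue.append(target)
            transitions[index[subset]][letter] = index[target]
    return transitions, 0, accepting


def accepts(dfa, text):
    """Run the DFA on text; a missing transition means rejection."""
    transitions, state, accepting = dfa
    for letter in text:
        if letter not in transitions[state]:
            return False
        state = transitions[state][letter]
    return state in accepting
```

For example, `accepts(build_star_dfa(["AC", "G"], "ACGT"), "ACGGAC")` is true, while the same DFA rejects `"A"`. Leaving the dead state out is harmless here, because no string passing through it can reach an accepting state.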

We scan the long sequence x ∈ ∑* one letter at a time. Let x' be the prefix of x that has been consumed. We maintain, for every state q ∈ Q, the value c_q(x'): the minimum Levenshtein distance between x' and y, taken over every sequence y ∈ ∑* such that ∂(q₀, y) = q (with ∂ extended from letters to strings in the usual way).

For the empty prefix ε, for every state q ∈ Q, it holds that c_q(ε) = min {|y|: y ∈ ∑*, ∂(q₀, y) = q}, since the distance between y and ε is the length of y. In other words, c_q(ε) is the length of a shortest path from q₀ to q, so the initial table can be computed with a breadth-first search from q₀ on the transition graph.
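As a small illustration of the initial table, here is a breadth-first search over a DFA given as a list of per-state transition dictionaries. The representation and the name `initial_costs` are assumptions made for this sketch, not part of the answer.

```python
from collections import deque


def initial_costs(transitions, start):
    """Return c_q(empty prefix) for every state q: the BFS distance from start.

    transitions[q] maps a letter to the next state (assumed representation).
    Unreachable states keep the cost infinity.
    """
    infinity = float("inf")
    cost = [infinity] * len(transitions)
    cost[start] = 0
    queue = deque([start])
    while queue:
        q = queue.popleft()
        for target in transitions[q].values():
            if cost[target] == infinity:  # first visit gives the shortest path
                cost[target] = cost[q] + 1
                queue.append(target)
    return cost


# DFA for {ACG}*: state 0 is the word boundary, 1 = after "A", 2 = after "AC".
acg_cycle = [{"A": 1}, {"C": 2}, {"G": 0}]
print(initial_costs(acg_cycle, 0))  # [0, 1, 2]
```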

Given the table for x' and the next letter s, we compute c_q(x) for every state q, where x = x' s now denotes the extended prefix, as the minimum over the following three cases for y.

Strings y = y' s z, aligning the s's. The cost in this case is min_{q', z: ∂(q', s z) = q} (c_q'(x') + |z|), which can be computed by |Q| breadth-first searches.
Strings y = y', deleting the s in x. The cost in this case is c_q(x') + 1.
Strings y = y' t where t is a letter, substituting s for t (or vice versa). The cost in this case is min_{q', t: ∂(q', t) = q} (c_q'(x') + 1).

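At the end, once all of x has been consumed, the optimal alignment cost is min_{q ∈ F} c_q(x). The alignment can be reconstructed in the usual way for dynamic programs, by recording which case attained each minimum and tracing back from the best accepting state.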
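Putting the steps together, a minimal sketch of the whole dynamic program might look like this. It assumes the DFA is given as a list of per-state transition dictionaries (missing letters lead to an implicit dead state), and it replaces the |Q| breadth-first searches of the first case with a single multi-source Dijkstra pass, which computes the same minimum. The function names are made up for the sketch, and it returns only the cost, not the alignment.

```python
import heapq


def min_alignment_cost(x, transitions, start, accepting):
    """Minimum Levenshtein distance between x and any string the DFA accepts."""
    infinity = float("inf")
    n = len(transitions)

    def spread_insertions(cost):
        # Multi-source shortest paths with unit edges:
        # cost[q] becomes min over q', z with delta(q', z) = q of cost[q'] + |z|.
        heap = [(c, q) for q, c in enumerate(cost) if c < infinity]
        heapq.heapify(heap)
        while heap:
            c, q = heapq.heappop(heap)
            if c > cost[q]:
                continue  # stale heap entry
            for target in transitions[q].values():
                if c + 1 < cost[target]:
                    cost[target] = c + 1
                    heapq.heappush(heap, (c + 1, target))
        return cost

    # Initial table c_q(empty prefix): shortest path lengths from the start.
    cost = [infinity] * n
    cost[start] = 0
    cost = spread_insertions(cost)

    for s in x:
        # Case 1: y = y' s z, aligning the s's, then inserting z.
        matched = [infinity] * n
        for q_prev in range(n):
            target = transitions[q_prev].get(s)
            if target is not None:
                matched[target] = min(matched[target], cost[q_prev])
        matched = spread_insertions(matched)
        # Case 2: y = y', deleting s from x.
        new_cost = [min(matched[q], cost[q] + 1) for q in range(n)]
        # Case 3: y = y' t, substituting t for s.
        for q_prev in range(n):
            for target in transitions[q_prev].values():
                new_cost[target] = min(new_cost[target], cost[q_prev] + 1)
        cost = new_cost

    return min(cost[q] for q in accepting)


# Hand-written DFA for {AC, G}*: state 0 = word boundary, state 1 = after "A".
transitions = [{"A": 1, "G": 0}, {"C": 0}]
print(min_alignment_cost("ACT", transitions, 0, {0}))  # 1 (align with "AC", delete T)
```

Each letter of x costs one Dijkstra pass plus a sweep over all transitions, so the running time is roughly |x| times the size of the automaton, up to a logarithmic factor.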

score 0 · Answer 2 · answered Dec 10 '11 at 12:35

One naive approach would be to try every permutation. If S is the set of all orderings of the small sequences in the library, each ordering concatenated into a single sequence, you could align the large sequence with every sequence in S, one by one, and see which one has the minimum alignment cost. Unfortunately, this is only feasible for tiny libraries: a library of n small sequences has n! orderings, so the size of S grows faster than exponentially in n.
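For very small libraries the naive approach can still serve as a sanity check. The sketch below assumes that each library sequence is used at most once and that orderings of every subset are allowed. That is one reading of "every permutation", not something the question states. The function names are invented for the illustration.

```python
from itertools import permutations


def levenshtein(a, b):
    """Standard edit distance with unit costs, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # match or substitute
        prev = curr
    return prev[-1]


def brute_force_best(x, library):
    """Try every ordering of every subset of the library (factorial time)."""
    best = (len(x), ())  # the empty concatenation costs len(x)
    for size in range(1, len(library) + 1):
        for order in permutations(library, size):
            cost = levenshtein(x, "".join(order))
            if cost < best[0]:
                best = (cost, order)
    return best


print(brute_force_best("ACGT", ["GT", "AC"]))  # (0, ('AC', 'GT'))
```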

