2

I have the following question about sequence alignment:

We know that global alignment algorithms are useful when you want to force two sequences to align over their entire length, whereas local alignment finds the region or regions of highest similarity between two sequences and builds the alignment outward from there.

Suppose we have one very long sequence and a library of short sequences. What is the best algorithm for finding the concatenation of library sequences that minimizes the alignment cost against the long sequence?

seaotternerd
csuo

2 Answers

1

Let ∑ be the alphabet (e.g., {A, C, G, T}). Let L ⊆ ∑* be the set of short library sequences. Compute a minimum-state DFA (Q, ∑, ∂, q0, F) for L*.
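
One way to obtain a DFA for L* is subset construction over a trie of the library. The sketch below is only an illustration under assumptions: the function name `build_star_dfa` and its return shape are made up here, the resulting DFA is not minimized (so it may have more states than the minimum-state DFA described above), and transitions into the dead state are simply left out.

```python
from collections import deque

def build_star_dfa(library, alphabet):
    """Return (delta, start, accepting) for a DFA recognising library*.

    States are integers and delta[state][letter] is the next state.
    Transitions that could never lead to acceptance are omitted.
    """
    # Trie over the library words; node 0 is the root.
    children = [{}]
    word_end = [False]
    for word in library:
        node = 0
        for ch in word:
            if ch not in children[node]:
                children.append({})
                word_end.append(False)
                children[node][ch] = len(children) - 1
            node = children[node][ch]
        word_end[node] = True

    def close(nodes):
        # Finishing a word allows the next word to start: add the root.
        nodes = set(nodes)
        if any(word_end[n] for n in nodes):
            nodes.add(0)
        return frozenset(nodes)

    start_set = close({0})
    ids = {start_set: 0}
    delta = [{}]
    accepting = set()
    queue = deque([start_set])
    while queue:
        current = queue.popleft()
        if 0 in current:  # root present: a whole number of words was read
            accepting.add(ids[current])
        for ch in alphabet:
            target = close(children[n][ch] for n in current if ch in children[n])
            if not target:
                continue  # dead state, left implicit
            if target not in ids:
                ids[target] = len(delta)
                delta.append({})
                queue.append(target)
            delta[ids[current]][ch] = ids[target]
    return delta, 0, accepting
```

For example, `build_star_dfa(["AC", "G"], "ACGT")` yields a DFA that accepts the empty string, "G", "ACG", "GAC", and so on, but not "A" or "T".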

We scan the long sequence x ∈ ∑* one letter at a time. Let x' be the prefix of x that has been consumed. We maintain, for every state q ∈ Q, the value c_q(x'): the minimum Levenshtein distance between x' and y, taken over every sequence y ∈ ∑* such that ∂(q0, y) = q.

For the empty prefix ε, for every state q ∈ Q, it holds that c_q(ε) = min {|y| : y ∈ ∑*, ∂(q0, y) = q}, since the distance between y and ε is the length of y. Compute the initial table with breadth-first search on the transition graph.
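
The initial table is just a shortest-path computation from q0 in which every transition costs 1. A minimal sketch, assuming the DFA is stored as a mapping `delta[state][letter] -> state` (a representation chosen here for illustration; `initial_costs` is a hypothetical name):

```python
from collections import deque

def initial_costs(delta, start):
    """Breadth-first search: fewest letters needed to reach each state."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        q = queue.popleft()
        for nxt in delta[q].values():
            if nxt not in dist:  # first visit is the shortest in BFS
                dist[nxt] = dist[q] + 1
                queue.append(nxt)
    return dist  # states missing from dist are unreachable

# Toy DFA for {AC, G}*: state 0 is the start, state 1 means "A was just read".
toy_delta = {0: {"A": 1, "G": 0}, 1: {"C": 0}}
table = initial_costs(toy_delta, 0)
```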

Given the table for x' and the next letter s, we compute c_q(x' s) for the extended prefix x' s as the minimum over several possibilities for y.

  1. Strings y = y' s z, aligning the s's. The cost in this case is min_{q', z : ∂(q', s z) = q} (c_{q'}(x') + |z|), which can be computed by |Q| breadth-first searches.

  2. Strings y = y', deleting the s in x. The cost in this case is c_q(x') + 1.

  3. Strings y = y' t where t is a letter, substituting s for t (or vice versa). The cost in this case is min_{q', t : ∂(q', t) = q} (c_{q'}(x') + 1).

At the end, the optimal alignment cost is min_{q ∈ F} c_q(x). The alignment can be reconstructed in the usual way for dynamic programs.
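
The whole scan can be sketched in a few lines. This is a variant of the update above rather than a literal transcription: matches and substitutions are handled in one pass over the DFA edges, and the trailing inserted letters (the z in case 1) are handled by a shortest-path relaxation over unit-cost edges instead of |Q| separate breadth-first searches. The function name, the `delta[state][letter] -> state` representation, and the requirement that every state appears as a key of `delta` are assumptions made for this example. It returns only the cost, not the alignment.

```python
import heapq

INF = float("inf")

def min_alignment_cost(x, delta, start, accepting):
    """Minimum Levenshtein distance between x and any string the DFA accepts.

    delta maps every state to {letter: next_state}; missing letters mean
    "no transition". States must be mutually comparable (e.g. integers).
    """
    def relax_insertions(cost):
        # Following a DFA edge without consuming x inserts one letter (cost 1).
        heap = [(c, q) for q, c in cost.items() if c < INF]
        heapq.heapify(heap)
        while heap:
            c, q = heapq.heappop(heap)
            if c > cost[q]:
                continue  # stale heap entry
            for nxt in delta[q].values():
                if c + 1 < cost[nxt]:
                    cost[nxt] = c + 1
                    heapq.heappush(heap, (c + 1, nxt))
        return cost

    states = list(delta)
    cost = {q: INF for q in states}
    cost[start] = 0
    cost = relax_insertions(cost)  # table for the empty prefix
    for s in x:
        # Case 2: delete s from x.
        new = {q: cost[q] + 1 for q in states}
        # Cases 1 and 3: consume s against an edge labelled t.
        for q_prev in states:
            for t, q in delta[q_prev].items():
                step = cost[q_prev] + (0 if t == s else 1)
                if step < new[q]:
                    new[q] = step
        cost = relax_insertions(new)  # letters of y inserted after s
    return min(cost[q] for q in accepting)

# Toy DFA for {AC, G}*: state 0 is start and accepting, state 1 follows "A".
toy_delta = {0: {"A": 1, "G": 0}, 1: {"C": 0}}
```

With the toy DFA, `min_alignment_cost("ACG", toy_delta, 0, {0})` is 0 because "ACG" is itself a concatenation of library sequences, while "ACT" costs 1 (for instance, delete the T). Each letter of x costs roughly one pass over the DFA edges plus one relaxation, so the running time grows linearly with the length of x for a fixed DFA.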

Per
0

One naive approach would be to try every ordering. If S is the set of concatenations obtained from every permutation of the small sequences in the library, you could align the large sequence with every sequence in S, one by one, and see which one has the minimum alignment cost. Unfortunately, this won't be CPU-friendly: with n small sequences, S can contain up to n! concatenations, which grows faster than exponentially in n.
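
As a rough sketch of this brute-force idea, assuming the alignment cost is plain Levenshtein distance and that each library sequence is used exactly once (both are assumptions; the function names are hypothetical):

```python
from itertools import permutations

def levenshtein(a, b):
    """Classic row-by-row edit distance with unit costs."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # match or substitute
        prev = cur
    return prev[-1]

def best_ordering(long_seq, library):
    """Try every ordering of the library; return (cost, ordering)."""
    best = None
    for order in permutations(library):
        cost = levenshtein(long_seq, "".join(order))
        if best is None or cost < best[0]:
            best = (cost, order)
    return best
```

This is only practical for a handful of library sequences, since the loop runs once per permutation.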

Murat Derya Özen