Questions tagged [sequence-alignment]

A type of problem in which two or more sequences need to be lined up with each other, generally to identify similarities between them. These problems are common in bioinformatics, but the algorithms used to solve them are just as relevant to aligning other types of sequences, such as text strings. A variety of algorithms have been developed for various subsets of this problem.

Sequence alignment problems are a group of problems in which you have two or more sequences, generally with some potentially similar portions, that you want to line up so that the similar portions of each are associated. This is often an important component of calculating the similarity of the sequences.

Sequence alignment is frequently important in bioinformatics, in which sequences of DNA, RNA, or amino acids must be aligned in order to infer what mutations occurred where and when. However, sequence alignment problems occur in every domain that deals with sequences, such as text matching.

Dynamic programming is the most commonly used technique for aligning two sequences (for multiple sequence alignment, see below). Starting from the first element of each string, each pair of elements is either aligned (if they match) or handled with one of the operations described below (if they don't match). For more detail, see this page.

The exact dynamic programming algorithm used depends on the specific problem at hand. Here are the two most common:

  • Needleman-Wunsch - Global alignment of two sequences (i.e. all letters in both sequences need to be used)
  • Smith-Waterman - Local alignment of two sequences (i.e. only a contiguous portion of each sequence needs to be used)
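
To make the difference between the two algorithms concrete, here is a minimal sketch in Python. It assumes a toy scoring scheme (match = +1, mismatch = -1, and a constant penalty of -1 per gap position); the function names and default scores are illustrative choices, not part of any particular library. The global version also traces back through the table to recover one optimal alignment, while the local version only reports the best score.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment. Returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap          # a[:i] aligned against gaps only
    for j in range(1, m + 1):
        score[0][j] = j * gap          # b[:j] aligned against gaps only
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # a[i-1] against a gap in b
                score[i][j - 1] + gap,      # b[j-1] against a gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


def smith_waterman_score(a, b, match=1, mismatch=-1, gap=-1):
    """Local alignment. Returns the best score over all pairs of substrings."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            # The extra 0 lets a local alignment start fresh at any cell.
            cur[j] = max(0, prev[j - 1] + sub, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best


print(needleman_wunsch("ACGT", "AGT"))
print(smith_waterman_score("TTTACGTTT", "GGGACGGGG"))
```

The only structural differences are the initialization of the first row and column and the extra `0` in the local recurrence; both versions run in time proportional to the product of the two sequence lengths.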

Three main operations are generally allowed in sequence alignment. It's easiest to think of these operations as things that might have happened to one of the sequences to turn it into the other sequence:

  • Insertion: An element is inserted into one of the sequences. This is generally represented by adding a gap to the opposite sequence.
  • Deletion: An element is removed from one of the sequences. This is generally represented by adding a gap to that sequence.
  • Mutation/Substitution: An element is replaced with a different element.

Each of these operations has a cost associated with it that reflects how likely it is to have happened to the original sequence. Mutation/Substitution generally has a different cost for different substitutions (in bioinformatics, often taken from a BLOSUM or PAM matrix). Insertion and deletion are accounted for with some sort of gap penalty. In simple implementations, this penalty is often a constant cost per gap, but in bioinformatics an affine gap penalty is often more appropriate.
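
As a rough illustration of how these costs combine, the sketch below scores an already-gapped pairwise alignment. It assumes one common affine convention, in which a run of k consecutive gap characters costs `gap_open + (k - 1) * gap_extend` (other conventions exist), and it uses a hypothetical `toy_substitution` function with made-up values as a stand-in for a real BLOSUM or PAM lookup.

```python
def toy_substitution(x, y):
    """Made-up substitution scores; a real scheme would look up a matrix."""
    return 2 if x == y else -1


def score_alignment(row_a, row_b, substitution=toy_substitution,
                    gap_open=-5, gap_extend=-1):
    """Score two equal-length aligned rows that use '-' for gaps.

    Assumed convention: the first position of a gap run costs gap_open,
    every further position in the same run costs gap_extend.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    total = 0
    gap_in_a = gap_in_b = False   # are we inside a gap run in row_a / row_b?
    for x, y in zip(row_a, row_b):
        if x == "-" and y == "-":
            raise ValueError("a column cannot consist of two gaps")
        if x == "-":
            total += gap_extend if gap_in_a else gap_open
            gap_in_a, gap_in_b = True, False
        elif y == "-":
            total += gap_extend if gap_in_b else gap_open
            gap_in_a, gap_in_b = False, True
        else:
            total += substitution(x, y)
            gap_in_a = gap_in_b = False
    return total


# Affine: four matches (4 * 2) plus one gap of length 2 (-5 + -1).
print(score_alignment("AC--GT", "ACTTGT"))
# Constant cost per gap position: set gap_extend equal to gap_open.
print(score_alignment("AC--GT", "ACTTGT", gap_open=-5, gap_extend=-5))
```

Setting `gap_extend` equal to `gap_open` recovers the constant per-position penalty, which is why the affine form is often described as a generalization: it charges less for one long gap than for several short ones of the same total length.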

Multiple Sequence Alignment: Dynamic programming quickly becomes computationally intractable as the number of sequences being aligned increases. For this reason, multiple sequence alignment algorithms generally do not guarantee optimality. A variety of techniques are used:

  • Progressive alignment: In this technique, a series of pairwise alignments are used to create an overall multi-way alignment. Often the order in which these alignments are performed is determined by a hierarchical clustering algorithm like neighbor-joining or UPGMA. A number of tools exist for performing such alignments in bioinformatics, such as the Clustal family and T-Coffee.
  • Heuristic approaches: A wide variety of heuristics can be used for very large scale multiple sequence alignment. BLAST is by far the most popular tool for this in the case of bioinformatics.
  • Hidden Markov models: HMMs can be used to find the most likely alignments for a set of sequences. HMMER is a popular bioinformatics tool for this approach.
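
To see why exact dynamic programming stops being practical for many sequences, it helps to count the work involved. Extending the pairwise table directly to k sequences of length L gives a k-dimensional table with (L + 1)^k cells, and each cell has up to 2^k - 1 predecessor cells to consider. The small Python sketch below just evaluates those two formulas; the function name is illustrative.

```python
def exact_msa_dp_cost(num_sequences, length):
    """Return (table cells, predecessors per cell) for exact DP alignment
    of num_sequences sequences that all have the given length."""
    cells = (length + 1) ** num_sequences
    predecessors_per_cell = 2 ** num_sequences - 1
    return cells, predecessors_per_cell


for k in (2, 3, 5, 10):
    cells, preds = exact_msa_dp_cost(k, 100)
    print(f"{k} sequences of length 100: {cells} cells, {preds} predecessors each")
```

For two sequences of length 100 the table has 10,201 cells; for three it already has 1,030,301, and the count keeps multiplying by 101 with every additional sequence.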
131 questions
4
votes
2 answers

Multiple Sequence Alignment (Longest Common Subsequence)?

OK this is what I want to do: Get more than two strings and "align" them (no DNA/RNA sequence or the like, just regular strings with not like 1000 items in each of them) I've already done some work with pairwise alignment (align two strings) however…
Dr.Kameleon
3
votes
1 answer

R How to visualize pairwise alignment

How to visualize the complete alignment of two sequences? library(Biostrings) s1…
Prradep
3
votes
2 answers

Errors with the align_local function in R

I am trying to compare two gene sequences: sequence_1 <-…
Murph
3
votes
2 answers

How does the Needleman Wunsch algorithm compare to brute force?

I'm wondering how you can quantify the results of the Needleman-Wunsch algorithm (typically used for aligning nucleotide/protein sequences). Consider some fixed scoring scheme and two sequences of varying length S1 and S2. Say we calculate every…
3
votes
1 answer

Is there an error in Gusfield's description of the dynamic programming algorithm for finding global alignments with constant gap penalty?

Gusfield (Algorithms on Strings, Trees, and Sequences, Section 11.8.6) describes a dynamic programming algorithm for finding the best alignment between two sequences A and B under the assumption that the penalty assigned to a gap of length x in one…
3
votes
1 answer

How to filter alignment columns based on list of position in biopython?

Based on the biopython help page here, I can filter the alignment columns based on first or last 10, I can even piece together subalignment using align[:, :10] + align[:, -10:] align being an MSA object, generated using from Bio import…
msakya
3
votes
2 answers

Smallest list containing all elements from two lists, while preserving order

I am unsure how to combine the items from two lists of integers such that the order of the items is preserved and the resultant list, if concatenated into one integer, is as small as possible. Potentially similar to this question, although the…
The_Unobsequious
3
votes
1 answer

How to refine a python script for a bioinformatics query

I am quite new to python and I would be grateful for some assistance if possible. I am comparing the genomes of two closely related organisms [E_C & E_F] and trying to identify some basic insertions and deletions. I have run a FASTA pairwise…
sheaph
3
votes
2 answers

Sequence alignment with minimum subsequence length constraint

How can I implement sequence alignment with minimum subsequence length constraint. For example let for these inputs minimum sub-sequence length be 3. Using Smith-Waterman gives output like below. ATACGGACC || ||| ATCATAACC But instead I need…
denizeren
3
votes
1 answer

Looking for algorithm to do long pair wise nucleotide alignments

I am trying to scan for possible SNPs and indels by aligning scaffolds to subsequences from a reference genome. (the raw reads are not available). I am using R/bioconductor and the `pairwiseAlignment` function from the Biostrings package. This was…
2
votes
2 answers

sequence alignment

I have the following question about sequence alignment: We know that global alignment algorithms are useful when you want to force two sequences to align over their entire length, and local alignment finds the region or regions of highest similarity…
csuo
2
votes
0 answers

Overlap Alignment (overlaps between error-prone reads)

Find a highest-scoring overlap alignment between two strings. Input: A match score m, a mismatch penalty μ, a gap penalty σ, and two DNA strings s and t. Output: The maximum alignment score of an overlap alignment between s and t followed by an…
2
votes
1 answer

BWA fail to locate the index files

I'm currently working on trying to analyze a dataset. I'm new to the field of bioinformatics and was trying to use BWA tools, however, as soon as I reach bwa mem, I keep running into the same error: input --> mirues-macbook:sra ipmiruek$ bwa mem -t…
Mirue Kang
2
votes
2 answers

How to do multiple sequence alignment of text strings (utf8) in R

Given three strings: seq <- c("abcd", "bcde", "cdef", "af", "cdghi") I would like to do multiple sequence alignment so that I get the following result: abcd bcde cdef a f cd ghi Using the msa() function from the msa package I…
WJH
2
votes
0 answers

How to correctly implement sequence alignment

I am trying to create a sequence alignment program and am repurposing some code I found online but am struggling to find out why my chart outputs correct sometimes, particularly near the end as I always seem to get the correct alignment score, but…