Questions tagged [sequence-alignment]

A type of problem in which two or more sequences need to be lined up with each other, generally to identify similarities between them. These problems are common in bioinformatics, but the algorithms used to solve them are just as relevant to aligning other types of sequences, such as text strings. A variety of algorithms have been developed for dealing with various subsets of this problem.

Sequence alignment problems are a group of problems in which you have two or more sequences, generally with some potentially similar portions, that you want to line up so that the similar portions of each are associated. This is often an important component of calculating the similarity of the sequences.

Sequence alignment is frequently important in bioinformatics, in which sequences of DNA, RNA, or amino acids must be aligned in order to infer what mutations occurred where and when. However, sequence alignment problems occur in any domain that deals with sequences, such as text matching.

Dynamic programming is the most commonly used technique for aligning two sequences (for multiple sequence alignment, see below). Starting from the first element of each string, each pair of elements is either aligned (if they match) or handled with one of the operations described below (if they don't match). For more detail, see this page.

The exact dynamic programming algorithm used depends on the specific problem at hand. Here are the two most common:

  • Needleman-Wunsch - Global alignment of two sequences (i.e. all letters in both sequences need to be used)
  • Smith-Waterman - Local alignment of two sequences (i.e. only a contiguous portion of each sequence needs to be used)
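
As a rough illustration of the two algorithms listed above, here is a minimal Python sketch. The scoring values (`match`, `mismatch`, `gap`) and the function names are assumptions chosen for the example, not part of any particular library, and a constant per-element gap cost is used for simplicity.

```python
def _sub(x, y, match, mismatch):
    """Score for aligning element x with element y (toy scoring, assumed)."""
    return match if x == y else mismatch


def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment: returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + _sub(a[i - 1], b[j - 1], match, mismatch),
                score[i - 1][j] + gap,   # gap in b
                score[i][j - 1] + gap,   # gap in a
            )
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and score[i][j]
                == score[i - 1][j - 1] + _sub(a[i - 1], b[j - 1], match, mismatch)):
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Local alignment: returns the best score over all pairs of substrings."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                0,  # a local alignment may start fresh at any cell
                score[i - 1][j - 1] + _sub(a[i - 1], b[j - 1], match, mismatch),
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )
            best = max(best, score[i][j])
    return best


if __name__ == "__main__":
    print(needleman_wunsch("ACGT", "AGT"))
    print(smith_waterman_score("XXGATTACAXX", "YYGATTACAYY"))
```

The only structural difference between the two recurrences is the extra `0` option in the local version, which lets an alignment ignore poorly matching prefixes and suffixes.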

Three main operations are generally allowed in sequence alignment. It's easiest to think of these operations as things that might have happened to one of the sequences to turn it into the other sequence:

  • Insertion: An element is inserted into one of the sequences. This is generally represented by adding a gap to the opposite sequence.
  • Deletion: An element is removed from one of the sequences. This is generally represented by adding a gap to that sequence.
  • Mutation/Substitution: An element is replaced with a different element.
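
To make the gap convention above concrete, the following small sketch labels each column of an already-aligned pair of sequences. It assumes `-` is the gap character and reads the alignment as "what happened to the first sequence to produce the second"; the function name `describe_columns` is made up for this example.

```python
def describe_columns(aligned_a, aligned_b, gap="-"):
    """Label each column of a pairwise alignment, reading it as a -> b."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    ops = []
    for x, y in zip(aligned_a, aligned_b):
        if x == gap and y == gap:
            raise ValueError("a column cannot be a gap in both sequences")
        if x == gap:
            ops.append("insertion")     # element exists only in b; gap in a
        elif y == gap:
            ops.append("deletion")      # element was removed from b; gap in b
        elif x == y:
            ops.append("match")
        else:
            ops.append("substitution")
    return ops


if __name__ == "__main__":
    for op in describe_columns("AC-GT", "ATGG-"):
        print(op)
```

Note that insertion and deletion are symmetric: swapping the two arguments turns every insertion into a deletion and vice versa.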

Each of these operations has a cost associated with it that reflects how likely it is to have happened to the original sequence. Mutation/Substitution generally has a different cost for different substitutions (often taken from a BLOSUM or PAM matrix in bioinformatics). Insertion and deletion are accounted for with some sort of gap penalty. In simple implementations, this penalty is often a constant cost per gap, but in bioinformatics an affine gap penalty is often more appropriate.
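
The difference between a constant per-gap cost and an affine gap penalty can be sketched as follows. The parameter values are arbitrary, and the affine form used here (opening cost for the first gapped position, extension cost for each additional one) is one common convention; some tools instead charge the extension cost for every position, including the first.

```python
def linear_gap_cost(length, per_gap=1):
    """Constant cost for every gapped position."""
    return per_gap * length


def affine_gap_cost(length, gap_open=10, gap_extend=1):
    """Opening a gap is expensive; extending an existing gap is cheap.

    Assumed convention: cost = gap_open + gap_extend * (length - 1).
    """
    if length == 0:
        return 0
    return gap_open + gap_extend * (length - 1)


if __name__ == "__main__":
    for length in (1, 2, 5):
        print(length, linear_gap_cost(length, per_gap=10), affine_gap_cost(length))
```

Under the affine model one long gap is cheaper than several short gaps of the same total length, which is the behaviour the affine penalty is meant to capture.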

Multiple Sequence Alignment: Dynamic programming quickly becomes computationally intractable as the number of sequences being aligned increases. For this reason, multiple sequence alignment algorithms generally do not guarantee optimality. A variety of techniques are used:

  • Progressive alignment: In this technique, a series of pairwise alignments are used to create an overall multi-way alignment. Often the order in which these alignments are performed is determined by a hierarchical clustering algorithm like neighbor-joining or UPGMA. A number of tools exist for performing such alignments in bioinformatics, such as the Clustal family and T-Coffee.
  • Heuristic approaches: A wide variety of heuristics can be used for very large scale multiple sequence alignment. BLAST is by far the most popular tool for this in the case of bioinformatics.
  • Hidden Markov models: HMMs can be used to find the most likely alignments for a set of sequences. HMMER is a popular bioinformatics tool for this approach.
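
To give a feel for why exact dynamic programming does not scale to many sequences, this short sketch counts the cells in the full dynamic programming table, assuming the straightforward generalisation in which the table has one dimension per sequence. The helper name `dp_cells` is made up for the example.

```python
import math


def dp_cells(lengths):
    """Number of cells in a full DP table with one axis per sequence."""
    # Each axis has len + 1 entries (the extra one is the empty prefix).
    return math.prod(n + 1 for n in lengths)


if __name__ == "__main__":
    for k in (2, 3, 4, 5):
        print(k, "sequences of length 100:", dp_cells([100] * k), "cells")
```

Each additional sequence of length 100 multiplies the table size by roughly 100, which is why the approaches listed above give up on guaranteed optimality.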
131 questions
2
votes
2 answers

Alignment of multiple (non-biological, discrete state) sequences

I have some data that describes an ordered set of discrete events (or states). There are 34 possible states, which may occur in any order and may repeat. Each sequence of events can contain any number of events, and crucially there are more than 2…
TJC
2
votes
1 answer

Replacing all of instances of a letter in a column of a FASTA alignment file

I am writing a script that replaces all instances of an amino acid residue in a column of a FASTA alignment file. Using AlignIO, I can only read an alignment file and extract information from it; I can't modify its sequences.…
2
votes
3 answers

ungapped index for biopython alignments

My first time using biopython. Forgive me if this is a basic question. I would like to input sequences, then align them, and be able to refer to the index position of the original sequence (ungapped) and the aligned sequence (gapped). My real world…
2
votes
1 answer

Biopython: Local alignment between DNA sequences doesn't find optimal alignment

I'm writing code to find local alignments between two sequences. Here is a minimal, working example I've been working on: from Bio import pairwise2 from Bio.pairwise2 import format_alignment seq1 = "GTGGTCCTAGGC" seq2 = "GCCTAGGACCAC" # scores for…
2
votes
3 answers

Longest subsequence in Prolog

I want to implement a predicate P(Xs,Ys,Zs) where Xs,Ys,Zs are lists. I'm new to Prolog and I can't find a way to get the longest sequence in Xs (for example, Xs = ['b','b','A','A','A','A','b','b']) which is included in Ys (for example Ys =…
aydo000
2
votes
2 answers

Record all optimal sequence alignments when calculating Levenshtein distance in Julia

I'm working on the Levenshtein distance with Wagner–Fischer algorithm in Julia. It would be easy to get the optimal value, but a little hard to get the optimal operation sequence, like insert or deletion, while backtrace from the right down corner…
2
votes
2 answers

BioPython AlignIO ValueError says strings must be same length?

Input fasta-format text file: http://www.jcvi.org/cgi-bin/tigrfams/DownloadFile.cgi?file=/opt/www/www_tmp/tigrfams/fa_alignment_PF00205.txt #!/usr/bin/python from Bio import AlignIO seq_file = open('/path/to/fa_alignment_PF00205.txt') alignment =…
O.rka
2
votes
1 answer

Define own alphabet and perform MultipleSequenceAlignment in biopython

I want to do a MultipleSequenceAlignment in biopython but with a self defined Alphabet. The Background is: My sequences are sequences of numeric states and there are up to 5000 states. Thus I need an alphabet with 5000 letters, e.g. '0001', '0042',…
2
votes
1 answer

Algorithm for global multiple sequence alignment using only indels

I'm writing a Sublime Text script to align several lines of code. The script takes each line, splits it by a predefined set of delimiters (,;:=), and rejoins it with each segment in a 'column' padded to the same width. This works well when all lines…
Thom Smith
2
votes
1 answer

Parallel Sequence Alignment Algorithm

I am working on implementing an efficient sequence alignment algorithm using parallelism in Java. I want to return all the possible positions of the sequence. Can you guys suggest an algorithm for which this is doable? I have looked into the…
2
votes
4 answers

Codon alignment via Python?

I have pairs of coding DNA sequences on which I wish to perform pairwise codon alignments via Python, and I have "half completed" the process. So far, I retrieve pairs of orthologous DNA sequences from GenBank using the Biopython package. I translate the…
2
votes
1 answer

Looping across 10 columns at a time in R

I have a dataframe with 1000 columns. I am trying to loop over 10 columns at a time and use the seqdef() function from the TraMineR package to do sequence alignment across the data in those columns. Hence, I want to apply this function to columns…
histelheim
2
votes
1 answer

Finding path through recursive calls : Optimal String Alignment

So I tried asking this before, but I guess I wasn't really clear enough with what I was looking for. I'm making an optimal string alignment algorithm, it's really just a dynamic programming problem. So I decided to write it recursively. The program…
atb
2
votes
2 answers

Output identical columns from multiple sequence alignment

Hello. I am writing a function to find identical columns of an alignment and then store those columns in a dictionary such that the key is the column (as a string) and the value is a list containing the indexes of the columns. I am having some…
2
votes
1 answer

Protein Sequence Display

I am trying to display protein sequence alignments in a java application for a college research project. I had the idea to use a JTable with a JLabel in each cell to hold the amino acids in the sequence. I need to be able to change the background…
Pete