Questions tagged [sequence-alignment]

A type of problem in which two or more sequences need to be lined up with each other, generally for the purpose of identifying similarities between them. These problems are common in bioinformatics, but the algorithms used to solve them are just as relevant to aligning other types of sequences, such as text strings. A variety of algorithms have been developed for dealing with various subsets of this problem.

Sequence alignment problems are a group of problems in which you have two or more sequences, generally with some potentially similar portions, that you want to line up so that the similar portions of each are associated. This is often an important component of calculating the similarity of the sequences.

Sequence alignment is frequently important in bioinformatics, in which sequences of DNA, RNA, or amino acids must be aligned in order to infer what mutations occurred where and when. However, sequence alignment problems occur in all domains in which there are sequences, such as in text matching.

Dynamic programming is the most commonly used technique for aligning two sequences (for multiple sequence alignment, see below). Starting from the first element of each string, each pair of elements is either aligned (if they match) or handled with one of the operations described below (if they don't match). For more detail, see this page.

The exact dynamic programming algorithm used depends on the specific problem at hand. Here are the two most common:

  • Needleman-Wunsch - Global alignment of two sequences (i.e. all letters in both sequences need to be used)
  • Smith-Waterman - Local alignment of two sequences (i.e. only a subsequence of each string needs to be used)
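
To make the difference between the two concrete, here is a minimal sketch of both recurrences that computes only the alignment score, not the alignment itself. The function names and the scoring values (match = 1, mismatch = -1, gap = -1) are assumptions chosen for illustration, not part of either algorithm's definition. Scores here are "higher is better", and the local version assumes the aligned region of each sequence is contiguous.

```python
def global_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch style score: every element of a and b is used."""
    # Row 0: aligning the empty prefix of a with b[:j] costs j gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # column 0: a[:i] against the empty prefix of b
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(prev[j - 1] + sub,   # align a[i-1] with b[j-1]
                            prev[j] + gap,       # a[i-1] against a gap in b
                            curr[j - 1] + gap))  # b[j-1] against a gap in a
        prev = curr
    return prev[-1]


def local_score(a, b, match=1, mismatch=-1, gap=-1):
    """Smith-Waterman style score: best-scoring pair of contiguous regions."""
    best = 0
    prev = [0] * (len(b) + 1)  # an alignment may start anywhere at no cost
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            # The extra 0 lets the alignment restart instead of going negative.
            cell = max(0, prev[j - 1] + sub, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)  # an alignment may also end anywhere
        prev = curr
    return best
```

With these assumed scores, `global_score("AC", "AGC")` has to pay for the unmatched `G`, while `local_score("XXGATTACAXX", "YYGATTACAYY")` ignores the flanking characters and scores only the shared `GATTACA`.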

Three main operations are generally allowed in sequence alignment. It's easiest to think of these operations as things that might have happened to one of the sequences to turn it into the other sequence:

  • Insertion: An element is inserted into one of the sequences. This is generally represented by adding a gap to the opposite sequence.
  • Deletion: An element is removed from one of the sequences. This is generally represented by adding a gap to that sequence.
  • Mutation/Substitution: An element is replaced with a different element.
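
As an illustration of how these operations show up in a finished alignment, the sketch below labels each column of two already-aligned strings. It assumes the alignment is read as "what happened to the first sequence to produce the second" and that `-` marks a gap; the function name `describe_columns` is hypothetical.

```python
def describe_columns(aligned_a, aligned_b, gap="-"):
    """Label each column of a pairwise alignment, reading a -> b."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    labels = []
    for x, y in zip(aligned_a, aligned_b):
        if x == gap and y == gap:
            raise ValueError("a column cannot be a gap in both sequences")
        if x == gap:
            labels.append("insertion")     # b gained an element; gap is in a
        elif y == gap:
            labels.append("deletion")      # b lost an element; gap is in b
        elif x == y:
            labels.append("match")
        else:
            labels.append("substitution")  # element replaced by another one
    return labels


print(describe_columns("GAT-ACA", "GCTTA-A"))
```

Swapping the two arguments turns every insertion into a deletion and vice versa, which is why the pair is often handled together as "indels".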

Each of these operations has a cost associated with it to reflect how likely it is to have happened to the original sequence. Mutation/Substitution generally has a different cost for different substitutions (in bioinformatics, often taken from a BLOSUM or PAM matrix). Insertion and deletion are accounted for with some sort of gap penalty. In simple implementations, this penalty is often a constant cost per gap element, but in bioinformatics an affine gap penalty, which charges more for opening a gap than for extending one, is often more appropriate.
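
The difference between the two gap penalty schemes can be sketched as follows. All names and cost values here are assumptions for illustration (lower cost is better), and the affine convention used is "opening cost for the first gapped position plus an extension cost for each further position"; other conventions exist. A flat mismatch cost stands in for a real substitution matrix such as BLOSUM or PAM.

```python
def linear_gap_cost(length, per_element=2):
    """Constant cost for every gapped position."""
    return per_element * length


def affine_gap_cost(length, gap_open=5, gap_extend=1):
    """Opening a gap is expensive; extending an existing gap is cheap."""
    if length <= 0:
        return 0
    return gap_open + gap_extend * (length - 1)


def mismatch_cost(x, y):
    """Toy substitution cost; a real scorer would look up a matrix."""
    return 0 if x == y else 1


def alignment_cost(aligned_a, aligned_b, sub_cost, gap_cost, gap="-"):
    """Total cost of two already-aligned, equal-length gapped strings."""
    total = 0
    run = 0          # length of the gap run currently being read
    run_side = None  # which sequence that run is in ("a", "b" or None)
    for x, y in zip(aligned_a, aligned_b):
        side = "a" if x == gap else "b" if y == gap else None
        if side is None or side != run_side:
            total += gap_cost(run)  # close the previous run (0 if none)
            run = 0
        if side is None:
            total += sub_cost(x, y)
        else:
            run += 1
        run_side = side
    return total + gap_cost(run)    # close a run that reaches the end


one_long = alignment_cost("AC----GT", "ACTTTTGT", mismatch_cost, affine_gap_cost)
two_short = alignment_cost("AC--G--T", "ACTTGTTT", mismatch_cost, affine_gap_cost)
print(one_long, two_short)
```

Under the linear scheme both example alignments cost the same, because each has four gapped positions. Under the affine scheme the single long gap is cheaper than the two short ones, which matches the intuition that one insertion or deletion event can affect several adjacent elements.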

Multiple Sequence Alignment: Dynamic programming quickly becomes computationally intractable as the number of sequences being aligned increases. For this reason, multiple sequence alignment algorithms generally do not guarantee optimality. A variety of techniques are used:

  • Progressive alignment: In this technique, a series of pairwise alignments is used to build an overall multi-way alignment. Often the order in which these alignments are performed is determined by a hierarchical clustering algorithm like neighbor-joining or UPGMA. A number of tools exist for performing such alignments in bioinformatics, such as the Clustal family and T-Coffee.
  • Heuristic approaches: A wide variety of heuristics can be used for very large scale multiple sequence alignment. BLAST is by far the most popular tool for this in the case of bioinformatics.
  • Hidden Markov models: HMMs can be used to find the most likely alignments for a set of sequences. HMMER is a popular bioinformatics tool for this approach.
131 questions
1
vote
0 answers

Find maximum gap rate in Smith-Waterman algorithm

I am now working on the Smith-Waterman algorithm. I understand that increasing the gap penalty yields fewer gaps in my final alignment, but I need advice on how to control the maximum gap rate (the ratio of gapped characters in the detected characters)?…
bill
1
vote
1 answer

Aligning curves along the horizontal direction

I have some 'n' experimental curves for the same experimental conditions. Due to the inherent thermal drift in the system, the data sets are not exactly aligned with each other. I am looking for a robust algorithm that would align the data-curves…
1
vote
1 answer

MuscleCommandline not working in Biopython

I need to integrate my Python script with the muscle tool for multiple sequence alignment. I followed the Biopython tutorial, and here is my code: from Bio.Align.Applications import MuscleCommandline muscle_exe = "muscle.exe" in_file =…
Guido Muscioni
1
vote
2 answers

Draw lines connecting points between two separate one-D plots

As title, I am working on time-series alignment, and a visualization of the alignment result is desired. To this end, I want to draw lines connecting "anchor points" generated by the alignment algorithm. np.random.seed(5) x = np.random.rand(10) …
Francis
1
vote
1 answer

Algorithm to align numerical sequences

Hi, I have two sequences of numerical data, let's say S1: 1,6,4,9,8,7,5 and S2: 6,9,7,5, and I'd like to find a sequence alignment in both directions, left-to-right and right-to-left. I tried two techniques before asking; I actually used the Hungarian algorithm…
1
vote
0 answers

Extract aligned sections of FASTA to new file

I've already looked here and in other forums, but couldn't find the answer to my question. I want to design baits for a target enrichment Sequencing approach and have the output of a MarkerMiner search for orthologous loci from four different…
1
vote
1 answer

Calculate (mean) sequence divergence for many sequences

I have ~13K sequences of 120 bases each and I want to compare them to find things like conserved regions, the mean divergence between them, or strongly diverging outliers. The problem is that, with this number of sequences, the things I tried aren't feasible. So has…
voiDnyx
1
vote
1 answer

How to order multiple Fasta alignment files

I'm sure this is an easy-to-do thing, but I have very limited bioinformatic experience. I have many -100,000- FASTA files that contain alignments of different genes of the same 12 species. Each file looks something like…
1
vote
1 answer

MiPS ASM Recursion understanding problems?

Please help me understand this formula (in case anybody is wondering, that is the Needleman-Wunsch-algorithm), I am supposed to write a code that uses recursion but I don't understand how to do so, I already have the full dynamic version written, so…
1
vote
1 answer

coloring part of a sequence in format_alignment in biopython

I am using format_alignment to look at the pairwise alignment between two sequences. I want to highlight part of the sequence with a different color (say between base number 40 and base number 54) in the full alignment, so that it is clear to which…
Ssank
1
vote
1 answer

How does Biopython determine the root of a phylogenetic tree?

There are other packages, particularly ape for R, that build an unrooted tree then allow you to root it by explicitly specifying an outgroup. In contrast, in BioPython I can directly create a rooted tree without specifying the root, so I'm…
1
vote
2 answers

Multiple sequence alignment. Convert multi-line format to single-line format?

I have a multiple sequence alignment file in which the lines from the different sequences are interspersed, as in the format output by Clustal and other popular multiple sequence alignment tools. It looks like this: TGFb3_human_used_for_docking …
a06e
1
vote
0 answers

BioPerl: Annotate mismatches in an alignment

I'm reasonably new to perl and very new to BioPerl, so my apologies if this seems like a trivial question. I'm using Bio::AlignIO and Bio::SimpleAlign to generate pairwise alignments of sequences of interest to a reference sequence - in this case…
1
vote
1 answer

Multiple sequence alignment of 12 species

I need to perform MSA (multiple sequence alignment) on nucleotide sequences of 12 wheat varieties. All these varieties have different lengths in bp (base pairs). I followed this MATLAB documentation…
1
vote
1 answer

Prove that L >= G for Local and Global alignments of a specific function

I'm taking a bioinformatics class this semester and I'm having trouble with a specific question from the book. *Given two DNA sequences, S and T, of the same length n and let the scoring function be defined as follows: match = 1, mismatch = -1,…