Questions tagged [sequence-alignment]

A type of problem in which two or more sequences need to be lined up with each other, generally to identify similarities between them. These problems are common in bioinformatics, but the algorithms used to solve them are just as relevant to aligning other types of sequences, such as text strings. A variety of algorithms have been developed for dealing with various subsets of this problem.

Sequence alignment problems are a group of problems in which you have two or more sequences, generally with some potentially similar portions, that you want to line up so that the similar portions of each are associated. This is often an important component of calculating the similarity of the sequences.

Sequence alignment is frequently important in bioinformatics, in which sequences of DNA, RNA, or amino acids must be aligned in order to infer what mutations occurred where and when. However, sequence alignment problems occur in all domains in which there are sequences, such as in text matching.

Dynamic programming is the most commonly used technique for aligning two sequences (for multiple sequence alignment, see below). Starting from the first element of each sequence, each pair of elements is either aligned (if they match) or handled with one of the operations described below (if they don't match). For more detail, see this page.

The exact dynamic programming algorithm used depends on the specific problem at hand. Here are the two most common:

  • Needleman-Wunsch - Global alignment of two sequences (i.e. all letters in both sequences need to be used)
  • Smith-Waterman - Local alignment of two sequences (i.e. only a contiguous portion of each sequence needs to be used)
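
The difference between the two algorithms can be seen in a minimal scoring sketch. The function name `alignment_score` and the scoring values (match +1, mismatch -1, gap -1) are assumptions chosen for illustration, not a standard API. It returns only the optimal score, not the alignment itself; the local variant differs from the global one in that cells are never allowed to drop below zero and the best cell anywhere in the table is reported.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-1, local=False):
    """Return the optimal global (Needleman-Wunsch style) or
    local (Smith-Waterman style) alignment score of a and b.

    Illustrative scoring only: a fixed match/mismatch score and a
    constant penalty per gap position.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]

    if not local:
        # Global alignment: leading gaps must be paid for.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap

    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            cell = max(
                score[i - 1][j - 1] + pair,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,       # gap in b
                score[i][j - 1] + gap,       # gap in a
            )
            if local:
                # Local alignment: an alignment may restart anywhere.
                cell = max(cell, 0)
                best = max(best, cell)
            score[i][j] = cell

    # Global: score of aligning everything. Local: best cell anywhere.
    return best if local else score[-1][-1]


if __name__ == "__main__":
    print(alignment_score("ACGT", "AGT"))                      # global
    print(alignment_score("xxACGTxx", "yyACGTyy", local=True))  # local
```

Recovering the alignment itself would additionally require a traceback through the table, which is omitted here to keep the sketch short.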

Three main operations are generally allowed in sequence alignment. It's easiest to think of these operations as things that might have happened to one of the sequences to turn it into the other sequence:

  • Insertion: An element is inserted into one of the sequences. This is generally represented by adding a gap to the opposite sequence.
  • Deletion: An element is removed from one of the sequences. This is generally represented by adding a gap to that sequence.
  • Mutation/Substitution: An element is replaced with a different element.
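
As a rough illustration of how these operations show up in a finished alignment, the sketch below labels each column of two already-aligned strings. The helper name `describe_alignment`, the `-` gap character, and the convention of reading the operations as "what turned the first sequence into the second" are all assumptions made for this example.

```python
def describe_alignment(aligned_a, aligned_b, gap="-"):
    """Label each column of a gapped pairwise alignment.

    Operations are read as edits that turn aligned_a into aligned_b:
    a gap in aligned_a means an element was inserted, and a gap in
    aligned_b means an element was deleted.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")

    ops = []
    for x, y in zip(aligned_a, aligned_b):
        if x == gap and y == gap:
            raise ValueError("a column cannot be a gap in both sequences")
        if x == gap:
            ops.append("insertion")      # element exists only in b
        elif y == gap:
            ops.append("deletion")       # element exists only in a
        elif x == y:
            ops.append("match")
        else:
            ops.append("substitution")   # element replaced by another
    return ops


if __name__ == "__main__":
    for op in describe_alignment("GATTACA", "GA-TACC"):
        print(op)
```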

Each of these operations has a cost associated with it to reflect how likely it is to have happened to the original sequence. Mutation/Substitution generally has a different cost for different substitutions (often taken from a BLOSUM or PAM matrix in bioinformatics). Insertion and deletion are accounted for with some sort of gap penalty. In simple implementations, this penalty is often a constant cost per gap, but in bioinformatics an affine gap penalty is often more appropriate.
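
The two gap penalty schemes can be compared with a small sketch. The function names and the default values (a per-position cost of 1, a gap-open cost of 10, a gap-extend cost of 1) are assumptions for illustration. Conventions also differ between tools on whether the first gap position is charged the open cost alone or open plus extend; this sketch charges the open cost for the first position and the extend cost for each further one.

```python
def linear_gap_cost(length, per_position=1):
    """Constant cost for every gap position."""
    return per_position * length


def affine_gap_cost(length, gap_open=10, gap_extend=1):
    """Opening a gap is expensive; extending it is cheap.

    Assumed convention: the first gap position costs gap_open and
    each additional position costs gap_extend.
    """
    if length <= 0:
        return 0
    return gap_open + gap_extend * (length - 1)


if __name__ == "__main__":
    # One long gap versus five separate single-position gaps.
    print(linear_gap_cost(5), 5 * linear_gap_cost(1))
    print(affine_gap_cost(5), 5 * affine_gap_cost(1))
```

Under the linear scheme both cases cost the same, whereas the affine scheme makes a single long gap much cheaper than many scattered ones, which is why it tends to suit biological sequences where one event can insert or delete several residues at once.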

Multiple Sequence Alignment: Dynamic programming quickly becomes computationally intractable as the number of sequences being aligned increases. For this reason, multiple sequence alignment algorithms generally do not guarantee optimality. A variety of techniques are used:

  • Progressive alignment: In this technique, a series of pairwise alignments is used to build up an overall multi-way alignment. Often the order in which these alignments are performed is determined by a hierarchical clustering algorithm like neighbor-joining or UPGMA. A number of tools exist for performing such alignments in bioinformatics, such as the Clustal family and T-Coffee.
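  • Heuristic approaches: A wide variety of heuristics can be used for very large-scale sequence alignment. BLAST is by far the most popular tool for this in bioinformatics, although it performs heuristic local alignment of a query against a database rather than a true multiple sequence alignment.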
  • Hidden Markov models: HMMs can be used to find the most likely alignments for a set of sequences. HMMER is a popular bioinformatics tool for this approach.
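
To make the ordering step of progressive alignment more concrete, here is a minimal sketch of how a merge order might be derived by average-linkage clustering, in the spirit of UPGMA. Everything in it is an assumption for illustration: the names `identity_distance` and `merge_order` are hypothetical, and the distance is a deliberately crude position-by-position comparison rather than a score from a real pairwise alignment. Real tools such as those named above are considerably more sophisticated.

```python
from itertools import combinations


def identity_distance(a, b):
    """Crude distance: 1 minus the fraction of positions that match.

    Positions are compared without aligning first; a real pipeline
    would derive distances from pairwise alignments instead.
    """
    matches = sum(x == y for x, y in zip(a, b))
    return 1 - matches / max(len(a), len(b), 1)


def merge_order(seqs):
    """Return the order in which clusters of sequence indices merge,
    always joining the two clusters with the smallest average distance.
    """
    dist = {
        (i, j): identity_distance(seqs[i], seqs[j])
        for i, j in combinations(range(len(seqs)), 2)
    }

    def average(c1, c2):
        # Mean of all pairwise distances between the two clusters.
        total = sum(dist[tuple(sorted((i, j)))] for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    clusters = [(i,) for i in range(len(seqs))]
    merges = []
    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2), key=lambda p: average(*p))
        clusters.remove(c1)
        clusters.remove(c2)
        clusters.append(c1 + c2)
        merges.append((c1, c2))
    return merges


if __name__ == "__main__":
    for left, right in merge_order(["ACGT", "ACGA", "TTTT"]):
        print(left, "+", right)
```

In a progressive aligner, each merge in this list would correspond to one pairwise alignment step, with the most similar sequences aligned first.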
131 questions
0
votes
0 answers

Emboss needle() warning: "Sequence Character not found in ajSeqCvtKS" ...?

I am using EMBOSSwin's needle() command line function which performs pairwise global alignments, but I encounter a strange warning. So I have 24 pairs of amino acid sequences that need aligning, I run the needle() command from python using…
0
votes
2 answers

AlignIO gives 'AssertionError' when reading emboss alignment files

I have been stuck on a problem for three days... searched everywhere, posted on Biostar, still waiting for EMBL to respond to emails... would make a bounty if I had more rep. After aligning sequences with EMBOSSwin needle() (pairwise global…
0
votes
1 answer

R Genome Alignment Viewer

Currently, I have read in a genbank ptt file and used it to plot a genome in R using genoplotR plot_gene_map(dna_segs=list(mo),xlims=xlims,annotations=annotMED,annotation_height=5,main="Region",gene_type="side_blocks",dna_seg_scale=TRUE,…
0
votes
1 answer

Identifying the first byte in each block of USART data

Is there an accepted/effective means for designating/identifying the first byte in each block of a stream of 8 bit data where the blocks update and repeat? I am using GCC. These are control settings data being passed over a USART between two uC, and…
0
votes
1 answer

Multiple sequence alignment - appending to an alignment

I have a set of 520 influenza sequences for which I have already done multiple sequence alignment, and computed the pairwise identity matrix. If I'd like to add in another sequence, I have to re-align everything, and recompute the entire PWI matrix.…
ericmjl
-1
votes
1 answer

BWA-mem and sambamba read group line error

This is a two-part question: help interpreting an error; help with coding. I'm trying to run bwa-mem and sambamba to align raw reads to a reference genome and to sort by position. These are the commands I'm using: bwa mem \ -K 100000000 -v 3…
-1
votes
1 answer

Implementing Smith-Waterman algorithm for local alignment in python

I have created a sequence alignment tool to compare two strands of DNA (X and Y) to find the best alignment of substrings from X and Y. The algorithm is summarized here (https://en.wikipedia.org/wiki/Smith–Waterman_algorithm). I have been able to…
-1
votes
3 answers

global alignment sequence function

I'm trying to implement the Needleman-Wunsch algorithm to get the minimum score in the global alignment function, but instead of getting the minimum score of 0 when both the sequences are equal I get 8. What is the problem with this code? alphabet =…
-1
votes
2 answers

Clustal Omega in Command Line

Below is what I am getting when I type ./configure on terminal while inside the clustal omega package. Welcome to Clustal Omega - version 1.2.1 (AndreaGiacomo) +NMMMMMMMMMS= MMMMM? :MMMMM8 …
-3
votes
1 answer

Globally align two strings and return the index of mismatches and inserted/missing characters in python

Suppose I have two strings of equal length: s1 = 'tommy' s2 = 'tammi' How would I write a function that would return the index of the mismatch, like so: s1 = 'tommy' s2 = 'tammi' mismatch = Get_Misalignment_Index(s1, s2) print(mismatch) [1,…
jhurst5
-3
votes
5 answers

html/css, changing each letter of text?

Is it possible to change color of each letter of a text, for example, I print on screen in tags text, and i want to iterate to every letter, check its value and change its color accordingly, is that possible in using html/css or javascript to add…
tan