Questions tagged [sequence-alignment]

A type of problem in which two or more sequences need to be lined up with each other, generally to identify similarities between them. These problems are common in bioinformatics, but the algorithms used to solve them are just as relevant to aligning other types of sequences, such as text strings. A variety of algorithms have been developed for dealing with various subsets of this problem.

Sequence alignment problems are a group of problems in which you have two or more sequences, generally with some potentially similar portions, that you want to line up so that the similar portions of each are associated. This is often an important component of calculating the similarity of the sequences.

Sequence alignment is frequently important in bioinformatics, in which sequences of DNA, RNA, or amino acids must be aligned in order to infer what mutations occurred where and when. However, sequence alignment problems occur in all domains in which there are sequences, such as in text matching.

Dynamic programming is the most commonly used technique for aligning two sequences (for multiple sequence alignment, see below). Starting from the first element of each sequence, each pair of elements is either aligned (if they match) or handled with one of the operations described below (if they don't match). The best score for each pair of prefixes is stored in a table, so every subproblem is solved only once.

The exact dynamic programming algorithm used depends on the specific problem at hand. Here are the two most common:

  • Needleman-Wunsch - Global alignment of two sequences (i.e., all elements of both sequences need to be used)
  • Smith-Waterman - Local alignment of two sequences (i.e., only a contiguous region of each sequence needs to be used)
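As a rough illustration of the two algorithms above, here is a minimal Python sketch. The function names and the scoring values (match, mismatch, and a constant per-gap cost) are assumptions chosen for the example, not values prescribed by either algorithm; real applications usually use a substitution matrix and tuned gap penalties.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment sketch: returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                              score[i - 1][j] + gap,      # gap in b
                              score[i][j - 1] + gap)      # gap in a
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                out_a.append(a[i - 1])
                out_b.append(b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Local alignment sketch: returns only the best local score."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            # The 0 lets a local alignment restart anywhere in the table.
            cur[j] = max(0, prev[j - 1] + sub, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best
```

The only structural differences between the two are the zero floor and the fact that the local score is the maximum over the whole table rather than the bottom-right cell. When several alignments tie for the best score, the traceback here returns just one of them.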

Three main operations are generally allowed in sequence alignment. It's easiest to think of these operations as things that might have happened to one of the sequences to turn it into the other sequence:

  • Insertion: An element is inserted into one of the sequences. This is generally represented by adding a gap to the opposite sequence.
  • Deletion: An element is removed from one of the sequences. This is generally represented by adding a gap to that sequence.
  • Mutation/Substitution: An element is replaced with a different element.
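To make the three operations concrete, the following Python sketch labels each column of an already-aligned pair. The function name `classify_columns`, the `-` gap character, and the convention that the first sequence is the "original" and the second the "derived" one are all assumptions made for this example.

```python
def classify_columns(aligned_a, aligned_b, gap="-"):
    """Label each alignment column, treating aligned_a as the original
    sequence and aligned_b as the sequence derived from it."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    ops = []
    for x, y in zip(aligned_a, aligned_b):
        if x == gap and y == gap:
            raise ValueError("a column cannot be a gap in both sequences")
        if x == gap:
            ops.append("insertion")     # element exists only in the derived sequence
        elif y == gap:
            ops.append("deletion")      # element was removed from the derived sequence
        elif x == y:
            ops.append("match")
        else:
            ops.append("substitution")  # element replaced with a different one
    return ops
```

For example, `classify_columns("GA-TC", "GTCT-")` labels the columns as match, substitution, insertion, match, deletion. Swapping the two arguments swaps the insertion and deletion labels, which is why the two are often grouped together as "indels".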

Each of these operations has an associated cost that reflects how likely it is to have happened to the original sequence. Mutation/Substitution generally has a different cost for different substitutions (in bioinformatics, often taken from a BLOSUM or PAM matrix). Insertion and deletion are accounted for with some sort of gap penalty. In simple implementations, this penalty is often a constant cost per gap position, but in bioinformatics an affine gap penalty, which charges more for opening a gap than for extending one, is often more appropriate.
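The difference between a constant per-gap cost and an affine gap penalty can be sketched in a few lines of Python. The function names and the cost values (5 to open, 1 to extend) are assumptions for illustration only, and conventions differ on whether the opening cost already covers the first gap position; this sketch assumes it does.

```python
from itertools import groupby


def linear_gap_cost(length, per_gap=1):
    """Constant cost for every gap position."""
    return per_gap * length


def affine_gap_cost(length, gap_open=5, gap_extend=1):
    """Opening cost for the first position, extension cost for each further one."""
    if length == 0:
        return 0
    return gap_open + gap_extend * (length - 1)


def total_gap_cost(aligned, cost_fn, gap="-"):
    """Sum cost_fn over every maximal run of gap characters in one aligned sequence."""
    return sum(cost_fn(len(list(run)))
               for is_gap, run in groupby(aligned, key=lambda c: c == gap)
               if is_gap)
```

With these values, `AC---GT` and `A-C-G-T` both cost 3 under the linear scheme, but 7 and 15 respectively under the affine scheme, so the affine penalty favors one long gap over several scattered ones.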

Multiple Sequence Alignment: Dynamic programming quickly becomes computationally intractable as the number of sequences being aligned increases. For this reason, multiple sequence alignment algorithms generally do not guarantee optimality. A variety of techniques are used:

  • Progressive alignment: In this technique, a series of pairwise alignments is used to build up an overall multi-way alignment. Often the order in which these alignments are performed is determined by a hierarchical clustering algorithm like neighbor-joining or UPGMA. A number of tools exist for performing such alignments in bioinformatics, such as the Clustal family and T-Coffee.
  • Heuristic approaches: A wide variety of heuristics can be used for very large-scale sequence comparison. BLAST is by far the most popular heuristic tool in bioinformatics, although it searches a database for local pairwise matches rather than building a full multiple alignment.
  • Hidden Markov models: HMMs can be used to find the most likely alignments for a set of sequences. HMMER is a popular bioinformatics tool for this approach.
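To see why exact dynamic programming does not scale to many sequences, consider the size of the table: extending the pairwise recurrence to k sequences needs one cell per combination of prefix lengths, and each cell looks at every non-empty subset of sequences that could advance. The helper names below are hypothetical and the sketch only counts work; it does not perform an alignment.

```python
def dp_table_cells(lengths):
    """Number of cells in the full DP table for sequences of the given lengths."""
    cells = 1
    for n in lengths:
        cells *= n + 1  # prefixes of length 0..n
    return cells


def predecessors_per_cell(k):
    """Each cell considers every non-empty subset of the k sequences advancing."""
    return 2 ** k - 1


for k in (2, 3, 5):
    print(k, dp_table_cells([100] * k), predecessors_per_cell(k))
```

For sequences of length 100, two sequences need about ten thousand cells, while five need about ten billion, which is why the techniques listed above trade guaranteed optimality for speed.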
131 questions
0
votes
1 answer

Two closely matching files: get corresponding lines?

I'm in a situation where I'm programmatically generating LaTeX code, and I want my Synctex to point to the correct lines in the original file. The generation is basically doing template expansion, so the original files are nearly identical to the…
jmite
0
votes
1 answer

clip 3 signals with cross correlation (finddelay)

in Matlab I am able to clip/trim pairs of audio signals (same frequency) using finddelay as follows, so that they are aligned and have the same length: d12 = finddelay(s1,s2); if(d12 < 1) start1 = -d12+1; start2 = 1; end1 =…
0
votes
1 answer

Sequence alignment with arbitrary gap penalty

I want to align two DNA sequences in an optimal way, but I have the gap penalty function of length L, that if L is a multiple of 3, the penalty is a * L for some constant a. If L is not a multiple of 3, then the penalty is b * L for some constant b.…
Ted
0
votes
1 answer

Adjusting the MAFFT command line algorithm to better account for gaps

I've been attempting to use the MAFFT command line tool as a means to identify coding regions within a genome. My general process is to align the amino acid consensus sequence of a gene to a translated reading frame of a target sequence. My method…
Ghoti
0
votes
1 answer

Re-numbering residues in PDB file with biopython

I have a sequence alignment as: RefSeq :MXKQRSLPLXQKRTKQAISFSASHRIYLQRKFSH ..... Templatepdb:-----------------ISFSASHR------FSHAQADFAG I am trying to write a code that re-number residues based on this alignment in PDB file as: original pdb :…
0
votes
0 answers

Global alignment method for decimal sequence and distance

Hi, I'd like to align two decimal sequences to calculate the distance between them and be able to compute a sum of them. For example, let S1 and S2 be these two sequences: S1=[0.568,0.469,0.3658,0.31667] S2=[0.64918,1.16] This is only a random…
0
votes
2 answers

indices of alignment for a list of strings to string

I need a function to give the indices for which a list of strings is best aligned to a larger string. For example: Given the string: text = 'Kir4.3 is a inwardly-rectifying potassium channel. Dextran-sulfate is useful in glucose-mediated…
chase
0
votes
1 answer

Write 2 button in HTML DOM on the right side of the screen from left to right

I wrote my DOM structure like : and it will display like :…
Shivam
0
votes
1 answer

why my sed script to split FASTA file is slow?

I have a 600 Mb FASTA file containing many alignment blocks from 12 species, and I want to split them into smaller FASTA files containing one block each with its corresponding alignments. I have a sed script that looks like this: #!/bin/bash echo for…
NKGon
0
votes
1 answer

When trying to create lists of CA atoms, I get the following error "key error 'CA' when executing the following code

For the following code, when I execute the code I get an error, which I've listed below. I was wondering if anyone could give me any insights into how to append the CA atoms into tag_atoms/tagged_atoms lists, which I will use for alignment. And…
0
votes
1 answer

Building NER using Sequence Alignment algorithms

Background: Wikipedia page on Sequence Alignment says that DNA Sequence Alignment algorithms can also be used for Natural Language Processing. Question: Because Named Entity Recognizer and DNA Sequence Libraries both do Approximate String Matching -…
0
votes
1 answer

Algorithm for aligning elliptical shapes

I'm looking for an algorithm to align elliptical shapes that is capable of handling "missing data." Rough sketch: In this case, we would like to align all shapes to shape #1. I looked around for "convex shape alignment" and "elliptical shape…
user2398029
0
votes
0 answers

Sequence Alignment: Avoid improbable alignments

I am using an Algorithm equivalent to the Needleman-Wunsch Algorithm to do fuzzy sequence matching using a similarity matrix. Some of the results are near optimal: SIL d e: n SIL A+ r t i: k E+ l SIL SIL A+ f t @ SIL b u: @…
Zotta
0
votes
1 answer

What is the typical size of the sequence files while conducting pairwise sequence alignments?

What is the typical size of the sequence files while conducting pairwise sequence alignments? Can we align the whole genome of organisms?
0
votes
0 answers

Visualize DNA sequence on JSP Struct Web Page

We have to visualize the DNA sequence alignment, similar to the BLAST visualizer, like below. Our project is a web-based one with a Java back end using JSP and Struct. Currently needed way to…