Questions tagged [sequence-alignment]

A type of problem in which two or more sequences need to be lined up with each other, generally for the purpose of identifying similarities between them. These problems are common in bioinformatics, but the algorithms used to solve them are just as relevant to aligning other types of sequences, such as text strings. A variety of algorithms have been developed for dealing with various subsets of this problem.

Sequence alignment problems are a group of problems in which you have two or more sequences, generally with some potentially similar portions, that you want to line up so that the similar portions of each are associated. This is often an important component of calculating the similarity of the sequences.

Sequence alignment is frequently important in bioinformatics, in which sequences of DNA, RNA, or amino acids must be aligned in order to infer what mutations occurred where and when. However, sequence alignment problems occur in all domains in which there are sequences, such as in text matching.

Dynamic programming is the most commonly used technique for aligning two sequences (for multiple sequence alignment, see below). Starting from the first element of each sequence, each pair of elements is either aligned (if they match) or handled with one of the operations described below (if they don't match). For more detail, see this page.

The exact dynamic programming algorithm used depends on the specific problem at hand. Here are the two most common:

  • Needleman-Wunsch - Global alignment of two sequences (i.e. all letters in both sequences need to be used)
  • Smith-Waterman - Local alignment of two sequences (i.e. only a contiguous region of each sequence needs to be used)
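
As a rough illustration of the two algorithms listed above, here is a minimal Python sketch. The scoring values (match +1, mismatch -1, a constant gap cost of -1) and the function names are assumptions chosen for the example, not part of either algorithm's definition; real applications usually use a substitution matrix and a more careful gap model.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment sketch: returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap          # a[:i] aligned against gaps only
    for j in range(1, m + 1):
        score[0][j] = j * gap          # b[:j] aligned against gaps only

    def sub(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


def smith_waterman_score(a, b, match=1, mismatch=-1, gap=-1):
    """Local alignment sketch: returns only the best local score."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The 0 lets a local alignment restart anywhere.
            cur[j] = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best
```

With these assumed scores, `needleman_wunsch("ACGT", "AGT")` returns `(2, "ACGT", "A-GT")`, and `smith_waterman_score("XXACGTXX", "YYACGTYY")` returns `4`, the score of the shared `ACGT` region.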

Three main operations are generally allowed in sequence alignment. It's easiest to think of these operations as things that might have happened to one of the sequences to turn it into the other sequence:

  • Insertion: An element is inserted into one of the sequences. This is generally represented by adding a gap to the opposite sequence.
  • Deletion: An element is removed from one of the sequences. This is generally represented by adding a gap to that sequence.
  • Mutation/Substitution: An element is replaced with a different element.
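
The three operations above can be read directly off a finished alignment. The sketch below labels each column of a gapped pair of strings; it assumes `-` marks a gap and describes the operations as changes that turn the first sequence into the second. The function name is made up for the example.

```python
def classify_columns(aligned_a, aligned_b, gap="-"):
    """Label each alignment column, viewing aligned_a as the original sequence."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    ops = []
    for x, y in zip(aligned_a, aligned_b):
        if x == gap and y == gap:
            raise ValueError("a column cannot be a gap in both sequences")
        if x == gap:
            ops.append("insertion")      # element present only in aligned_b
        elif y == gap:
            ops.append("deletion")       # element present only in aligned_a
        elif x == y:
            ops.append("match")
        else:
            ops.append("substitution")
    return ops
```

For example, `classify_columns("AC-GT", "A-TGA")` gives `["match", "deletion", "insertion", "match", "substitution"]`.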

Each of these operations has a cost associated with it to reflect how likely it is to have happened to the original sequence. Mutation/Substitution generally has a different cost for different substitutions (in bioinformatics, often taken from a BLOSUM or PAM matrix). Insertion and deletion are accounted for with some sort of gap penalty. In simple implementations, this penalty is often a constant cost per gap, but in bioinformatics an affine gap penalty is often more appropriate.
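
To make the difference between a constant and an affine gap penalty concrete, here is a small sketch that scores an already-aligned pair. It assumes one common affine convention (the first column of a gap run costs `gap_open`, each further column costs `gap_extend`); other conventions exist. The `toy_score` function uses arbitrary values and is not taken from BLOSUM or PAM.

```python
def score_alignment(aligned_a, aligned_b, sub_score,
                    gap_open=-5, gap_extend=-1, gap="-"):
    """Score a gapped alignment with an affine gap penalty (assumed convention)."""
    total = 0
    prev_gap_in = None  # which sequence held the gap in the previous column
    for x, y in zip(aligned_a, aligned_b):
        if x == gap or y == gap:
            gap_in = "a" if x == gap else "b"
            # Continuing a run in the same sequence is cheaper than opening one.
            total += gap_extend if gap_in == prev_gap_in else gap_open
            prev_gap_in = gap_in
        else:
            total += sub_score(x, y)
            prev_gap_in = None
    return total


def toy_score(x, y):
    """Hypothetical substitution scores: +2 for a match, -1 otherwise."""
    return 2 if x == y else -1
```

With these assumed values, `score_alignment("ACGTT", "A---T", toy_score)` is `-3` (2 - 5 - 1 - 1 + 2). Setting `gap_extend` equal to `gap_open` reproduces a constant per-gap cost, which gives `-11` for the same alignment, so the affine model penalizes one long gap much less than three separate ones.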

Multiple Sequence Alignment: Dynamic programming quickly becomes computationally expensive as the number of sequences being aligned increases. For this reason, multiple sequence alignment algorithms generally do not guarantee optimality. A variety of techniques are used:

  • Progressive alignment: In this technique, a series of pairwise alignments is used to create an overall multi-way alignment. Often the order in which these alignments are performed is determined by a hierarchical clustering algorithm like neighbor-joining or UPGMA. A number of tools exist for performing such alignments in bioinformatics, such as the Clustal family and T-Coffee.
  • Heuristic approaches: A wide variety of heuristics can be used for very large scale multiple sequence alignment. BLAST is by far the most popular tool for this in the case of bioinformatics.
  • Hidden Markov models: HMMs can be used to find the most likely alignments for a set of sequences. HMMER is a popular bioinformatics tool for this approach.
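
To see why exact dynamic programming does not scale to many sequences, it helps to count the work involved. Extending the pairwise table to k sequences gives one cell per combination of prefix lengths, and each cell looks at up to 2^k - 1 neighbouring cells. The helper names below are made up for this back-of-the-envelope sketch.

```python
import math


def dp_cells(lengths):
    """Number of cells in the full DP table for sequences of the given lengths."""
    return math.prod(n + 1 for n in lengths)


def predecessors_per_cell(k):
    """Each cell considers every non-empty subset of the k sequences advancing."""
    return 2 ** k - 1
```

Two sequences of length 100 need `dp_cells([100, 100]) == 10201` cells with 3 predecessors each, while five such sequences already need `dp_cells([100] * 5) == 10510100501` cells with 31 predecessors each.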
131 questions
0
votes
2 answers

Smith-Waterman Implement in python

I want to write the first part of the Smith-Waterman algorithm in python with basic functions. I found this example, but it doesn't give me what I'm looking for. def zeros(X: int, Y: int): # ^ ^ incorrect type annotations. should…
user8769986
0
votes
1 answer

Aligning lists of structs for patch analysis

I am currently reverse engineering a regularly updated multiplayer game. The networking protocol uses a custom serialization framework and I am now able to restore a lot of information about the messages that are being exchanged. For each message I…
ACB
0
votes
0 answers

Sequence Alignment problem using Pthreads

I am trying to implement Sequence alignment problem (Needleman-Wunsch algorithm) using p-Threads. I am confused how to map the concept of multi threading in this particular serial problem. The code of serial computation of sequence alignment is…
0
votes
0 answers

How to realize Multisequence Alignment with BioPython

I have multiple strings representing protein sequences ( for example ADADAAA,ADADDDCDAA and ACCC), I want to realize MSA on those such that resulting sequences have the same length. Biopython documentation seems only to explain how to handle…
0
votes
0 answers

Alignment of two 2D vectors with different length in Python or MATLAB

I am looking for algorithms/methods for performing a forced alignment of 2D sequences that have different lengths. I have extracted data that traces the mouth movements of different people saying the same thing. Since people say things at different…
NeuralNew
0
votes
1 answer

Connecting input sentences with overlapping words

The task is to connect the input sentences which are overlapping. My problem is how to remove the overlapping parts properly. Input: first line is number of sentences to be connected. Next following lines are sentences. Output: connected…
Patrick132
0
votes
0 answers

Bio.Align using smith-waterman local alignment causes memory leak

I have a list of permutations of the DNA sequences where the alignment score of the sequence pairs is obtained. I don't know why this process is causing memory leak when the permutation list is big, because the aligner object has created in each…
Kadu
0
votes
0 answers

Implementing global sequence alignment

i am going to create a needleman-wunsch global sequence alignment. But my answer is wrong. Please help me check my code. Sometimes when two sequence match, but it will still runs the mismatch function. Ignore my poor english , thanks. #include…
derk
0
votes
1 answer

What is the best way to compare strings to find matching words in Python?

I have two texts, text A and text B. Text B isn't an exact copy of text A, it has a lot of special characters which aren't in text A, but it is technically the same text. I need to compare the strings and map the counterparts in Text B to their…
AdeDoyle
0
votes
1 answer

How do I match tokens in similar (but not identical strings) so that I can share POS tags from one string to another?

I have a large corpus of text, split into sentences. I have two versions of each sentence, one version has POS-tagged tokens. I want to POS tag everything in version 1. I want to do this by replacing the words in version 1 with their POS-tagged…
AdeDoyle
0
votes
1 answer

SAM (Sequence Alignment/Map) Format Alignment Tags

I am using samtools to remove duplicates. To mark and then remove duplicates markdup relies on ms (mate score) and MC (mate cigar) tags that fixmates provides. Does anyone knows exactly what are these tags? How is fixmates doing? Thanks for the…
0
votes
1 answer

MemoryError from BioPython's Align.PairwiseAligner()

I'm trying to write a Python3 script that performs a global alignment of two sequences, of ~ 10 kb and 11 kb length. Both are very similar to each other. (I'm trying to find the few points where they do not match, one of which I know is close to the…
CodingCat
0
votes
1 answer

Is there an R function that returns the alignment score of aligned DNA sequences?

I want to take two strings (DNA Sequences) and generate an alignment score. I found the DECIPHER package but that only let me generate the alignment, not the alignment score. I also tried using "Biostrings", but I was unable to generate the…
0
votes
2 answers

Extracting subset alignment based in sequence name using SeqinR

I'm trying to extract a set of aligned sequences from a sequence alignment (alignment object) with SeqinR. Below is the dput() of an alignment (an S4 object) structure(list(nb = 39, nam = c("Lip4", "pdb|5FRD|A", "pdb|1M33|A", "pdb|5H3H|B",…
Aureliano Guedes
0
votes
0 answers

Aligning sequence and comparing it against primer

I am looking to show how a primer is consistent among some genomic data. I have a primer of about 23bp and looking to compare it to about 5000 genomic sequences of 10kb. Since that is too much for my computer to do, I wanted to do that following: > …
Colin