Questions tagged [sequence-alignment]

A type of problem in which two or more sequences need to be lined up with each other, generally to identify similarities between them. These problems are common in bioinformatics, but the algorithms used to solve them are just as relevant to aligning other types of sequences, such as text strings. A variety of algorithms have been developed for various subsets of this problem.

Sequence alignment problems are a group of problems in which you have two or more sequences, generally with some potentially similar portions, that you want to line up so that the similar portions of each are associated. This is often an important component of calculating the similarity of the sequences.

Sequence alignment is frequently important in bioinformatics, in which sequences of DNA, RNA, or amino acids must be aligned in order to infer what mutations occurred where and when. However, sequence alignment problems occur in all domains in which there are sequences, such as in text matching.

Dynamic programming is the most commonly used technique for aligning two sequences (for multiple sequence alignment, see below). Starting from the first element of each string, each pair of elements is either aligned (if they match) or handled with one of the operations described below (if they don't match).

The exact dynamic programming algorithm used depends on the specific problem at hand. Here are the two most common:

  • Needleman-Wunsch - Global alignment of two sequences (i.e. all letters in both sequences need to be used)
  • Smith-Waterman - Local alignment of two sequences (i.e. only a contiguous region of each sequence needs to be used)
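
As a rough illustration of how the two algorithms differ, here is a minimal Python sketch that computes only the alignment score (no traceback). The function name `alignment_score` and the scoring values (match +1, mismatch -1, gap -1) are assumptions chosen for the example, not part of either algorithm's definition. The only differences between the global and local variants are the initialization of the first row and column, the floor at zero, and where the final score is read from.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-1, local=False):
    """Best alignment score of a and b with a constant cost per gap.

    local=False: global alignment score (Needleman-Wunsch style).
    local=True:  local alignment score (Smith-Waterman style).
    Scoring values are illustrative assumptions.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]

    if not local:
        # Global alignment: leading gaps are penalized.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap

    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            cell = max(
                score[i - 1][j - 1] + pair,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,       # gap in b
                score[i][j - 1] + gap,       # gap in a
            )
            if local:
                # Local alignment: a region may restart from score 0.
                cell = max(cell, 0)
                best = max(best, cell)
            score[i][j] = cell

    # Global: score of aligning everything. Local: best cell anywhere.
    return best if local else score[-1][-1]


print(alignment_score("GATTACA", "GATACA"))  # 5: six matches, one gap
print(alignment_score("TTTGATTACATTT", "CCGATTACACC", local=True))  # 7
```

Recovering the alignment itself, rather than just its score, requires an additional traceback through the matrix, which this sketch omits.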

Three main operations are generally allowed in sequence alignment. It's easiest to think of these operations as things that might have happened to one of the sequences to turn it into the other sequence:

  • Insertion: An element is inserted into one of the sequences. This is generally represented by adding a gap to the opposite sequence.
  • Deletion: An element is removed from one of the sequences. This is generally represented by adding a gap to that sequence.
  • Mutation/Substitution: An element is replaced with a different element.
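
To make the three operations concrete, the following Python sketch labels each column of an already-aligned pair. It assumes the first string is treated as the original sequence, the second as the result, and `-` as the gap character; the function name `describe_alignment` is hypothetical.

```python
def describe_alignment(aligned_a, aligned_b, gap="-"):
    """Label each column of a pairwise alignment.

    Assumes aligned_a is the original sequence and aligned_b the result,
    both already padded with the gap character to equal length.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")

    ops = []
    for x, y in zip(aligned_a, aligned_b):
        if x == gap and y == gap:
            raise ValueError("a column cannot be a gap in both sequences")
        if x == gap:
            ops.append("insertion")     # element exists only in the result
        elif y == gap:
            ops.append("deletion")      # element was removed from the original
        elif x == y:
            ops.append("match")
        else:
            ops.append("substitution")  # element replaced by a different one
    return ops


print(describe_alignment("A-C", "AGT"))
# ['match', 'insertion', 'substitution']
```

Which sequence counts as the "original" is a matter of convention: swapping the two arguments turns every insertion into a deletion and vice versa.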

Each of these operations has a cost associated with it, reflecting how likely it is to have happened to the original sequence. Mutation/Substitution generally has a different cost for different substitutions (in bioinformatics, often taken from a BLOSUM or PAM matrix). Insertion and deletion are accounted for with some sort of gap penalty. In simple implementations, this penalty is often a constant cost per gap, but in bioinformatics an affine gap penalty is often more appropriate.
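
The difference between the two gap penalty schemes can be sketched in Python as follows. The function names and the default penalty values are assumptions for illustration, and conventions for the affine form vary; this sketch charges the opening penalty for the first gap position and the extension penalty for each additional one.

```python
def linear_gap_cost(length, per_gap=2):
    """Constant cost for every gap position (per_gap is an illustrative value)."""
    return per_gap * length


def affine_gap_cost(length, gap_open=5, gap_extend=1):
    """Affine cost: expensive to open a gap, cheap to extend it.

    Assumed convention: gap_open covers the first gap position and
    gap_extend is charged for each further position.
    """
    if length == 0:
        return 0
    return gap_open + gap_extend * (length - 1)


# One gap of length 3 versus three separate gaps of length 1:
print(linear_gap_cost(3), 3 * linear_gap_cost(1))  # 6 6
print(affine_gap_cost(3), 3 * affine_gap_cost(1))  # 7 15
```

Under the linear scheme the two cases cost the same, while the affine scheme favors one long gap over several short ones.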

Multiple Sequence Alignment: Dynamic programming quickly becomes computationally intractable as the number of sequences being aligned increases. For this reason, multiple sequence alignment algorithms generally do not guarantee optimality. A variety of techniques are used:

  • Progressive alignment: In this technique, a series of pairwise alignments is used to create an overall multi-way alignment. Often the order in which these alignments are performed is determined by a hierarchical clustering algorithm like neighbor-joining or UPGMA. A number of tools exist for performing such alignments in bioinformatics, such as the Clustal family and T-Coffee.
  • Heuristic approaches: A wide variety of heuristics can be used for very large scale sequence alignment. BLAST, a heuristic local-alignment search tool, is by far the most popular tool of this kind in bioinformatics.
  • Hidden Markov Models: HMMs can be used to find the most likely alignments for a set of sequences. HMMER is a popular bioinformatics tool for this approach.
131 questions
1
vote
1 answer

dynamic programming filling matrix in sequence alignment

Hello, I have a 2D char array opt[][] that holds two sequences. For example, `opt[0][0]=A opt[0][1]=T opt[0][2]=G opt[0][3]=A` and `opt[1][0]=A opt[2][0]=G opt[3][0]=C opt[4][0]=T`. I currently have this output …
1
vote
1 answer

How do I group similar strings in R?

I have a database with ~5,000 locality names, most of which are repetitions with typos, permutations, abbreviations, etc. I would like to group them by similarity, to speed up further processing. The best would be to convert each variation into a…
Rodrigo
1
vote
2 answers

Printing a MultipleSeqAlignment Object

I have an alignment of 3 sequences generated by clustalx: AAAACGT Alpha AAA-CGT Beta AAAAGGT Gamma. I can slice the alignment with the predefined indexing in Biopython via align[:,:4]. However, printing the result gives: AAAA Alpha AAA- Beta AAAA…
ifreak
0
votes
0 answers

R Error: missing value where TRUE/FALSE needed

I have a phylip formatted text file of 300+ aligned COI sequences. I am trying to condense sequences into haplotypes for analysis using an R script written by a friend. The part I am having trouble with is where the program compares each sequence to…
0
votes
1 answer

How to query latest value of large amount of devices that are all aligned in Apache IoTDB?

IoTDB> select last * from root.station01.cell.bms01.bunch01.** limit 10 align by device; Msg: 701: Last query doesn't support align by device. I tried this statement in Apache IoTDB: select last * from root.station01.cell.bms01.bunch01.** limit…
0
votes
1 answer

How to divide a data frame in R by character alignment?

I have a data frame with sequences of peptides in the row "ID". I have the sequences grouped into many groups with around 2-10 rows per group. The groups contain some peptides that align almost perfectly (up to 4 differences in characters) and…
Nina
0
votes
0 answers

BWA alignment "fail to locate the index files"

This question has been asked previously, but unfortunately for me the solutions posted did not resolve my issue. I am trying to use BWA to align my ddradseq paired end reads to a reference genome, and keep running into the issue of the program…
0
votes
0 answers

msaPrettyPrint not generating graph in R

I've been trying to run the example in the msa documentation. I get a fasta file, but no graph. Instead I get text detailing the msa file. I don't need to create a pdf, I just want to see the graph in Rstudio. filepath <- system.file("examples",…
delphine
0
votes
0 answers

GGMSA Multiple Sequence Alignment WARNING- aligning 2 out of 5 protein sequences

newbie, again. I'm trying to run a msa using ggmsa. All sequences are short and simple, yet I'm still trying to figure out errors. Here's the problem (btw, don't judge me on my coding cause I know how bad I am in this, lol) galanin_table <-…
0
votes
0 answers

Protein pairwise alignment c++

Hi, I'm trying to implement a pairwise alignment algorithm for protein sequences. I have the exact same algorithm working for DNA sequences, but for some reason when I copy it over and implement it with protein it doesn't. The problem with the…
TheLordRaj
0
votes
1 answer

writing FASTA file output in R

I am trying to perform Multiple Sequence Alignment using ClustalW. The code works, and I am able to see the alignments on my terminal in R. Below is the code that I wrote. library(BiocManager) library(msa) library(Biostrings) mySequences <-…
thole
0
votes
2 answers

Counting frequency of amino acids at each position in multiple-sequence alignments

I'm wondering if anyone knows any tools which allow me to count the frequency of amino acids at any specific position in a multiple-sequence alignment. For example if I had three sequences: Species 1 - MMRSA Species 2 - MMLSA Species 3 - MMRTA…
Kodewings
0
votes
2 answers

Removing text from a fasta gene name between two characters

I have a large codon alignment that has a variety of gene names in the headers. The headers are in the following format: >ENST00000357033.DMD.-1 | CODON | REFERENC I want to modify all of the headers in the fasta to exclude all characters after the…
0
votes
1 answer

Sequence alignment given two strings

I have two sequences and I need to perform sequence alignment to determine all possible sequence alignments. I have created the matrix and managed to find 16 alignments. I wanted to understand if I had correctly approached this because I need to…
0
votes
1 answer

How to convert from seqinr SeqFastadna object to Biostrings DNAStringSet for multiple sequence alignment in R

I am working with DNA sequence data in fasta files, and have to work only in R for this project. I do some manipulations using the seqinr package (selecting a subset of sequences, altering the fasta headers etc). For the next stage in the analysis I…
Will Hamilton