Questions tagged [sequence-alignment]

A type of problem in which two or more sequences need to be lined up with each other, generally for the purpose of identifying similarities between them. These problems are common in bioinformatics, but the algorithms used to solve them are just as relevant to aligning other types of sequences, such as text strings. A variety of algorithms have been developed for dealing with various subsets of this problem.

Sequence alignment problems are a group of problems in which you have two or more sequences, generally with some potentially similar portions, that you want to line up so that the similar portions of each are associated. This is often an important component of calculating the similarity of the sequences.

Sequence alignment is frequently important in bioinformatics, in which sequences of DNA, RNA, or amino acids must be aligned in order to infer what mutations occurred where and when. However, sequence alignment problems occur in all domains in which there are sequences, such as in text matching.

Dynamic programming is the most commonly used technique for aligning two sequences (for multiple sequence alignment, see below). Starting from the first element of each string, each pair of elements is either aligned (if they match) or handled with one of the operations described below (if they don't match). For more detail, see this page.
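
As a rough illustration of the dynamic programming idea, here is a minimal Python sketch of a global pairwise alignment in the style of Needleman-Wunsch. The function name `global_align` and the scoring values (match = 1, mismatch = -1, gap = -1) are assumptions chosen for the example, not values prescribed by any particular tool.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment of two strings with a linear gap penalty (illustrative sketch)."""
    n, m = len(a), len(b)

    def sub(i, j):
        # Score for placing a[i-1] opposite b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align the two elements
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1

    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))
```

With these assumed scores, `global_align("ATGCTA", "ATGTA")` gives a score of 4 and lines the sequences up as `ATGCTA` over `ATG-TA`. When several alignments tie, the traceback above simply returns one of them.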

The exact dynamic programming algorithm used depends on the specific problem at hand. Here are the two most common:

  • Needleman-Wunsch - Global alignment of two sequences (i.e. all letters in both sequences need to be used)
  • Smith-Waterman - Local alignment of two sequences (i.e. only a subsequence of each string needs to be used)
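
To show how the local variant differs from the global one, here is a small Python sketch in the style of Smith-Waterman. The name `local_align` and the scores (match = 2, mismatch = -1, gap = -1) are assumptions for illustration. The two changes that make it local are that cell scores are never allowed to drop below zero and that the traceback starts at the highest-scoring cell rather than at the corner.

```python
def local_align(a, b, match=2, mismatch=-1, gap=-1):
    """Local alignment of two strings with a linear gap penalty (illustrative sketch)."""
    n, m = len(a), len(b)

    def sub(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    # First row and column stay at 0: an alignment may start anywhere.
    score = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                0,                                # start a fresh local alignment
                score[i - 1][j - 1] + sub(i, j),
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until a zero score is reached.
    out_a, out_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        if score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1

    return best, "".join(reversed(out_a)), "".join(reversed(out_b))
```

For example, `local_align("xxxACGTyyy", "zzACGTww")` picks out the shared `ACGT` region and ignores the unrelated flanking characters, which a global alignment would be forced to account for.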

Three main operations are generally allowed in sequence alignment. It's easiest to think of these operations as things that might have happened to one of the sequences to turn it into the other sequence:

  • Insertion: An element is inserted into one of the sequences. This is generally represented by adding a gap to the opposite sequence.
  • Deletion: An element is removed from one of the sequences. This is generally represented by adding a gap to that sequence.
  • Mutation/Substitution: An element is replaced with a different element.
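
The three operations can be read directly off a finished alignment. The Python sketch below labels each column of a gapped alignment, under the assumption that the operations describe how the first sequence was turned into the second and that `-` is the gap character. The function name `describe_columns` is hypothetical.

```python
def describe_columns(aligned_a, aligned_b, gap="-"):
    """Label each alignment column as an edit that turns sequence a into sequence b."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")

    ops = []
    for x, y in zip(aligned_a, aligned_b):
        if x == gap and y == gap:
            raise ValueError("a column cannot contain two gaps")
        if x == gap:
            ops.append("insertion")     # element exists only in b: gap in a
        elif y == gap:
            ops.append("deletion")      # element exists only in a: gap in b
        elif x == y:
            ops.append("match")
        else:
            ops.append("substitution")
    return ops
```

For the alignment `ATGCTA` over `ATG-TA`, the fourth column is reported as a deletion and the others as matches. Swapping the two arguments turns that deletion into an insertion, which is why the two are often treated together as "indels".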

Each of these operations has a cost associated with it to reflect how likely it is to have happened to the original sequence. Mutation/Substitution generally has a different cost for different substitutions (often taken from a BLOSUM or PAM matrix in bioinformatics). Insertion and deletion are accounted for with some sort of gap penalty. In simple implementations, this penalty is often a constant cost per gap position, but in bioinformatics an affine gap penalty is often more appropriate.
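
To make the difference between the two gap penalties concrete, here is a small Python sketch. The function names and the numeric defaults (an opening cost of 10 and an extension cost of 0.5) are assumptions for illustration. The affine form used here charges the opening cost for the first gap position and the extension cost for each further position; some tools instead charge the extension cost for every position including the first.

```python
import re


def linear_gap_cost(length, per_position=1.0):
    """Constant cost for every gap position."""
    return per_position * length


def affine_gap_cost(length, gap_open=10.0, gap_extend=0.5):
    """Opening cost for the first gap position, smaller cost for each extension."""
    if length <= 0:
        return 0.0
    return gap_open + gap_extend * (length - 1)


def total_gap_cost(aligned, cost_fn, gap="-"):
    """Sum cost_fn over every maximal run of gap characters in one aligned sequence."""
    runs = re.findall(re.escape(gap) + "+", aligned)
    return sum(cost_fn(len(run)) for run in runs)
```

Under these assumed values, one gap of length 3 costs 11.0 with the affine penalty, while three separate gaps of length 1 cost 30.0. The linear penalty charges 3.0 in both cases, so it cannot express a preference for a single long indel over several scattered ones.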

Multiple Sequence Alignment: Dynamic programming quickly becomes computationally expensive as the number of sequences being aligned increases. For this reason, multiple sequence alignment algorithms generally do not guarantee optimality. A variety of techniques are used:

  • Progressive alignment: In this technique, a series of pairwise alignments are used to create an overall multi-way alignment. Often the order in which these alignments are performed is determined by a hierarchical clustering algorithm like neighbor-joining or UPGMA. A number of tools exist for performing such alignments in bioinformatics, such as the Clustal family and T-Coffee.
  • Heuristic approaches: A wide variety of heuristics can be used for very large scale multiple sequence alignment. BLAST is by far the most popular tool for this in the case of bioinformatics.
  • Hidden Markov models: HMMs can be used to find the most likely alignments for a set of sequences. HMMER is a popular bioinformatics tool for this approach.
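
As a back-of-the-envelope check on why exact dynamic programming does not scale to many sequences, the Python sketch below counts the cells in the table a straightforward generalization of the pairwise method would need, along with the number of neighbouring cells each one depends on. Both function names are hypothetical, and the count assumes a full table with no pruning.

```python
def dp_cells(lengths):
    """Number of cells in a full DP table for sequences of the given lengths."""
    cells = 1
    for n in lengths:
        cells *= n + 1  # one extra row/column per dimension for the empty prefix
    return cells


def predecessors_per_cell(k):
    """Each cell looks back along every non-empty subset of the k dimensions."""
    return 2 ** k - 1


if __name__ == "__main__":
    for k in range(2, 7):
        print(k, dp_cells([100] * k), predecessors_per_cell(k))
```

For two sequences of length 100 this is 10,201 cells with 3 predecessors each (the familiar diagonal, up, and left moves). For three sequences it is already 1,030,301 cells with 7 predecessors each, and the table grows by roughly another factor of 100 with every additional sequence.
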
131 questions
2
votes
2 answers

Print sequence alignment to file

I do a simple pairwise DNA sequence alignment with pairwiseAlignment from the Biostrings package in Bioconductor: library('Biostrings') seq1 = 'ATGCTA' seq2 = 'ATGTA' pairwiseAlignment(pattern = seq1, subject = seq2) The output looks as…
Martin Preusse
1
vote
1 answer

difficult to interpret .xml result file from Biopython ncbi blast function

I wanted to do the sequence search 'CCTTCATTCTTCTGTATTGGAGACTTACAGTTGGCACAAGGCTTGGAGTT' against the pig nucleotide genome sequences and see if I can find the perfect match in the alignment. I used Biopython to access NCBI BLAST and fetch…
Krazykroz
1
vote
1 answer

How to align read to two SHORT reference sequences and see percentage that mapped to one or the other reference?

I have PCR-Amplified fastq files of a specific target region from several samples. For each sample, I want to know the percentage of reads that align better to reference sequence #1 or #2 posted below. How should I begin to tackle this question and…
1
vote
0 answers

MuscleCommandLine non-zero return code 1/is not recognized as an internal or external command,

I am trying to align 4 different sequences using MuscleCommandLine. This code works perfectly on Anaconda and Mac but I am trying to make it work on Windows and I am having several issues. muscle_exe = r'../muscle3.8.31_i86darwin64.exe' in_file =…
1
vote
0 answers

Sequence alignment obtaining all sequences

I've been trying to do some sequence alignment with the following sequences: ggaatggmeeff gatge I'm finding it difficult to understand how you determine the final alignment sequences. I am unsure if this is correct. Been reading up online regarding…
Krellex
1
vote
1 answer

Why is the output file from Biopython not found?

I work with a Mac. I have been trying to make a multiple sequence alignment in Python using Muscle. This is the code I have been running: from Bio.Align.Applications import MuscleCommandline cline = MuscleCommandline(input="testunaligned.fasta",…
kelsey
1
vote
0 answers

How to get differences between amino acid sequences in java?

I would like to get differences between 2 amino acid sequences after alignment For example from: target: LTTYEYLDDCRDDEE query: LATYYYLDDCRDDEE I would like to know amino acid changes and where it occurs. Here is my code: //alignment try…
vmicrobio
1
vote
0 answers

Semi Global Alignment using BioPython

Can we change global alignment using Pairwise2 in BioPython into semi-global alignment using arguments? If so, can you give an example?
1
vote
1 answer

Split fastq read into 10G mini files, assembler not accepting as fastq format

I split a 52G fastq file into 10G chunks with the following code: split -b 10G /home/bilalm/H_glaber_quality_filtering/AfterQC/good_reads/SRR530529.good.fq outputfile This produced the following files: -rw-rw-r-- 1 bilalm bilalm 10G Aug 11 13:48…
Billy
1
vote
0 answers

spliced and unspliced sequence alignment using STAR

I am working with single cell sequencing data, and want to run this through RNA velocity (https://www.nature.com/articles/s41586-018-0414-6). For that, I need to map both spliced and unspliced reads. The dataset I am working with is a SMARTseq…
Leon
1
vote
0 answers

How to build an empirical codon substitution matrix from a multiple sequence alignment

I have been trying to build an empirical codon substitution matrix given a multiple sequence alignment in fasta format using Biopython. It appears to be relatively straightforward for single nucleotide substitution matrices using the AlignInfo…
1
vote
1 answer

Align an array of traces in python

I have an array of traces that look like this: a really small low part, then a big high part, ending with a low part again. I want to be able to align all those traces ... as close as I can (so the changes from low to high and the opposite will…
KHSX
1
vote
1 answer

How to Create multiple sequence alignments with fasta files rather than strings of protein sequences in biopython

I want to be able to write multiple sequence alignments using files I have downloaded in the same directory as my script. However in the Biopython Cookbook, the only way this is shown is via writing out strings rather than loading files. I would…
1
vote
1 answer

Construct a suffix tree of a concatenation of a million words and query it with a test set to find the closest match and classify

The problem I'm trying to solve: I have a million words (multiple languages) and some classes that they classify into as my training corpora. Given the testing corpora of words (which is bound to increase in number over time) I want to get the…
1
vote
2 answers

choosing one to one result by a similarity matrix

I build a function, that finds some alignment by some metric. It gets a matrix with already computed similarity values: weighted_res may be: [[0.2, 0.5, 0.3], [0.1, 0.2, 0.4], [0.8, 0.2, 0.4], [0.1, 0.2, 0.7], [0.1, 0.2, 0.4], My function…
Prodiction