Questions tagged [dna-sequence]

A string representing the nucleotide sequence of the deoxyribonucleic acid, the molecule that holds the genes that constitute the genetic code.

Deoxyribonucleic acid (DNA) contains the genetic instructions specifying the biological development of all cellular life. DNA consists of two long polymers of simple units called nucleotides.

DNA single chain sequences are commonly represented as a string of uppercase letters that correspond to the nucleotide units in the sequence (A, G, C, T). More seldom, ambiquity codes are also used to specify that several alternative nucleotides are possible in the given position (R - A or G, Y - C or T, see complete table.

A great amount of work in bioinformatics is related with the analysis and comparison of these strings. DNA sequences may be very long or they sets may get very large (gigabytes).

Related tags:

475 questions
6
votes
1 answer

Get position of subsequence using Levenshtein-Distance

I have a huge number of records containing sequences ('ATCGTGTGCATCAGTTTCGA...'), up to 500 characters. I also have a list of smaller sequences, usually 10-20 characters. I would like to use the Levenshtein distance in order to find these smaller…
biojl
  • 1,060
  • 1
  • 8
  • 26
6
votes
1 answer

Getting all string combinations by given maximal Hamming distance (number of mismatches) in Java

Is there an algorithm go generate all possible string combinations of a string (DNA Sequence) by a given number of maximal allowed positions that can variate (maximal Mismatches, maximal Hamming distance)? The alphabet is {A,C,T,G}. Example for a…
6
votes
10 answers

Are there any existing solutions for creating a generic DNA sequence database with a website front end?

I'd like to create an rRNA sequence database with a web front end for the lab I work in. It seems common in biology to want to search a large number of sequences using alignment algorithms such as BLAST and HMMER, so I wondered if there is any…
Michael Barton
  • 8,868
  • 9
  • 35
  • 43
6
votes
2 answers

cluster short, homogeneous strings (DNA) according to common sub-patterns and extract consensus of classes

Task: to cluster a large pool of short DNA fragments in classes that share common sub-sequence-patterns and find the consensus sequence of each class. Pool: ca. 300 sequence fragments 8 - 20 letters per fragment 4 possible letters: a,g,t,c …
SimonSalman
  • 351
  • 1
  • 6
  • 13
5
votes
1 answer

Unexpected output in randomized motif search in DNA strings

I have the following t=5 DNA strings: DNA = '''CGCCCCTCTCGGGGGTGTTCAGTAAACGGCCA GGGCGAGGTATGTGTAAGTGCCAAGGTGCCAG TAGTACCGAGACCGAAAGAAGTATACAGGCGT TAGATCAAGTTTCAGGTGCACGTCGGTGAACC AATCCACCAGCTCCACGTGCAATGTTGGCCTA''' k = 8 t = 5 I'm trying to find…
slap-a-da-bias
  • 376
  • 1
  • 6
  • 25
5
votes
5 answers

How can we compress DNA string efficiently

DNA strings can be of any length comprising any combination of the 5 alphabets (A, T, G, C, N). What could be the efficient way of compressing DNA string of alphabet comprising 5 alphabets (A, T, G, C, N). Instead of considering 3 bits per alphabet,…
5
votes
1 answer

Get span and match values from SRE_Match objects(Python)

I am trying to find a match of certain length (range 4 to 12) in a DNA sequence Below is the code: import re positions =[] for i in range(4,12): for j in range(len(dna)- i+1): positions.append(re.search(dna[j:j+i],comp_dna)) #Remove…
Kian
  • 53
  • 5
5
votes
2 answers

Find length of overlap in strings

do you know any ready-to-use method to obtain length and also overlap of two strings? However only with R, maybe something from stringr? I was looking here, unfortunately without succes. str1 <- 'ABCDE' str2 <- 'CDEFG' str_overlap(str1,…
Adamm
  • 2,150
  • 22
  • 30
5
votes
1 answer

Fastest way to add Ns to variable length sequences such that they all equal 150bp

Say I have a fasta containing 3 sequences... ATTTTTGGA AT A I want my sequence data to look like this: ATTTTTGGA ATTNNNNNN ANNNNNNNN Are there any programs or scripts that could accomplish this in a reasonable timeframe. I have thousands of…
user3105519
  • 309
  • 4
  • 10
5
votes
3 answers

Use Python to extract Branch Lengths from Newick Format

I have a list in python consisting of one item which is a tree written in Newick Format, as…
PaulBarr
  • 919
  • 6
  • 19
  • 33
5
votes
5 answers

Reverse complement DNA

I have this equation for reverse complementing DNA in python: def complement(s): basecomplement = {'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A'} letters = list(s) letters = [basecomplement[base] for base in letters] return…
Michael Ridley
  • 53
  • 1
  • 1
  • 8
4
votes
2 answers

Quantifying frequency of codons in a transmembrane sequence - apply function?

I am trying to look at the codon usage within the transmembrane domains of certain proteins. To do this, I have the sequences for the TM domain, and I want to search these sequences for how often certain codons appear (the frequency). Ideally I…
cambio
  • 43
  • 3
4
votes
3 answers

How to catch the longest sequence of a group

The task is to find the longest sequence of a group for instance, given DNA sequence: "AGATCAGATCTTTTTTCTAATGTCTAGGATATATCAGATCAGATCAGATCAGATCAGATC" and it has 7 occurrences of AGATC. (AGATC) matches all occurrences. Is it possible to write a…
hamvee
  • 141
  • 3
  • 13
4
votes
3 answers

How to find indices of identical sub-sequences in two strings in Ruby?

Here each instance of the class DNA corresponds to a string such as 'GCCCAC'. Arrays of substrings containing k-mers can be constructed from these strings. For this string there are 1-mers, 2-mers, 3-mers, 4-mers, 5-mers and one 6-mer: 6 1-mers:…
kujak al
  • 45
  • 3
4
votes
2 answers

Making a list of all mutations of a sequence (DNA)

I have a DNA sequence, and I want to find all instances of it, or any of its possible mutations in a list of DNA sequence reads. I am using grepl to do this, since it is faster than matchPattern in the instance I am using it. I use parLapply to feed…
Joe Kowzun
  • 45
  • 8
1
2
3
31 32