Questions tagged [dna-sequence]

A string representing the nucleotide sequence of deoxyribonucleic acid (DNA), the molecule that carries an organism's genes.

Deoxyribonucleic acid (DNA) contains the genetic instructions specifying the biological development of all cellular life. DNA consists of two long polymers of simple units called nucleotides.

Single-strand DNA sequences are commonly represented as a string of uppercase letters that correspond to the nucleotide units in the sequence (A, G, C, T). Less commonly, ambiguity codes are also used to specify that several alternative nucleotides are possible at a given position (for example, R for A or G, Y for C or T; see the complete table of codes).
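
As an illustration of how ambiguity codes work, here is a minimal Python sketch that checks whether a concrete sequence is one of the sequences an ambiguous pattern allows. The mapping follows the standard IUPAC nucleotide codes; the names `IUPAC` and `matches` are made up for this example and do not come from any particular library.

```python
# Standard IUPAC nucleotide codes mapped to the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}


def matches(pattern: str, sequence: str) -> bool:
    """Return True if `sequence` is one of the sequences `pattern` allows.

    `pattern` may contain ambiguity codes; `sequence` is expected to
    contain only A, C, G and T (an assumption of this sketch).
    """
    if len(pattern) != len(sequence):
        return False
    return all(
        base in IUPAC[code]
        for code, base in zip(pattern.upper(), sequence.upper())
    )
```

For instance, `matches("ARY", "AGC")` is true because R allows G and Y allows C, while `matches("ARY", "ACC")` is false because R does not allow C.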

A great deal of work in bioinformatics involves the analysis and comparison of these strings. Individual DNA sequences may be very long, and sets of sequences may get very large (gigabytes).

475 questions
1
vote
0 answers

How do I download a large number of GenBank sequences using entrez_fetch in R?

I am trying to download sequence data from 1283 records in GenBank using rentrez. I'm using the following code, first to search for records fitting my criteria, then linking across databases, and finally fetching the sequence data: # Search for…
1
vote
1 answer

Rentrez is pulling the wrong data from NCBI in R?

I am trying to download sequence data from E. coli samples within the state of Washington - it's about 1283 sequences, which I know is a lot. The problem that I am running into is that entrez_search and/or entrez_fetch seem to be pulling the wrong…
1
vote
2 answers

Mark positions of a string in a list

I have two lists, one holds nucleotide values nucleotides = ['A', 'G', 'C', 'T', 'A', 'G', 'G', 'A', 'G', 'C'] second one holds true(1) or false(0) values for every letter to indicate that they are covered or not. flag_map = [0, 0, 0, 0, 0, 0, 0,…
neodeep
  • 15
  • 5
1
vote
2 answers

Read Clustal file in Python

I have a multiple sequence alignment (MSA) file derived from mafft in clustal format which I want to import into Python and save into a PDF file. I need to import the file and then highlight some specific words. I've tried to simply import the pdf…
1
vote
3 answers

Create a new variable instance each time I split a string in Python

I have a string in a variable x that includes ">" symbols. I would like to create a new variable each time the string is split at the ">" symbol. The string I have in the variable x is as such (imported from a simple .txt…
d.cio
  • 62
  • 6
1
vote
2 answers

How do I write an algorithm to find genes in a large string

I'm writing a program to find genes in a large string of DNA. My output is correct on small input strings of DNA, but when I test it on their example DNA string (which is very large—too large to check manually if my output is correct) it says that…
1
vote
1 answer

Automated introduction of mutation at a specific base in a DNA sequence

I am looking for a way to change A ->T and G ->C and vice versa at the 11th base in a 30-base DNA sequence. I have tried to use the Replace function in Excel but I couldn't work out how to make it conditional i.e. if it is A change it to T and so…
Saps
  • 21
  • 3
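
The question above asks about Excel; as a hedged alternative, the same conditional swap can be sketched in Python. The function name `swap_base` and the lookup table `PARTNER` are hypothetical names chosen for this example, and the sketch assumes positions are counted from 1, as in the question.

```python
# Partner base for each nucleotide: A <-> T and G <-> C.
PARTNER = {"A": "T", "T": "A", "G": "C", "C": "G"}


def swap_base(sequence: str, position: int = 11) -> str:
    """Replace the base at 1-based `position` with its partner base.

    Characters other than A, T, G and C are left unchanged.
    """
    index = position - 1  # convert to Python's 0-based indexing
    base = sequence[index]
    return sequence[:index] + PARTNER.get(base, base) + sequence[index + 1:]
```

Applied to a 30-base sequence, `swap_base(seq)` changes only the 11th base and leaves the other 29 untouched.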
1
vote
3 answers

Simplify and Improve for Multi-If-Statement

I am trying to randomly generate multiple short 5 base-pair DNA sequences. Among them, I want to pick the sequences that meet the following conditions: If the first letter is A then the last letter cannot be T If the first letter is T then the last…
indigo
  • 23
  • 5
1
vote
0 answers

Faster algorithm for lexicographic comparison of DNA strings

I'm trying to find a faster way to do the following: Given a list of DNA strings x = ([s1, s2, s3, s4...]) (where the strings can only consist of the letters 'A', 'T', 'C', and 'G') and a list of index pairs y = ([[i, j], [i, j], [i, j]....]) find a…
1
vote
0 answers

Which machine learning methods can I use to predict DNA Sequences?

I have a dataset of DNA sequences related to Covid-19 and I simply want to predict possible future sequences based on the existing sequences. DNA sequences consist of 4 letters and 4 letters only: A, G, T and C. So a chunk of a sequence would look…
D3WYAN
  • 37
  • 3
1
vote
1 answer

Find the position of a SNP in the gene list

I have SNP data and gene list data. I am looking for the positions of the SNPs contained in the gene list data when I compare the two. For example: The SNP data: Pos_start pos_end 14185 14185 .... ..... The gene list data:…
Phan
  • 47
  • 1
  • 6
1
vote
1 answer

A PWM with gapped alignments in Biopython

I'm trying to generate a Position-Weighted Matrix (PWM) in Biopython from Clustalw multiple sequence alignments. I get a "Wrong Alphabet" error every time I do it with gapped alignments. From reading the documentation, I think I need to utilize…
1
vote
1 answer

How to use lists and loops to count the occurrences of dinucleotide pairs?

I have a DNA text file and I need to specifically use lists and loops to count the occurrences of dinucleotide pairs (ex: AA, AC, AT, AG, CA, CC... etc) then use lists and loops again to print the counts to a new text file as a table with two…
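
A minimal sketch of the counting step, using only lists and loops as the question requires. It assumes overlapping pairs are counted (so "AAA" contains "AA" twice) and that the sequence has already been read into a string; the function name `count_dinucleotides` is hypothetical, and writing the table to a file is left out.

```python
BASES = ["A", "C", "G", "T"]


def count_dinucleotides(sequence):
    """Count overlapping dinucleotide pairs with parallel lists.

    Returns two lists of equal length: the 16 possible pairs and the
    number of times each pair occurs in `sequence`.
    """
    # Build the list of all 16 possible pairs: AA, AC, AG, AT, CA, ...
    pairs = []
    for first in BASES:
        for second in BASES:
            pairs.append(first + second)

    counts = [0] * len(pairs)

    # Slide a window of width 2 along the sequence.
    for i in range(len(sequence) - 1):
        pair = sequence[i:i + 2]
        if pair in pairs:  # skip anything that is not a plain A/C/G/T pair
            counts[pairs.index(pair)] += 1

    return pairs, counts
```

The two returned lists can then be looped over together, for example with `for pair, count in zip(pairs, counts)`, to print one table row per pair.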
1
vote
1 answer

Transformation of a CSV file with DNA sequences to FASTA format with RStudio and Biostrings

I have a CSV file with DNA sequences. The file has 4 columns: the name of the chromosome, the start and end of the sequence, and the strand (missing or +). I want to transform this file into FASTA format with RStudio and with the tool of…
1
vote
1 answer

Python: build consensus sequence

I want to build a consensus sequence from several sequences in python and I'm looking for the most efficient / most pythonic way to achieve this. I have a list of strings like this: sequences = ["ACTAG", "-TTCG", "CTTAG"] I furthermore have an…