Questions tagged [fasta]

FASTA is a software package for sequence alignment of proteins and nucleic acids. FASTA is also the name of the file format used by these programs to represent sequences of peptides or nucleotides. The format is a de facto standard in bioinformatics.

The FASTA format (read as "fast A format") is a text-based format, originally used by the FASTA software, for representing nucleic acid and protein sequences. Each nucleotide or amino acid is represented by a single letter. The format also allows each sequence to be given a name.

The format achieved great popularity, becoming the de facto standard for representing biological sequences.

A record in FASTA format consists of a header (comment) line followed by one or more lines of sequence data (one letter per nucleotide or amino acid). Header lines begin with >. The sequence that follows is usually wrapped at a fixed width (often 60 characters, and generally no more than 80).

> Sample nucleotide sequence
AGCACTGAGTAACGTATAAGCAGTCCCCGGACGCGTA
> Nucleotide sequence #2
GCCACGGGAGTTGAAGAACATCGAGAATGCCACTAGTTTTCACCCTTCATAGATATCCTA
GCGCCGTACATGTATACGAGATCTTTGTCACGCAGTATGGAGGATTGTGGCCAGCAATAC
GTCGTGTCCCGCAATGCTTCATTAGATCCCCGTATATCCATCCTGAGTCATTGTCTGTTG
TCCGTTTTGAAGGAGTCTAGCAGCTTGATA
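
The record layout described above can be read one record at a time, which keeps memory use low for large files. The following is a minimal sketch, not a reference implementation: the names `read_fasta` and `wrap_sequence` are hypothetical, the default width of 60 is taken from the description above, and the parser assumes plain headers starting with > and ignores blank lines.

```python
from typing import Iterable, Iterator, List, Optional, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs, one record at a time.

    Hypothetical helper: assumes every record starts with a '>' line
    and that wrapped sequence lines should simply be joined together.
    """
    header: Optional[str] = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                # A new header closes the previous record.
                yield header, "".join(chunks)
            header = line[1:].strip()
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        # Emit the final record, which has no following header.
        yield header, "".join(chunks)


def wrap_sequence(sequence: str, width: int = 60) -> str:
    """Wrap a sequence at a fixed width (60 here, as an assumed default)."""
    return "\n".join(
        sequence[i:i + width] for i in range(0, len(sequence), width)
    )


# The sample records shown above, as an in-memory list of lines.
example = """\
> Sample nucleotide sequence
AGCACTGAGTAACGTATAAGCAGTCCCCGGACGCGTA
> Nucleotide sequence #2
GCCACGGGAGTTGAAGAACATCGAGAATGCCACTAGTTTTCACCCTTCATAGATATCCTA
GCGCCGTACATGTATACGAGATCTTTGTCACGCAGTATGGAGGATTGTGGCCAGCAATAC
GTCGTGTCCCGCAATGCTTCATTAGATCCCCGTATATCCATCCTGAGTCATTGTCTGTTG
TCCGTTTTGAAGGAGTCTAGCAGCTTGATA
"""

records = list(read_fasta(example.splitlines()))
for name, seq in records:
    print(f">{name}")
    print(wrap_sequence(seq))
```

Because `read_fasta` is a generator, an open file object can be passed in place of the list of lines, so only one record is held in memory at a time.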
921 questions
-2
votes
2 answers

Parsing huge FASTA file

I have a huge FASTA file, and I want to extract the sequences that contain Homo sapiens. There are approaches such as a dictionary or a list that we could use to get the results, but because of the huge file size we cannot hold everything in memory. We have to write the…
-2
votes
1 answer

IOError while retrieving sequences from fasta file using biopython

I have a FASTA file containing papillomavirus sequences (entire genomes, partial CDS, ....) and I'm using Biopython to retrieve the entire genomes (around 7 kb) from this file, so here's my code: rec_dict =…
-2
votes
2 answers

Adding sequence from FASTA file using Perl

I'm still learning Perl and I have a program which is able to take a FASTA file sequence header and print only the species name within square brackets. I want to add to this code to have it also print the entire sequence associated with the…
Elle
  • 97
  • 2
  • 6
  • 14
-2
votes
1 answer

Merge two large textfiles while comparing numbers

I want to make a large text file from data in two large text files (around 2 or 3 GB), using Java. I have to merge these two files into one, while comparing numbers in those text files. One file contains information such as this: chr1 100 200 …
-2
votes
1 answer

stockholm to fasta format - include accession id in every header

Hello, I have multiple sequences in Stockholm format. At the top of every alignment there is an accession ID, for example '#=GF AC PF00406', and '//' marks the end of the alignment. When I'm converting the Stockholm format to FASTA format I need…
Bionerd
  • 21
  • 1
  • 6
-2
votes
1 answer

Python code to read first 14 characters, uniquefy based on them, and parse duplicates

I have a list of more than 10k strings that look like different versions of this (HN5ML6A02FL4UI_3 [14 numbers or letters_1-6]), where some are duplicates except for the _1 to _6. I am trying to find a way to list these and remove the duplicate…
hdliv
  • 1
  • 2
-2
votes
1 answer

how can I extract fasta from gff file based genome fasta, then merge fasta under one transcript to output

Thanks for your help. I want to extract the FASTA sequence of a specific intron, then merge the intron FASTA with the CDS FASTA to output my specific transcript. How can I do this with Biopython or Python? My GFF file, for example: 1 ensembl intron 7904 9192 . -…
Hailong Yang
  • 1
  • 1
  • 3
-2
votes
1 answer

Converting multifasta parser from Python to C#

I am trying to convert a multi fasta parser from Python to C#. For the input >header1 ACTG GCTA >header2 GATTACA it would return the dictionary {'header2': 'GATTACA', 'header1': 'ACTGGCTA'} The original Python code looks like: def…
BioGeek
  • 21,897
  • 23
  • 83
  • 145
-2
votes
5 answers

Merge two lines generated from contigs.fa

I have a file generated by assemblers. It looks like…
-3
votes
1 answer

strip() & rstrip() not removing newline from elements in list

I want to remove all of the trailing whitespace from each element in a list, but strip() and rstrip() don't seem to be doing the trick. Each element in the list looks like this, but with many more lines of nucleotide…
-3
votes
2 answers

creating a dictionary and count bases from a multifasta file in Python

To solve this problem I have used the BioPython library. Nevertheless, I would like to learn programming, and therefore I don't want to use the BioPython library. I have one FASTA file that contains the following DNA…
David
  • 45
  • 2
  • 9
-3
votes
2 answers

How to read in specific lines from file

I have a FASTA file, it looks like this: I want this: sequence1: ATGCACCGT sequence2: GACCTAGCA as a result. How can I do it? Edit: I'll try to reformulate it, so I have a (FASTA) file with multiple rows. Some rows have a special character (>) as…
-3
votes
1 answer

Python divide Fasta file

I have a file containing approximately 40 000 FASTA sequences. I would like to split this file into 4 files, each containing 10 000 FASTA sequences. How can I do that? Feedback is appreciated. Thanks. jd
SigneMaten
  • 493
  • 1
  • 6
  • 13
-4
votes
1 answer

Counting nucleotide frequency using perl script

I have this Perl script below to calculate sequence lengths and their frequency along with nucleotide frequency (A, T, G and C). This script works fine for a file with a large number of sequences, but it does not give the right result for a file of small…
MAPK
  • 5,635
  • 4
  • 37
  • 88
-4
votes
1 answer

python script : sequence identifier and number of possible sequences

I need to work with Python for a school project, but I really don't know how to get started. The question is: A FASTA file contains a number of DNA sequences. Unfortunately, some of the symbols are ambiguous. The encoding is IUPAC…