Questions tagged [fasta]

FASTA is a software package for sequence alignment of proteins and nucleic acids. FASTA is also the name of the file format used by these programs to represent sequences of peptides or nucleotides. The format is a de facto standard in bioinformatics.

The FASTA format (read as "fast A") is a text-based format, originally used by the FASTA software, for representing nucleic acid and protein sequences. Each nucleotide or amino acid is represented by a single letter. The format also allows each sequence to be named.

The format achieved great popularity, becoming the de facto standard for representing biological sequences.

A record in FASTA format consists of a header (comment) line followed by one or more lines of sequence data (one letter per nucleotide or amino acid). Header lines begin with ">". The sequence that follows is usually wrapped at a fixed width (often 60 characters, and generally no more than 80).

> Sample nucleotide sequence
AGCACTGAGTAACGTATAAGCAGTCCCCGGACGCGTA
> Nucleotide sequence #2
GCCACGGGAGTTGAAGAACATCGAGAATGCCACTAGTTTTCACCCTTCATAGATATCCTA
GCGCCGTACATGTATACGAGATCTTTGTCACGCAGTATGGAGGATTGTGGCCAGCAATAC
GTCGTGTCCCGCAATGCTTCATTAGATCCCCGTATATCCATCCTGAGTCATTGTCTGTTG
TCCGTTTTGAAGGAGTCTAGCAGCTTGATA
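
The record layout described above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function names `read_fasta` and `format_fasta` are hypothetical, the default wrap width of 60 is an assumption taken from the description above, and the sketch assumes plain-text input with no legacy ";" comment lines. For real work, an established parser such as Biopython's `SeqIO.parse()` is usually the safer choice.

```python
import io
from typing import Iterator, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from an open FASTA text stream.

    Hypothetical helper: the header is returned without the leading ">",
    and wrapped sequence lines are joined into one string.
    """
    header = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)  # emit the previous record
            header = line[1:].strip()
            chunks = []
        elif header is not None:
            chunks.append(line)  # sequence data belonging to the current header
    if header is not None:
        yield header, "".join(chunks)  # emit the final record


def format_fasta(header: str, sequence: str, width: int = 60) -> str:
    """Return one FASTA record with the sequence wrapped at `width` columns."""
    lines = [">" + header]
    lines.extend(sequence[i:i + width] for i in range(0, len(sequence), width))
    return "\n".join(lines) + "\n"


# The two sample records shown above, used here as in-memory test input.
SAMPLE = """\
> Sample nucleotide sequence
AGCACTGAGTAACGTATAAGCAGTCCCCGGACGCGTA
> Nucleotide sequence #2
GCCACGGGAGTTGAAGAACATCGAGAATGCCACTAGTTTTCACCCTTCATAGATATCCTA
GCGCCGTACATGTATACGAGATCTTTGTCACGCAGTATGGAGGATTGTGGCCAGCAATAC
GTCGTGTCCCGCAATGCTTCATTAGATCCCCGTATATCCATCCTGAGTCATTGTCTGTTG
TCCGTTTTGAAGGAGTCTAGCAGCTTGATA
"""

records = list(read_fasta(io.StringIO(SAMPLE)))
for name, seq in records:
    print(name, len(seq))
```

Reading into (header, sequence) pairs keeps the wrapping out of the data, so sequence length and content can be compared directly; `format_fasta` reintroduces the line breaks only when writing.
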
921 questions
3
votes
4 answers

Filtering a fasta file with sequences that match a certain string in another file

With BLAST I have obtained a file with two tab-separated columns, one with species names and the other with a gene name (the name of the most similar gene in a reference database). My goal is to find in the first file all the species names for which…
MarcD
  • 31
  • 4
3
votes
3 answers

How do I merge two FASTA files (one file with line break) in Perl?

I have the following two FASTA files: file1.fasta >0 GAATAGATGTTTCAAATGTACCAATTTCTTTCGATT >1 GTTAAGTTATATCAAACTAAATATACATACTATAAA >2 GGGGCTGTGGATAAAGATAATTCCGGGTTCGAATAC file2.qual >0 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40…
neversaint
  • 60,904
  • 137
  • 310
  • 477
3
votes
6 answers

Delete lines shorter than a certain length and the one above it (remove short sequences in a FASTA file)

I have a file containing the following text: >seq1 GAAAT >seq2 CATCTCGGGA >seq3 GAC >seq4 ATTCCGTGCC If a line that doesn't start with ">" is shorter than 5 characters, I want to delete it and the one right above it. Expected…
Honorato
  • 111
  • 6
3
votes
3 answers

Appending filename at the end of certain lines in a text file

I am trying to append a file name at the end of certain lines in many files which I am concatenating. short example: INPUTS: filename (1): 1234_contigs.fasta >NODE_STUFF GATTACA filename (2):…
3
votes
3 answers

awk combine info from two files (fasta file header)

I know there are many similar questions, and I had read through many of them. But I still can't make my code work. Could somebody point the problem out for me please? Thanks! (base) $ head Sample.pep2 >M00000032072 gene=G00000025773 seq_id=ChrM…
zzz
  • 153
  • 8
3
votes
1 answer

bioinformatics compressing nucleotide sequences

What would be the recommended compression algorithm (.xz, tar.gz, tar.bz2 and so on) for compressing a dataset consisting of fasta nucleotide sequences? What would be the recommended compression mechanisms for such data? Dictionary based…
Allan K
  • 379
  • 2
  • 13
3
votes
2 answers

Removing lines which match with specific pattern from another file

I've got two files (I only show the beginning of these files)…
Paillou
  • 779
  • 7
  • 16
3
votes
4 answers

How to retrieve sequences from a Fasta file by gene ID

I know this question has been asked a hundred times but I've been at it all day and I can't seem to make this work. I have a fasta file that looks like this ... >BGI_novel_T016697…
Tezie
  • 67
  • 2
  • 6
3
votes
2 answers

Find length of a contig in one fasta, using the header of another fasta as query in python

I'm trying to find a python solution to extract the length of a specific sequence within a fasta file using the full header of the sequence as the query. The full header is stored as a variable earlier in the pipeline (i.e. "CONTIG"). I would like…
Gunther
  • 129
  • 7
3
votes
1 answer

How to remove duplicates from fasta file but keep at least one per group based on header

I have a multi-FASTA file that looks like this: (all sequences are >100 bp, span more than one line, and have the same length…
Xela Vi
  • 113
  • 7
3
votes
1 answer

Pairwise alignment of multi-FASTA file sequences

I have a multi-FASTA file containing more than 10,000 FASTA sequences produced by next-generation sequencing, and I want to do a pairwise alignment of every sequence against every other sequence in the file and store all the results in the same new file in…
Aurora
  • 31
  • 4
3
votes
1 answer

Is there a way to collect many multiline strings delineated by a specific character into an Arraylist using the data stream in Java 8?

I have a fasta file that I want to parse into an ArrayList, each position having an entire sequence. The sequences are multiline strings, and I don't want to include the identification line in the string that I store. My current code splits each…
Sam
  • 33
  • 3
3
votes
8 answers

Remove multiple sequences from fasta file

I have a text file of character sequences that consist of two lines: a header, and the sequence itself in the following line. The structure of the file is as…
3
votes
2 answers

Directly calling SeqIO.parse() in for loop works, but using it separately beforehand doesn't? Why?

In Python, this code, where I directly call the function SeqIO.parse(), runs fine: from Bio import SeqIO a = SeqIO.parse("a.fasta", "fasta") records = list(a) for asq in SeqIO.parse("a.fasta", "fasta"): print("Q") But this, where I first…
3
votes
5 answers

Extract sequence header for a given sequence in fasta file

I have a FASTA file (myfasta.fasta) like this: >aat.2.2344.a ATTGCCGGTTTAATATTA >aat.2.d2344.acc ATTGCCGGTTTAATAAA >aat.2.2bb344.a ATTGCCGGTTTAATAGGAGAGAATT >aat.2.2ccc344.a ATTGCCGGTTTAATAGGGAG >aat.2.2344.acc ATTGCCGGTTTAATAAA I also have a text…
MAPK
  • 5,635
  • 4
  • 37
  • 88