Extract specific fasta sequences from a big fasta file

Question

I want to extract specific fasta sequences from a big fasta file using the following script, but the output is empty.

The transcripts.txt file contains the list transcripts IDs that I want to export (both the IDs and the sequences) from assembly.fasta to selected_transcripts.fasta. For example:

transcripts.txt:

Transcript_00004|5601
Transcript_00005|5352

assembly.fasta:

>Transcript_00004|5601
GATCTGGCGCTGAGCTGGGTGCTGATCGACCCGGCGTCCGGCCGCTCCGTGAACGCCTCGAGTCGGCGCCCGGTGTGCGTTGACCGGAGATCGCGATCTGGGGAGACCGTCGTGCGGTT
>Transcript_00004|5360
CGATCTGGCGCTGAGCTGGGTGCTGATCGACCCGGCGTCCGGCCGCTCCGTGAACGCCTCGAGTCGGCGCCCGGTGTGCGTTGACCGGAGATCGCGATCTGGGGAGACCGTCGTGCGGTT

The IDs are preceded by the > symbol: >Transcripts_00004|5601.

I have to read the assembly.fasta file, if the transcript ID in assembly.fasta is the same of that write in transcripts.txt, I have to write this transcript ID and its sequence in selected_transcripts.fasta. So, in the example above, I have to write only the first transcript.

Any suggestions? Thanks.

from Bio import SeqIO

my_list = [line.split(',') for line in open("/home/universita/transcripts.txt")]

fin = open('/home/universita/assembly.fasta', 'r')
fout = open('/home/universita/selected_transcripts.fasta', 'w')

for record in SeqIO.parse(fin,'fasta'):
    for item in my_list:
        if item == record.id:
            fout.write(">" + record.id + "\n")
            fout.write(record.seq + "\n")

fin.close()
fout.close()

Can you [edit] your question and include some of `transcripts.txt` as well as a portion of `assembly.fasta` so we have some data to work with? — MattDMo, May 13 '16 at 20:02
You split your transcripts line after each colon but it is space separated. Is that on purpose? — Maximilian Peters, May 14 '16 at 11:13
Sorry, in the trascripts.txt, each ID is a line of the file. So the first transcript in the example is in the 1st lane, the second transcript is in the 2nd lane and so on. — Chiara E, May 14 '16 at 11:37

Maximilian Peters · Answer 1 · 2016-05-14T11:54:18.583

Based on your examples there a couple of small problems which may explain why you don't get anything. Your transcripts.txt has multiple entries in one line, therefore my_list will have all the items of the first_line in my_line[0], in your loop you iterate through my_list by lines, so your first item will be

['Transcript_00004|5601', 'Transcript_00005|5352']

Also if assembly.fasta has no > in the header line you won't get back any records with IDs and sequences. The following code should take care of those problems, assuming you added > to the headers and the split function is now using space and not colon.

from Bio import SeqIO

my_list = []
with open("transcripts.txt") as transcripts:
    for line in transcripts:
        my_list.extend(line.split(' '))

fin = open('assembly.fasta', 'r')
fout = open('selected_transcripts.fasta', 'w')

for record in SeqIO.parse(fin,'fasta'):
    for item in my_list:
        if item.strip() == record.id:
            fout.write(">" + record.id + "\n")
            fout.write(record.seq + "\n")


fin.close()
fout.close()

Reading of transcript was changed so that all IDs are appended to my_list separately. Also each item is stripped of white space in order to avoid having line breaks in the string when it is compared to the record.id.

Extract specific fasta sequences from a big fasta file

1 Answers1