I want to extract specific fasta sequences from a big fasta file using the following script, but the output is empty.
The transcripts.txt
file contains the list transcripts IDs that I want to export (both the IDs and the sequences) from assembly.fasta
to selected_transcripts.fasta
.
For example:
- transcripts.txt:
Transcript_00004|5601 Transcript_00005|5352
- assembly.fasta:
>Transcript_00004|5601 GATCTGGCGCTGAGCTGGGTGCTGATCGACCCGGCGTCCGGCCGCTCCGTGAACGCCTCGAGTCGGCGCCCGGTGTGCGTTGACCGGAGATCGCGATCTGGGGAGACCGTCGTGCGGTT >Transcript_00004|5360 CGATCTGGCGCTGAGCTGGGTGCTGATCGACCCGGCGTCCGGCCGCTCCGTGAACGCCTCGAGTCGGCGCCCGGTGTGCGTTGACCGGAGATCGCGATCTGGGGAGACCGTCGTGCGGTT
The IDs are preceded by the >
symbol: >Transcripts_00004|5601
.
I have to read the assembly.fasta
file, if the transcript ID in assembly.fasta
is the same of that write in transcripts.txt
, I have to write this transcript ID and its sequence in selected_transcripts.fasta
. So, in the example above, I have to write only the first transcript.
Any suggestions? Thanks.
from Bio import SeqIO
my_list = [line.split(',') for line in open("/home/universita/transcripts.txt")]
fin = open('/home/universita/assembly.fasta', 'r')
fout = open('/home/universita/selected_transcripts.fasta', 'w')
for record in SeqIO.parse(fin,'fasta'):
for item in my_list:
if item == record.id:
fout.write(">" + record.id + "\n")
fout.write(record.seq + "\n")
fin.close()
fout.close()