4

I want to extract specific fasta sequences from a big fasta file using the following script, but the output is empty.

The transcripts.txt file contains the list transcripts IDs that I want to export (both the IDs and the sequences) from assembly.fasta to selected_transcripts.fasta. For example:

  1. transcripts.txt:
    Transcript_00004|5601
    Transcript_00005|5352
  2. assembly.fasta:
    >Transcript_00004|5601
    GATCTGGCGCTGAGCTGGGTGCTGATCGACCCGGCGTCCGGCCGCTCCGTGAACGCCTCGAGTCGGCGCCCGGTGTGCGTTGACCGGAGATCGCGATCTGGGGAGACCGTCGTGCGGTT
    >Transcript_00004|5360
    CGATCTGGCGCTGAGCTGGGTGCTGATCGACCCGGCGTCCGGCCGCTCCGTGAACGCCTCGAGTCGGCGCCCGGTGTGCGTTGACCGGAGATCGCGATCTGGGGAGACCGTCGTGCGGTT
    

The IDs are preceded by the > symbol: >Transcripts_00004|5601.

I have to read the assembly.fasta file, if the transcript ID in assembly.fasta is the same of that write in transcripts.txt, I have to write this transcript ID and its sequence in selected_transcripts.fasta. So, in the example above, I have to write only the first transcript.

Any suggestions? Thanks.

from Bio import SeqIO

my_list = [line.split(',') for line in open("/home/universita/transcripts.txt")]

fin = open('/home/universita/assembly.fasta', 'r')
fout = open('/home/universita/selected_transcripts.fasta', 'w')

for record in SeqIO.parse(fin,'fasta'):
    for item in my_list:
        if item == record.id:
            fout.write(">" + record.id + "\n")
            fout.write(record.seq + "\n")

fin.close()
fout.close()
Markus
  • 385
  • 2
  • 10
Chiara E
  • 107
  • 2
  • 8

1 Answers1

1

Based on your examples there a couple of small problems which may explain why you don't get anything. Your transcripts.txt has multiple entries in one line, therefore my_list will have all the items of the first_line in my_line[0], in your loop you iterate through my_list by lines, so your first item will be

['Transcript_00004|5601', 'Transcript_00005|5352']

Also if assembly.fasta has no > in the header line you won't get back any records with IDs and sequences. The following code should take care of those problems, assuming you added > to the headers and the split function is now using space and not colon.

from Bio import SeqIO

my_list = []
with open("transcripts.txt") as transcripts:
    for line in transcripts:
        my_list.extend(line.split(' '))

fin = open('assembly.fasta', 'r')
fout = open('selected_transcripts.fasta', 'w')

for record in SeqIO.parse(fin,'fasta'):
    for item in my_list:
        if item.strip() == record.id:
            fout.write(">" + record.id + "\n")
            fout.write(record.seq + "\n")


fin.close()
fout.close()

Reading of transcript was changed so that all IDs are appended to my_list separately. Also each item is stripped of white space in order to avoid having line breaks in the string when it is compared to the record.id.

Maximilian Peters
  • 30,348
  • 12
  • 86
  • 99