0

I'm working on a program that lets the user enter a sequence they want to find inside a FASTA file, after which the program shows the description line and the sequence that belongs to it. The FASTA can be found at hugheslab.ccbr.utoronto.ca/supplementary-data/IRC/IRC_representative_cdna.fa.gz, it's approx. 87 MB.

The idea is to first create a list with the location of description lines, which always start with a >. Once you know what are the description lines, you can search for the search_term in the lines between two description lines. This is exactly what is done in the fourth paragraph, this results in a list of 48425 long, here is an idea of what the results are: http://imgur.com/Lxy8hnI

Now the fifth paragraph is meant to search between two description lines, let's take lines 0 and 15 as example, this would be description_list[a] and description_list[a+1] as a = 0 and a+1 = 1, and description_list[0] = 0 and description_list[1] = 15. Between these lines the if-statement searches for the search term, if it finds one it will save description_list[a] into the start_position_list and description_list[a+1] into the stop_position_list, which will be used later on.

So as you can imagine a simple term like 'ATCG' will occur often, which means the start_position_list and stop_position_list will have a lot of duplicates, which will be removed using list(set(start_position_list)) and afterwards sorting them. That way start_position_list[0] and start_position_list[0] will be 0 and 15, like this: https://i.stack.imgur.com/L9oek.jpg, which can then be used as a range for which lines to print out to show the sequence.

Now, of course, the big issue is that line 15, for i in range(description_list[a], description_list[a+1]): will eventually hit the [a+1] while it's already at the maximum length of description_list and therefore will give a list index out of range error, as you see here as well: https://i.stack.imgur.com/bvpeg.jpg

What would be the best solution for this ? It's still necessary to go through all the description lines and I can't come up with a better structure to go through them all ?

file = open("IRC_representative_cdna.fa")
file_list = list(file)

search_term = input("Enter your search term: ")

description_list = []
start_position_list = []
stop_position_list = []

for x in range (0, len(file_list)):
    if ">" in file_list[x]:
        description_list.append(x)

for a in range(0, len(description_list)):
        for i in range(description_list[a], description_list[a+1]):
            if search_term in file_list[i]:
                start_position_list.append(description_list[a])
                stop_position_list.append(description_list[a+1])
KRAD
  • 51
  • 2
  • 10

2 Answers2

2

The way to avoid the subscript out of range error is to shorten the loop. Replace the line

for a in range(0, len(description_list)):

by

for a in range(0, len(description_list)-1):

Also, I think that you can use a list comprehension to build up description_list:

description_list = [x for x in file_list if x.startswith('>')]

in addition to being shorter it is more efficient since it doesn't do a linear search over the entire line when only the starting character is relevant.

John Coleman
  • 51,337
  • 7
  • 54
  • 119
  • Thank you for your comment, looking back on it, it was just a minor thing but sometimes it's easy to overlook where exactly the mistake is. Also thanks for letting me know about list comprehensions, I've been reading into it and it indeed seems a much more useful way ! – KRAD Dec 13 '15 at 22:19
1

Here is a solution that uses the biopython package, thus saving you the headache of parsing interleaved fasta yourself:

from Bio import SeqIO

file = open("IRC_representative_cdna.fa")
search_term = input("Enter your search term: ")

for record in SeqIO.parse(file, "fasta"):
    rec_seq = record.seq
    if search_term in rec-seq:
        print(record.id)
        print(rec-seq)

it wasn't very clear to me what your desired output is, but this code can be changed easily to fit it.

soungalo
  • 1,106
  • 2
  • 19
  • 34
  • Thank you for your answer ! This indeed seems to be an ideal way to avoid all the hassle that comes with searching through a fasta file. – KRAD Dec 13 '15 at 22:22