2

I am new to python and I am trying to figure out how to read a fasta file with multiple sequences and then create a new fasta file containing the reverse compliment of the sequences. The file will look something like:

>homo_sapiens ACGTCAGTACGTACGTCATGACGTACGTACTGACTGACTGACTGACGTACTGACTGACTGACGTACGTACGTACGTACGTACGTACTG

>Canis_lupus CAGTCATGCATGCATGCAGTCATGACGTCAGTCAGTACTGCATGCATGCATGCATGCATGACTGCAGTACTGACGTACTGACGTCATGCATGCAGTCATG

>Pan_troglodytus CATGCATACTGCATGCATGCATCATGCATGCATGCATGCATGCATGCATCATGACTGCAGTCATGCAGTCAGTCATGCATGCATCAT

I am trying to learn how to use for and while loops so if the solution can incorporate one of them it would be preferred.

So far I managed to do it in a very unelegant manner as follows:

file1 = open('/path/to/file', 'r')

for line in file1:
   if line[0] == '>':
      print line.strip() #to capture the title line
   else:
      import re
      seq = line.strip()
      line = re.sub(r'T', r'P', seq)
      seq = line
      line = re.sub(r'A',r'T', seq)
      seq = line
      line = re.sub(r'G', r'R', seq)
      seq = line
      line = re.sub(r'C', r'G', seq)
      seq = line
      line = re.sub(r'P', r'A', seq)
      seq = line
      line = re.sub(r'R', r'C', seq)
      print line[::-1]

file1.close()

This worked but I know there is a better way to iterate through that end part. Any better solutions?

Community
  • 1
  • 1
scooterdude32
  • 65
  • 1
  • 6
  • 3
    Have you tried anything yet that is causing a problem? Please include it if so. I feel like you may have skipped the "Search and Research" step of [asking a good question](http://stackoverflow.com/help/how-to-ask). This question is very open ended and will involve a number of different kinds of functionality. For example, you will have to read the file. Then you will have to process its contents somehow. In its current form, it is probably too broad to be considered on topic. Start with one aspect of your problem like reading the file, and once you solve it, move on to the next. – jpmc26 Mar 03 '14 at 22:28
  • I really just want to know if there is a better way to iterate through the else statement – scooterdude32 Mar 03 '14 at 23:42
  • I know you don't really need my praise, but that makes it a much better question. Thank you for clarifying! – jpmc26 Mar 03 '14 at 23:47

4 Answers4

1

I know you consider this an exercise for yourself, but in case you are interested in using existing facilities, have a look at the Biopython package. Especially if you are going to do more sequence work.

That would allow you to instantiate a sequence with e.g. seq = Seq('GATTACA'). Then, seq.reverse_complement() will give you the reverse complement.

Note that the reverse complement is more than just string reversal, the nucleotide bases need to be replaced with their complementary letter as well.

s.bandara
  • 5,636
  • 1
  • 21
  • 36
  • Can I use biopython with an imported fasta file? or do I have to use something like >seq = Seq('GATTACA'). – scooterdude32 Mar 03 '14 at 22:56
  • You can also read FASTA files, using the `SeqIO` class. You should look at the manual: http://biopython.org/wiki/SeqIO. – s.bandara Mar 03 '14 at 23:08
  • +1 for your "Don't reinvent the wheel" policy. Learning that lesson is a great exercise in programming, if you ask me. ;) – jpmc26 Mar 03 '14 at 23:46
1

Assuming I got you right, would the code below work for you? You could just add the exchanges you want to the dictionary.

d = {'A':'T','C':'G','T':'A','G':'C'}

with open("seqs.fasta", 'r') as in_file:
    for line in in_file:
        if line != '\n': # skip empty lines
            line = line.strip() # Remove new line character (I'm working on windows)
            if line.startswith('>'):
                head = line
            else:
                print head
                print ''.join(d[nuc] for nuc in line[::-1])

Output:

>homo_sapiens
CAGTACGTACGTACGTACGTACGTACGTCAGTCAGTCAGTACGTCAGTCAGTCAGTCAGTACGTACGTCATGACGTACGT
ACTGACGT
>Canis_lupus
CATGACTGCATGCATGACGTCAGTACGTCAGTACTGCAGTCATGCATGCATGCATGCATGCAGTACTGACTGACGTCATG
ACTGCATGCATGCATGACTG
>Pan_troglodytus
ATGATGCATGCATGACTGACTGCATGACTGCAGTCATGATGCATGCATGCATGCATGCATGCATGATGCATGCATGCAGT
ATGCATG
Harpal
  • 12,057
  • 18
  • 61
  • 74
0

Here is a simple example of a string reversal.

Python Code

string = raw_input("Enter a string:")
reverse_string = ""

print "our string is %s" % string
print "our range will be %s\n" % range(0,len(string))

for num in range(0,len(string)):

    offset = len(string) - 1
    reverse_string += string[offset - num]

    print "the num is currently: %d" % num
    print "the offset is currently: %d" % offset
    print "the index is currently: %d" % int(offset - num)
    print "the new string is currently: %s" % reverse_string
    print "-------------------------------"

    offset =- 1

print "\nOur reverse string is: %s" % reverse_string

Added print commands to show you what is happening in the script.

Run it in python and see what happens.

Vasili Syrakis
  • 9,321
  • 1
  • 39
  • 56
0

Usually, to iterate over lines in a text file you use a for loop, because "open" returns a file object which is iterable

>>> f = open('workfile', 'w')
>>> print f
<open file 'workfile', mode 'w' at 80a0960>

There is more about this here

You can also use context manager "with" to open a file. This key statement will close the file object for you, so you will never forget it.

I decided not to include a "for line in f:" statement because you have to read several lines to process one sequence (title, sequence and blank line). If you try to use a for loop with "readline()" you will end up with a ValueError (try :)

So I would use string.translate. This script opens a file named "test" with your example in it:

import string

if __name__ == "__main__":

    file_name = "test"
    translator = string.maketrans("TAGCPR", "PTRGAC")
    with open(file_name, "r") as f:
        while True:
            title = f.readline().strip()
            if not title:  # end of file
                break
            rev_seq = f.readline().strip().translate(translator)[::-1]
            f.readline()  # blank line
            print(title)
            print(rev_seq)

Output (with your example):

>homo_sapiens
RPGTPRGTPRGTPRGTPRGTPRGTPRGTRPGTRPGTRPGTPRGTRPGTRPGTRPGTRPGTPRGTPRGTRPTGPRGTPRGTPRTGPRGT
>Canis_lupus
RPTGPRTGRPTGRPTGPRGTRPGTPRGTRPGTPRTGRPGTRPTGRPTGRPTGRPTGRPTGRPGTPRTGPRTGPRGTRPTGPRTGRPTGRPTGRPTGPRTG
>Pan_troglodytus
PTGPTGRPTGRPTGPRTGPRTGRPTGPRTGRPGTRPTGPTGRPTGRPTGRPTGRPTGRPTGRPTGPTGRPTGRPTGRPGTPTGRPTG
Alberto Megía
  • 2,225
  • 3
  • 23
  • 33