0

I'm a beginner with Python and am playing around with various ways to do the simple task of reverse-complementing a DNA or RNA sequence to learn some string functions etc. My latest method nearly works but for a minor irritant that I can't find an answer to, probably because there is something I am using which I don't understand properly. My function is designed to write a blank file (this works!) and then open a file containing the sequence, loop through it one character at a time writing its reverse complement to the new file. Here's the code:

def func_rev_seq(in_path,out_path):
"""
Read file one character at a time and retrun the reverse complement of each nucleotide to a new file
"""
#  Write a blank file (out_path)
fb = open(out_path,"w")
fb.write("")
fb.close()
#  Dictionary where the key is the nucleotide and the value is its reverse complement
base = {"A":"T", "C":"G", "G":"C", "T":"A", "a":"t", "c":"g", "g":"c", "t":"a", "k":"m", "m":"k", "y":"r", "r":"y", "b":"v", "v":"b", "d":"h", "h":"d", "K":"M", "M":"K", "Y":"R", "R":"Y", "B":"V", "V":"B", "D":"H", "H":"D", "U":"A", "u":"a"} 
#  Open the source file (in_path) as fi
fi=open(in_path,"r")
i = fi.read(1)
#  Loop through the source file one character at a time and write the reverse complement to the output file
while i != "":
    i = fi.read(1)
    if i in base:
        b = base[i]   
    else:
        b = i
    with open(out_path, 'r+') as fo:
        body = fo.read()
        fo.seek(0, 0)
        fo.write(b + body)        
fi.close()
fo.close()

The problem is that when I run the function, the string in the output file is firstly truncated by a single character and secondly is below a blank line which I don't want. screen shot of input and output file examples As I understand it, the seek function with (0, 0) ought to refer to the start of the file, but I may have misunderstood. Any help greatly appreciated, thanks!

jimbat1
  • 11
  • 3

2 Answers2

0

When you put i = fi.read(1), i equaled the first character in the file, but at the beginning of the while loop, you assigned the second character to i with the same statement, without doing anything with the first character. If you want to loop over every character in the file without that problem, it's better to use a for loop. Iterating character by character in reverse is a bit challenging, but this works:

def nucleo_complement(ifilename, ofilename):
    """Reads a file one character at a time and returns the reverse
    complement of each nucleotide."""
    complements = {'A': 'T', 'T': 'A', 'C': 'G', 'G': 'C'}
    ifile = open(ifilename)
    ofile = open(ofilename, 'w')
    for pos in range(ifile.seek(0, 2) + 1, 0, -1):
        ifile.seek(pos - 1)
        char = ifile.read(1)
        ofile.write(complements.get(char.upper(), char))
    ifile.close()
    ofile.close()

seek returns the new file position, and seek(0, 2) goes to the last character in the file. Whenever you invoke read(1), the position in the file advances by one character, so I had to have pos initially equal the position of the last character plus one, and end my loop at the second character instead of the first. For each iteration, I went back a character with ifile.seek(pos - 1'), then read the next (original) character. As a beginner, this example might be a bit much, so if you have any questions, please feel free to ask. Really, all you need to think about is the first two statements in the for loop, and the fact that I had both files open simultaneously.

Isaac Saffold
  • 1,116
  • 1
  • 11
  • 20
  • Thanks Issac, that's a very useful explanation. I shall give that a go and let you know how I get on. thanks. – jimbat1 Jul 26 '17 at 10:05
  • Thanks Issac, I played around with your suggestion and it works with one small modification. Run as it was, it did indeed fix the truncation problem but there was still an additional line at the top of the output file. I tried changing: 'for pos in range(ifile.seek(0, 2) + 1, 0, -1):' to 'for pos in range(ifile.seek(0, 2) , 0, -1):' but this didn't seem to do anything so I went one further and changed it to 'for pos in range(ifile.seek(0, 2) - 1, 0, -1):'.Viola! it worked. – jimbat1 Jul 26 '17 at 19:14
  • Glad to help. And that makes sense; the last character must have been a newline. Didn't consider that. – Isaac Saffold Jul 26 '17 at 19:16
0

This is the working code, thanks to Issac. It solves both of the problems I was experiencing.

def func_rev_seq(in_path,out_path):
    """Read file one character at a time and retrun the reverse complement of each nucleotide to a new file"""

    #  Write a blank file (out_path)
    fb = open(out_path,"w")
    fb.write("")
    fb.close()
    #  Dictionary where the key is the nucleotide and the value is its reverse complement
    base = {"A":"T", "C":"G", "G":"C", "T":"A", "a":"t", "c":"g", "g":"c", "t":"a", "k":"m", "m":"k", "y":"r", "r":"y", "b":"v", "v":"b", "d":"h", "h":"d", "K":"M", "M":"K", "Y":"R", "R":"Y", "B":"V", "V":"B", "D":"H", "H":"D", "U":"A", "u":"a"} 
    fi= open(in_path)
    fo = open(out_path, 'w')

    for pos in range(fi.seek(0, 2) - 1,  0, -1):
        fi.seek(pos - 1)
        b = fi.read(1)
        if b in base:
            fo.write(base.get(b, b))
        else:
            fo.write(b)
    fi.close()
    fo.close()
jimbat1
  • 11
  • 3