I have multiple datasets in Phylip format (shown below) that I would like to convert to Fasta format (also shown below) using this Python code:

for j in range(1, 10):
    inFile = open('/path/to/input_sequence/seqfile_00' +str(j) + '.txt', 'r')
    outFile = open('/path/to/output_sequence/Fasta/seqfile_00' + str(j) +'.txt', 'w')
    inLines = inFile.readlines()
    inFile.close()
    outLines = inLines[1:17]
    for line in outLines:
        if line.startswith('\n'):
            line = line.replace('\n','')
        outFile.write(line.replace('  ',' \n').replace('sequence', '>sequence'))
outFile.close()

This is what my Phylip (input_sequences) look like:

8 1500\n
\n
sequence1  CTGTCCTTG...\n
\n
sequence2  CTGTCGTTG...\n
\n
sequence3  CTGCGTATG...\n
\n
sequence4  CTATGCCTG...\n
\n
sequence5  AGGTGTAAG...\n
\n
sequence6  AGGTGTAAG...\n
\n
sequence7  AAATTCAAA...\n
\n
sequence8  AAGTCCAAA...\n
\n

And this is what I would like my output_sequences (in Fasta format) to look like:

>sequence1 \n
CTGTCCTTGG...\n
>sequence2 \n
CTGTCGTTGG...\n
>sequence3 \n
CTGCGTATGG...\n
>sequence4 \n
CTATGCCTGG...\n
>sequence5 \n
AGGTGTAAGG...\n
>sequence6 \n
AGGTGTAAGA...\n
>sequence7 \n
AAATTCAAAG...\n
>sequence8 \n
AAGTCCAAAA...\n

When I run the above code, I get the correct output for j = 1, but for the remaining values of j (2 to 9) I get this output:

\n
>sequence1 *red inverted question mark*CTGTCCTTGG...\n
>sequence2 *red inverted question mark*CTGTCGTTGG...\n
>sequence3 *red inverted question mark*CTGCGTATGG...\n
>sequence4 *red inverted question mark*CTATGCCTGG...\n
>sequence5 *red inverted question mark*AGGTGTAAGG...\n
>sequence6 *red inverted question mark*AGGTGTAAGA...\n
>sequence7 *red inverted question mark*AAATTCAAAG...\n
>sequence8 *red inverted question mark*AAGTCCAAAA...\n

(... is the continued sequence, and the red inverted question mark is what I see when I show invisibles in TextWrangler.)

I guess the general question, and the reason I am confused, is: how can the code work fine for j = 1 but not for the other values? And how can I solve this issue?

Thanks in advance!

Hia3
  • If you want to skip empty lines, test `if line.strip()`. Also, glob (https://docs.python.org/2/library/glob.html) and BioPython might be useful: http://stackoverflow.com/questions/31265282/how-to-randomly-extract-fasta-sequences-using-python/31265485#31265485 – Padraic Cunningham Mar 29 '16 at 01:57

1 Answer

Use strip() and a bool filter to drop the empty lines:

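with open('filename') as f:
    # Strip surrounding whitespace from each line, then drop the empty ones.
    # list() is needed on Python 3, where filter() returns an iterator.
    lines = list(filter(bool, map(str.strip, f)))

new_list = []

# Skip the Phylip header line (e.g. "8 1500").
for values in lines[1:]:
    # split() with no argument collapses runs of whitespace, so the
    # double space between name and sequence yields no empty strings.
    for value in values.split():
        if value[0].isupper():
            # Upper-case first letter: treat the token as sequence data.
            new_list.append(value + '\n')
        else:
            # Otherwise treat it as a sequence name and add the Fasta marker.
            new_list.append('>' + value + '\n')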
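
The same idea can be wrapped in a small function that returns the Fasta text directly. This is only a minimal sketch: the function name `phylip_to_fasta` is made up for illustration, and it assumes the layout shown in the question (one header line, then one `name  sequence` pair per line, with optional blank lines in between). It uses `splitlines()` so that `\n`, `\r\n` and `\r` line endings are all treated the same way, and it writes the header without a trailing space.

```python
def phylip_to_fasta(text):
    """Convert simple one-line-per-sequence Phylip text to Fasta text.

    Hypothetical helper: assumes a header line followed by
    'name<whitespace>sequence' lines, possibly separated by blank lines.
    """
    # splitlines() recognises \n, \r\n and \r, so mixed line endings are fine.
    lines = [line.strip() for line in text.splitlines() if line.strip()]

    records = []
    for line in lines[1:]:  # skip the header line, e.g. "8 1500"
        # First token is the name; everything after it is sequence data.
        name, seq = line.split(None, 1)
        # Remove any whitespace left inside the sequence part.
        seq = ''.join(seq.split())
        records.append('>' + name + '\n' + seq + '\n')
    return ''.join(records)


example = "2 9\n\nsequence1  CTGTCCTTG\n\nsequence2  CTGTCGTTG\n"
print(phylip_to_fasta(example))
```

To convert the files from the question, read each input file, pass its contents to the function, and write the returned string to the matching output file, opening both files inside `with` blocks so each one is closed before the next iteration.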
JRazor