
I solved this myself; see my answer below.

So essentially what I am trying to do is add each element of a list of strings to the end of specific lines in a different file.

It's hard to explain, but essentially I want to parse a FASTA file, and every time the script reaches a header line (line.startswith('>')), replace part of that header with the next element of a list I've already made.

For example:

File1:

">seq1 unwanted here

AATATTATA

ATATATATA

>seq2 unwanted stuff here

GTGTGTGTG

GTGTGTGTG

>seq3 more stuff I don't want

ACACACACAC

ACACACACAC"

I want it to keep ">seq#" but replace everything after with the next item in the list below:

List: mylist = ['things1', '', 'things3', 'things4', '', 'things6', 'things7']

Result (modified file1):

">seq1 things1

AATATTATA

ATATATATA

>seq2 # adds nothing here due to mylist[1] = ''

GTGTGTGTG

GTGTGTGTG

>seq3 things3

ACACACACAC

ACACACACAC

As you can see I want it to add even the blank items in the list.

So once again, I want it to parse this FASTA file, and every time it gets to a header (there are thousands), I want it to replace everything after the first word with the next item in the separate list I have made.
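
To make the goal concrete, here is a rough sketch of the kind of loop I have in mind (file names are placeholders and the list is assumed to already be built):

mylist = ['things1', '', 'things3', 'things4', '', 'things6', 'things7']

with open('file1.fasta') as infile, open('file1_new.fasta', 'w') as outfile:
    for line in infile:
        if line.startswith('>'):
            # keep only the first word ('>seq#') and append the next list item
            outfile.write(line.split()[0] + ' ' + mylist.pop(0) + '\n')
        else:
            outfile.write(line)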

bburc
  • You should add your solution as an answer, not in the body of your question. That way it makes sense to future readers who did not see your original question, which has now been prepended with the answer to your question. – Hack-R Apr 29 '15 at 00:00
  • Please see [formatting](http://stackoverflow.com/help/formatting) help for proper ways to type headings, lists, code etc... –  Apr 29 '15 at 00:57

3 Answers


What you have will work, but there are a few unnecessary lines, so I've edited it down a bit. Also, an important note: you don't close your file handles. That can cause errors, especially when writing to a file, and either way it's bad practice. Code:

#!/usr/bin/python

import sys

# gets list of annotations
def get_annos(infile):
    with open(infile, 'r') as fh:  # makes sure the file is closed properly
        annos = []
        for line in fh:
            annos.append(line.rstrip('\n').split('\t')[5]) # strip the newline, then take column 6 of the tab-separated line

    return annos

# replaces extra info on each header with correct annotation
def add_annos(infile1, infile2, outfile):
    annos = get_annos(infile1) # contains list of annos
    with open(infile2, 'r') as f2, open(outfile, 'w') as output:
        for line in f2:
            if line.startswith('>'):
                line_split = line.split()[:1] # keep only the first whitespace-delimited word (e.g. '>seq1') as a one-item list
                line_split.append(annos.pop(0)) # append data of interest to current id line
                output.write( ' '.join(line_split) + '\n' ) # join and write to file with a newline character
            else:
                output.write(line)

anno = sys.argv[1]
seq = sys.argv[2]
out = sys.argv[3]

add_annos(anno, seq, out)
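
You would run it the same way as your own script, for example (the script name here is just whatever you save it as):

$ python add_annos.py testanno.out testseq.fasta out.txt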

This is not perfect, but it cleans things up a bit. I might veer away from using pop() to associate the annotation data with the sequence IDs unless you are certain the files are in the same order every time.
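
If the order is not guaranteed, one alternative is to key the annotations by sequence ID. This is only a sketch: it assumes the annotation file is tab-separated with the sequence ID in the first column and the annotation text in column 6, and the helper name is illustrative.

# Sketch only: assumes sequence ID in column 1, annotation text in column 6.
def get_annos_by_id(infile):
    annos = {}
    with open(infile) as fh:
        for line in fh:
            cols = line.rstrip('\n').split('\t')
            annos[cols[0]] = cols[5]
    return annos

# In the header branch above, look the annotation up by ID instead of pop():
#     seq_id = line.split()[0].lstrip('>')
#     output.write('>' + seq_id + ' ' + annos.get(seq_id, '') + '\n')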

afinit

There is a great Python library for parsing FASTA and other sequence files: Biopython. It is very useful in bioinformatics, and you can manipulate the records however you need. Here is a simple example taken from the library's website:

from Bio import SeqIO
for seq_record in SeqIO.parse("ls_orchid.fasta", "fasta"):
    print(seq_record.id)
    print(repr(seq_record.seq))
    print(len(seq_record))

You should get something like this on your screen:

gi|2765658|emb|Z78533.1|CIZ78533
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGG...CGC', SingleLetterAlphabet())
740
...
gi|2765564|emb|Z78439.1|PBZ78439
Seq('CATTGTTGAGATCACATAATAATTGATCGAGTTAATCTGGAGGATCTGTTTACT...GCC', SingleLetterAlphabet())
592
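
For the question above, a rough sketch of the same idea with Bio.SeqIO might look like this (the file names and the annos list are placeholders; it assumes the annotations are already in a list ordered the same way as the records):

from Bio import SeqIO

# Placeholder list; in the real script this would come from the annotation file,
# in the same order as the records.
annos = ['things1', '', 'things3']

records = list(SeqIO.parse("input.fasta", "fasta"))    # placeholder file name
for record, anno in zip(records, annos):
    # Biopython writes FASTA headers as ">id description", so replacing the
    # description replaces everything after the first word of the header.
    record.description = anno
SeqIO.write(records, "output.fasta", "fasta")          # placeholder file name

Note that by default SeqIO wraps sequence lines at 60 characters, so the sequence layout may differ slightly from the input file.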
Sam Al-Ghammari

EDIT:

I solved this before anyone could help. This is my code; can anyone tell me if I have any bad practices? Is there a way to do this without writing everything to a new file? It seems like that would take a long time / a lot of memory.

#!/usr/bin/python
# Script takes unedited FASTA file, removes seq length and
# other header info, adds annotation after sequence name
# run as: $ python addanno.py testanno.out testseq.fasta out.txt

import sys

# gets list of annotations
def get_annos(infile):
    f = open(infile)
    list2 = []
    for line in f:
        columns = line.strip().split('\t')
        list2.append(columns[5])
    return list2

# replaces extra info on each header with correct annotation
def add_annos(infile1, infile2, outfile):
    mylist = get_annos(infile1) # contains list of annos
    f2 = open(infile2, 'r')
    output = open(outfile, 'w')  # use the outfile parameter rather than the global 'out'
    for line in f2:
        if line.startswith('>'):
            l = line.partition(" ")
            list3 = list(l)
            del list3[1:]
            list3.append(' ')
            list3.append(mylist.pop(0))
            final = ''.join(list3)
            line = line.replace(line, final)
            output.write(line)
            output.write('\n')
        else:
            output.write(line)

anno = sys.argv[1]
seq = sys.argv[2]
out = sys.argv[3]

add_annos(anno, seq, out)
get_annos(anno)

bburc
  • I added a separate solution with the code cleaned up a bit. As far as resources go, bioinformatics is always going to be resource intensive (string processing), but what you're doing here shouldn't take too long, especially with the superfluous processing removed. If you wanted to write back to the same file you'd need to save everything to memory first, then reopen the file for writing. Since you're working with a fasta file here it shouldn't really take that much extra memory (compared to other biodata). – afinit Apr 29 '15 at 04:03
  • Cool, I appreciate it. A couple of small errors in your code, and when I tried running it it didn't end up as I planned (no time to mess with it at all), but I'm going to use the ideas. Yes, I am sure the list is going to be in the same order, since the original file is in order and it will simply append each element one after the other. I also have access to a cluster which is very powerful, so memory doesn't matter; I was just curious if there was a different way than writing to a new file. Thanks again – bburc Apr 29 '15 at 21:29
  • I fixed the whitespace error at the else line, let me know if you found any others so I can fix them in the post – afinit Apr 29 '15 at 21:35
  • First function you use fh then f. Second function's for loop...indents needed after block started, and you also have line_split then use list_split the next line. The output takes my sequence name and splits it into letters, then only appends the first word out of the string I want. For instance, I need '>compX_cY_seqZ Added Words Here' and your output is: '> c o m p X _ c Y _ s e q Z Added' – bburc Apr 29 '15 at 22:01
  • ah.. sorry bout that. thanks for the other errors and I didn't understand your data format. It should be fixed now – afinit Apr 29 '15 at 22:10
  • It's all good, I fixed your code and it works perfectly now. Basically just remove the [0] from your line_split definition in the for loop and add the next line "del line_split[1:]". I found the way you had it is what caused the first word to split into characters, since "line_split = list(line.split()[0])" first splits the line at whitespace (into words), then selects the first word and THEN turns it into a list, I believe. By not selecting that index when defining line_split, you get the list split by word and can remove all elements after [0] to keep the first. – bburc Apr 29 '15 at 22:42
  • I didn't realize you had spaces in that field. `line_split = line.split()[:1]` is better because it's one step.. "simple is better than complex" – afinit Apr 29 '15 at 23:31
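
To follow up on the in-place question from this answer: as the first comment above notes, you have to read the whole file into memory first and then reopen the same path for writing. A rough sketch (the function name is illustrative, and it reuses get_annos from the code above):

def add_annos_in_place(anno_file, fasta_file):
    annos = get_annos(anno_file)       # reuse the existing annotation reader
    with open(fasta_file) as fh:
        lines = fh.readlines()         # hold the whole file in memory
    with open(fasta_file, 'w') as fh:  # reopen the same file for writing
        for line in lines:
            if line.startswith('>'):
                fh.write(line.split()[0] + ' ' + annos.pop(0) + '\n')
            else:
                fh.write(line)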