Remove duplicate sequences from fasta file based on ID

Question

I wrote a small Biopython script to extract sequences from a FASTA file based on ID, but it also extracts duplicates. I would like to filter out the duplicate sequences in my FASTA files (i.e. records that have the exact same ID).

I tried to modify my script but I failed:

from Bio import SeqIO

id = []
for line in open("short.txt","r"):
    id.append(line.rstrip().strip('"'))


for rec in SeqIO.parse("out.fa","fasta"):
    #print rec.id
    if rec.id in id:
        if rec.id not in rec.format:
            print rec.format("fasta")

Can anyone help?

score 0 · Accepted Answer · answered Oct 22 '14 at 07:48

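ids = set()  # IDs seen so far
for rec in blah:  # blah: any iterable of records, e.g. SeqIO.parse("out.fa", "fasta")
    if rec.id not in ids:
        ids.add(rec.id)
        # process it (first occurrence of this ID only)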
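
For illustration, here is a minimal sketch of how the seen-set idea could be combined with the wanted-ID filter from the question. The helper name `unique_wanted` and the `Record` stand-in are hypothetical (they are not part of Biopython); the stand-in only mimics the `.id` attribute of a real record so that the example runs without any input files.

```python
from collections import namedtuple


def unique_wanted(records, wanted_ids):
    """Yield records whose ID is in wanted_ids, keeping only the
    first occurrence of each ID (hypothetical helper)."""
    seen = set()  # IDs that have already been yielded
    for rec in records:
        if rec.id in wanted_ids and rec.id not in seen:
            seen.add(rec.id)
            yield rec


# Stand-in for a parsed FASTA record; only .id and .seq are assumed here.
Record = namedtuple("Record", ["id", "seq"])

records = [
    Record("seq1", "ACGT"),
    Record("seq2", "GGCC"),
    Record("seq1", "ACGT"),  # duplicate ID, should be skipped
    Record("seq3", "TTAA"),
]
wanted = {"seq1", "seq3"}  # a set makes the membership test cheap

kept = list(unique_wanted(records, wanted))
for rec in kept:
    print(f">{rec.id}\n{rec.seq}")
```

With Biopython, the same function should work if `records` is `SeqIO.parse("out.fa", "fasta")` and each kept record is printed with `rec.format("fasta")`. Loading the wanted IDs into a set rather than a list is a design choice that keeps each lookup fast on large files.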

John Zwinck


It does not seem to work – user3188922 Oct 22 '14 at 08:21
@user3188922: Generally when people post "it doesn't work," no further useful help will be given, because there's not a good way for us to understand what your problem is. If you want help, you need to be explicit about *what* does not work, what errors you see if any, what your input and output data look like, etc. Not "here is some random code, it doesn't work." – John Zwinck Oct 22 '14 at 08:23
Sorry, it actually works perfectly :) My bad! I thought idS was a mistake and should be id instead, but otherwise creating a new set of ids is a great idea! – user3188922 Oct 22 '14 at 08:26
