
I wrote a small Biopython script that extracts sequences from a FASTA file by ID, but it also extracts duplicates. I want to filter out duplicate sequences, i.e. records that have exactly the same ID, so that each ID is printed only once.

I tried to modify my script but I failed:

from Bio import SeqIO

id = []
for line in open("short.txt","r"):
    id.append(line.rstrip().strip('"'))


for rec in SeqIO.parse("out.fa","fasta"):
    #print rec.id
    if rec.id in id:
        if rec.id not in rec.format:
            print rec.format("fasta")

Can anyone help?

user3188922

1 Answer

ids = set()
for rec in blah:
    if rec.id not in ids:
        ids.add(rec.id)
        # process it
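
A fuller sketch of the same idea, in case it helps: keep a set of IDs already seen and skip any record whose ID is in it. The helper name `filter_unique` and the stand-in `Record` type are hypothetical, used here only so the example runs without Biopython or the input files; the only assumption about the records is that each has an `.id` attribute, as `SeqRecord` objects do.

```python
from collections import namedtuple


def filter_unique(records, wanted_ids):
    """Yield records whose id is in wanted_ids, keeping only the
    first record seen for each id (hypothetical helper)."""
    seen = set()
    for rec in records:
        if rec.id in wanted_ids and rec.id not in seen:
            seen.add(rec.id)  # remember this id so later repeats are skipped
            yield rec


# Stand-in for SeqRecord objects; only the .id attribute matters here.
Record = namedtuple("Record", ["id", "seq"])

records = [
    Record("a", "ACGT"),
    Record("b", "GGCC"),
    Record("a", "ACGT"),  # duplicate id, should be dropped
    Record("c", "TTAA"),
]

kept = list(filter_unique(records, {"a", "c"}))
print([rec.id for rec in kept])  # prints ['a', 'c']
```

With the files from the question, the same helper could presumably be fed `SeqIO.parse("out.fa", "fasta")` and a set of IDs read from `short.txt`, printing each kept record with `rec.format("fasta")`. Using a set rather than a list for the wanted IDs also makes each membership test constant time on average.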
John Zwinck
  • It does not seem to work – user3188922 Oct 22 '14 at 08:21
  • @user3188922: Generally when people post "it doesn't work," no further useful help will be given, because there's not a good way for us to understand what your problem is. If you want help, you need to be explicit about *what* does not work, what errors you see if any, what your input and output data look like, etc. Not "here is some random code, it doesn't work." – John Zwinck Oct 22 '14 at 08:23
  • Sorry, it actually works perfectly :) My bad! I thought `ids` was a typo for `id`. Keeping a separate set of seen IDs is a great idea! – user3188922 Oct 22 '14 at 08:26