
I'm trying to sort a FASTA file by alphabetical order of the sequences in the file (not the IDs of the sequences). The file contains over 200 sequences, and I'm trying to find duplicates (by duplicates I mean almost the same protein sequence, but not the same ID) within it using a Python script. So I wanted to make a dictionary out of the FASTA file and then sort the dictionary's values. The code I am trying to use is the following:

from Bio import SeqIO


input_file = open("PP_Seq.fasta")    
my_dict = SeqIO.to_dict(SeqIO.parse(input_file, "fasta"))
print sorted(my_dict.values())

I keep getting this error message:

Traceback (most recent call last):
  File "sort.py", line 4, in <module>
    print sorted(my_dict.values())
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Bio/SeqRecord.py", line 730, in __lt__
    raise NotImplementedError(_NO_SEQRECORD_COMPARISON)
NotImplementedError: SeqRecord comparison is deliberately not implemented. Explicitly compare the attributes of interest.

I also tried to look up how to fix this error, but there isn't much information about it, and some of the information I read seemed to say that the length of the sequences stored in the dictionary may be the problem... If so, how can I sort the FASTA file without SeqIO?

  • Is your dict like `{fasta_header: sequence}`? – Carles Mitjans Feb 21 '17 at 10:58
  • That means that `SeqRecord`s are not comparable, so they can't be sorted. By what key do you want to sort them? Something like `sorted(my_dict.values(), key=operator.attrgetter('seq'))` would probably work. – mata Feb 21 '17 at 11:06
  • @mata let's say I have this file as input: >seq0 ABCWYXO >seq1 IJKLMNOP >seq2 BCDEFGH >seq3 ABCDEFG I want the output to be a file organized like this: >seq3 ABCDEFG >seq0 ABCWYXO >seq2 BCDEFGH >seq1 IJKLMNOP Basically sort them by alphabetical order of the protein sequences... So I am thinking of comparing one sequence (string) to the other, one character at a time in a loop, and sorting them that way, then placing them in a new file in that order, each with its own ID... – Armelle Quinn Feb 21 '17 at 11:29
  • You need to pass a [`key` function](https://docs.python.org/3/howto/sorting.html#key-functions) to `sorted()` which returns a sortable key for each record. `operator.attrgetter('seq')` should do it; it returns a (comparable) [`Seq`](https://github.com/biopython/biopython/blob/master/Bio/Seq.py) for each [`SeqRecord`](https://github.com/biopython/biopython/blob/master/Bio/SeqRecord.py#L95), but I don't really have any experience with those libraries. – mata Feb 21 '17 at 11:41
  • You might achieve something faster using `fastq-sort -s` from fastq-tools and bioawk to create and then remove dummy qualities. See http://stackoverflow.com/a/41278639/1878788 for a related case. – bli Feb 22 '17 at 09:21

2 Answers


As mata said in the comments, you need to pass a `key` function to `sorted()`:

from Bio import SeqIO
import operator
input_file = open("example.fasta")    
my_dict = SeqIO.to_dict(SeqIO.parse(input_file, "fasta"))
for r in sorted(my_dict.values(), key=operator.attrgetter('seq')):
    print r.id, str(r.seq)

which prints:

seq3 ABCDEFG
seq0 ABCWYXO
seq2 BCDEFGH
seq1 IJKLMNOP
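To go from that printout to an actual sorted file, here is a plain-Python sketch that needs no Biopython at all; the FASTA text is inlined for illustration and mirrors the example records above (in practice you would read from and write to files):

```python
# Plain-Python sketch (no Biopython): parse, sort by sequence,
# and re-emit FASTA text. The inlined records mirror the example
# above; real data would come from a file instead.
fasta = """>seq0
ABCWYXO
>seq1
IJKLMNOP
>seq2
BCDEFGH
>seq3
ABCDEFG
"""

records = []
for chunk in fasta.split(">")[1:]:
    lines = chunk.strip().splitlines()
    # (header, sequence) -- joins multi-line sequences if present
    records.append((lines[0], "".join(lines[1:])))

# Sort by the sequence itself, not by the ID.
records.sort(key=lambda rec: rec[1])

sorted_fasta = "".join(">%s\n%s\n" % rec for rec in records)
print(sorted_fasta)
```

With Biopython, the same result can be written directly with `SeqIO.write(sorted(my_dict.values(), key=operator.attrgetter('seq')), "sorted.fasta", "fasta")`.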

Now, for what you're trying to accomplish: even after sorting the 200 sequences alphabetically, you would still need to scan the list manually to find the near-duplicates. That is error prone, so it's better to write some code for that as well.

In computer science, edit distance is a way of quantifying how dissimilar two strings (e.g., words) are to one another by counting the minimum number of operations required to transform one string into the other.

There are several implementations of this algorithm available. We'll take the one from this answer.

def levenshteinDistance(s1, s2):
    if len(s1) > len(s2):
        s1, s2 = s2, s1

    distances = range(len(s1) + 1)
    for i2, c2 in enumerate(s2):
        distances_ = [i2+1]
        for i1, c1 in enumerate(s1):
            if c1 == c2:
                distances_.append(distances[i1])
            else:
                distances_.append(1 + min((distances[i1], distances[i1 + 1], distances_[-1])))
        distances = distances_
    return distances[-1]

Now we need to decide on a threshold of how dissimilar (how many insertions/deletions/substitutions) two sequences may be. Then we compare every pair of sequences in our FASTA file:

from Bio import SeqIO
from itertools import combinations
input_file = open("example.fasta")    

threshold = 4
records = SeqIO.parse(input_file, "fasta")
for record1, record2 in combinations(records, 2):
    edit_distance = levenshteinDistance(str(record1.seq), str(record2.seq))
    if edit_distance <= threshold:
        print "{} and {} differ in {} characters".format(record1.id, record2.id, edit_distance)

This gives:

seq0 and seq3 differ in 4 characters
seq2 and seq3 differ in 2 characters
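If you'd rather not hand-roll the distance function, the standard library's `difflib.SequenceMatcher` gives a similarity ratio (2 × matching characters / total length) that can serve as a rough stand-in for edit distance. A sketch on the same example records; the 0.7 cutoff is an arbitrary assumption to tune for your data:

```python
from difflib import SequenceMatcher
from itertools import combinations

# Example (id, sequence) pairs matching the records used above.
records = [("seq0", "ABCWYXO"), ("seq1", "IJKLMNOP"),
           ("seq2", "BCDEFGH"), ("seq3", "ABCDEFG")]

similar = []
for (id1, s1), (id2, s2) in combinations(records, 2):
    ratio = SequenceMatcher(None, s1, s2).ratio()
    if ratio >= 0.7:  # arbitrary cutoff -- tune for your data
        similar.append((id1, id2, round(ratio, 2)))

print(similar)
```

Note that a ratio rewards long common subsequences rather than penalizing edits, so it ranks pairs slightly differently than Levenshtein distance does.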
BioGeek
  • How do I turn the first print result into a dictionary? I tried 'for r in sorted(sequences.values(), key=operator.attrgetter('seq')): t = dict(r.id, str(r.seq)) – Katie S Apr 28 '23 at 10:09

Here is another way to eliminate duplicates from a FASTA file, based on bioawk and fastq-tools (and also awk and uniq, which are usually present in any UNIX-like command-line environment):

bioawk -c fastx '{print "@"$name"\n"$seq"\n+\n"$seq}' test.fasta \
    | fastq-sort -s \
    | bioawk -c fastx '{print $name"\t"$seq}' \
    | uniq -f 1 \
    | awk '{print ">"$1"\n"$2}'

bioawk is a modified version of awk that facilitates the manipulation of some common bioinformatics formats.

The first line converts the data to fastq format because that's what fastq-sort can work with. The -c fastx option tells bioawk to parse the data as fasta or fastq. This defines a $name and a $seq field that can be used in the command. We use the $seq field as a dummy quality to get valid fastq format.

The second line tells fastq-sort (from fastq-tools) to sort the data by sequence (option -s).

The third line uses bioawk to extract the name and sequence and put them as two tab-separated fields, to have the relevant information in one line per record.

The fourth line uses `uniq` to eliminate duplicates, with `-f 1` telling it to ignore the first field when comparing successive lines. If my understanding of `uniq` is correct, the name kept for a run of identical sequences will be that of the first record in the series.
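To illustrate what `uniq -f 1` does at that stage, here is a toy run on hand-written lines in the same tab-separated intermediate format (not part of the pipeline itself; seqX duplicates seq3's sequence):

```shell
# -f 1 makes uniq skip the first field (the name) when comparing,
# so the seqX line is dropped as a duplicate of the seq3 line.
printf 'seq3\tABCDEFG\nseqX\tABCDEFG\nseq0\tABCWYXO\n' | uniq -f 1
```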

The fifth line uses awk to reformat the tab-separated fields into fasta.
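As an aside, if only exact duplicates need removing and sorted output is not required, plain `awk` alone can do the job. A sketch that assumes each sequence sits on a single line (toy input inlined for illustration):

```shell
# Toy input with one exact duplicate (seqX repeats seq3's sequence).
printf '>seq3\nABCDEFG\n>seqX\nABCDEFG\n>seq0\nABCWYXO\n' \
    | awk '/^>/ {name = $0; next}          # remember the current header
           !seen[$0]++ {print name; print $0}'  # emit record only for unseen sequences
```

This keeps the first header seen for each distinct sequence, like the `uniq`-based pipeline, but leaves the records in their original order.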

I tested this approach on some of my data and it seemed to work.

bli