Python - grouping defaultdict values by hamming distance in keys

Question

I have a default dict with ~700 keys. The keys are in a format such as A_B_STRING. What I need to do is split the key by '_', and compare the distance between 'STRING' for each key, if A and B are identical. If the distance is <= 2, I want to group the lists for those keys into a single defaultdict key:value group. There will be multiple keys that should match and be grouped. I also want to keep in the new combined defaultdict, all the key:value pairs that did not make it into a group.

The input file is in FASTA format, where the headers are the keys, and the values are the sequences (defaultdict is used because multiple sequences have the same header based on a blast report from an original fasta file).

This is what I have so far:

!/usr/bin/env python

import sys
from collections import defaultdict
import itertools

inp = sys.argv[1]                                                       # input fasta file; format '>header'\n'sequence'

with open(inp, 'r') as f:
        h = []
        s = []
        for line in f:
                if line.startswith(">"):
                        h.append(line.strip().split('>')[1])            # append headers to list
                else:
                        s.append(line.strip())                          # append sequences to list

seqs = dict(zip(h,s))                                                   # create dictionary of headers:sequence

print 'Total Sequences: ' + str(len(seqs))                              # Numb. total sequences in input file

groups = defaultdict(list)

for i in seqs:
        groups['_'.join(i.split('_')[1:])].append(seqs[i])                      # Create defaultdict with sequences in lists with identical headers

def hamming(str1, str2):
        """ Simple hamming distance calculator """
        if len(str1) == len(str2):
                diffs = 0
                for ch1, ch2 in zip(str1,str2):
                        if ch1 != ch2:
                                diffs += 1
                return diff

keys = [x for x in groups]

combos = list(itertools.combinations(keys,2))                           # Create tupled list with all comparison combinations

combined = defaultdict(list)                                            # Defaultdict in which to place groups

for i in combos:                                                        # Combo = (A1_B1_STRING2, A2_B2_STRING2)
        a1 = i[0].split('_')[0]
        a2 = i[1].split('_')[0]

        b1 = i[0].split('_')[1]                                         # Get A's, B's, C's
        b2 = i[1].split('_')[1]

        c1 = i[0].split('_')[2]
        c2 = i[1].split('_')[2]

        if a1 == a2 and b1 == b2:                                       # If A1 is equal to A2 and B1 is equal to B2
                d = hamming(c1, c2)                                     # Get distance of STRING1 vs STRING2
                if d <= 2:                                              # If distance is less than or equal to 2
                        combined[i[0]].append(groups[i[0]] + groups[i[1]])      # Add to defaultdict by combo 1 key

print len(combined)
for c in sorted(combined):
        print c, '\t', len(combined[c])

The problem is that this code is not working as expected. When printing the keys in the combined defaultdict; I clearly see that there are many that can be combined. However, the length of the combined defaultdict is about half the size of the original.

Edit

Alternative no itertools.combinations:

for a in keys:
        tocombine = []
        tocombine.append(a)
        tocheck = [x for x in keys if x != a]
        for b in tocheck:
                i = (a,b)                                               # Combo = (A1_B1_STRING2, A2_B2_STRING2)
                a1 = i[0].split('_')[0]
                a2 = i[1].split('_')[0]

                b1 = i[0].split('_')[1]                                         # Get A's, B's, C's
                b2 = i[1].split('_')[1]

                c1 = i[0].split('_')[2]
                c2 = i[1].split('_')[2]

                if a1 == a2 and b1 == b2:                                       # If A1 is equal to A2 and B1 is equal to B2
                        if len(c1) == len(c2):                                          # If length of STRING1 is equal to STRING2
                                d = hamming(c1, c2)                                     # Get distance of STRING1 vs STRING2
                                if d <= 2:
                                        tocombine.append(b)
        for n in range(len(tocombine[1:])):
                keys.remove(tocombine[n])
                combined[tocombine[0]].append(groups[tocombine[n]])

final = defaultdict(list)
for i in combined:
        final[i] = list(itertools.chain.from_iterable(combined[i]))

However, with these method, I am still missing a few from that do not match to any others.

You're missing one " in your hamming documention string, it is causing a formatting error, can you put that in there? I would edit it for you but stack overflow requires at least a 6 character edit :/ — Aaron, Jun 10 '16 at 16:11

Aaron · Answer 1 · 2016-06-10T21:39:53.773

I think I see one issue with your code consider this scenario:

0: A_B_DATA1 
1: A_B_DATA2    
2: A_B_DATA3 

All the valid comparisons are:  
0 -> 1 * Combines under key 'A_B_DATA1' 
0 -> 2 * Combines under key 'A_B_DATA1'
1 -> 2 * Combines under key 'A_B_DATA2' **opps

I would imagine you would want all three of these to combine under 1 key. However consider this scenario:

0: A_B_DATA111
1: A_B_DATA122    
2: A_B_DATA223 

All the valid comparisons are:  
0 -> 1 * Combines under key 'A_B_DATA111' 
0 -> 2 * Combines under key 'A_B_DATA111'
1 -> 2 * Combines under key 'A_B_DATA122'

Now it gets a bit tricky since row 0 is distance 2 from row 1, and row 1 is distance 2 from row 2, but you might not want them all together as row 0 is distance 3 away from row 2!

Here is an example of a working solution assuming this is what you want the output to look like:

def unpack_key(key):
    data = key.split('_')
    return '_'.join(data[:2]), '_'.join(data[2:])

combined = defaultdict(list)
for key1 in groups:
    combined[key1] = []
    key1_ab, key1_string = unpack_key(key1)
    for key2 in groups:
        if key1 != key2:
            key2_ab, key2_string = unpack_key(key2)
            if key1_ab == key2_ab and len(key1_string) == len(key2_string):
               if hamming(key1_string, key2_string) <= 2:
                   combined[key1].append(key2)

In our second example, this would result in the following dictionary, if that isn't the answer you are looking for, could you type out exactly what the final dictionary for this example should be?

A_B_DATA111: ['A_B_DATA122']
A_B_DATA122: ['A_B_DATA111', 'A_B_DATA223']
A_B_DATA223: ['A_B_DATA122']

Keep in mind this is a O(n^2) algorithm, meaning it is not scalable when your key set grows larger.

I see what you mean with the key issue. I think I've come up with a solution, not using itertoosl combinations, see edit — st.ph.n, Jun 10 '16 at 18:19
Can you give me what you would expect the output to be for the above scenarios? — Aaron, Jun 10 '16 at 20:02
The output should be a default dict where the keys contain values that passed the above criteria: All values original header must have the same A and B, and each STRINGs should differ by <= 2 chars. — st.ph.n, Jun 10 '16 at 20:07
I get that, but I feel there are multiple ways to represent that answer. Can you give me exactly what the output should look like (what the resulting dictionaries keys/values will be). — Aaron, Jun 10 '16 at 20:56
ok. so suppose key A_B_DATA111 has in its list 10 sequences, and A_B_DATA122 has 7, and A_B_DATA223 has 4. In the final dict A_B_DATA111 would be the representative key to search for other keys with the above criteria. A_B_DATA122 should fall under A_B_DATA111. A_B_DATA223, would searched against other keys. — st.ph.n, Jun 10 '16 at 21:02

Python - grouping defaultdict values by hamming distance in keys

1 Answers1