I have a default dict with ~700 keys. The keys are in a format such as A_B_STRING. What I need to do is split the key by '_', and compare the distance between 'STRING' for each key, if A and B are identical. If the distance is <= 2, I want to group the lists for those keys into a single defaultdict key:value group. There will be multiple keys that should match and be grouped. I also want to keep in the new combined defaultdict, all the key:value pairs that did not make it into a group.
The input file is in FASTA format, where the headers are the keys, and the values are the sequences (defaultdict is used because multiple sequences have the same header based on a blast report from an original fasta file).
This is what I have so far:
!/usr/bin/env python
import sys
from collections import defaultdict
import itertools
inp = sys.argv[1] # input fasta file; format '>header'\n'sequence'
with open(inp, 'r') as f:
h = []
s = []
for line in f:
if line.startswith(">"):
h.append(line.strip().split('>')[1]) # append headers to list
else:
s.append(line.strip()) # append sequences to list
seqs = dict(zip(h,s)) # create dictionary of headers:sequence
print 'Total Sequences: ' + str(len(seqs)) # Numb. total sequences in input file
groups = defaultdict(list)
for i in seqs:
groups['_'.join(i.split('_')[1:])].append(seqs[i]) # Create defaultdict with sequences in lists with identical headers
def hamming(str1, str2):
""" Simple hamming distance calculator """
if len(str1) == len(str2):
diffs = 0
for ch1, ch2 in zip(str1,str2):
if ch1 != ch2:
diffs += 1
return diff
keys = [x for x in groups]
combos = list(itertools.combinations(keys,2)) # Create tupled list with all comparison combinations
combined = defaultdict(list) # Defaultdict in which to place groups
for i in combos: # Combo = (A1_B1_STRING2, A2_B2_STRING2)
a1 = i[0].split('_')[0]
a2 = i[1].split('_')[0]
b1 = i[0].split('_')[1] # Get A's, B's, C's
b2 = i[1].split('_')[1]
c1 = i[0].split('_')[2]
c2 = i[1].split('_')[2]
if a1 == a2 and b1 == b2: # If A1 is equal to A2 and B1 is equal to B2
d = hamming(c1, c2) # Get distance of STRING1 vs STRING2
if d <= 2: # If distance is less than or equal to 2
combined[i[0]].append(groups[i[0]] + groups[i[1]]) # Add to defaultdict by combo 1 key
print len(combined)
for c in sorted(combined):
print c, '\t', len(combined[c])
The problem is that this code is not working as expected. When printing the keys in the combined defaultdict; I clearly see that there are many that can be combined. However, the length of the combined defaultdict is about half the size of the original.
Edit
Alternative no itertools.combinations:
for a in keys:
tocombine = []
tocombine.append(a)
tocheck = [x for x in keys if x != a]
for b in tocheck:
i = (a,b) # Combo = (A1_B1_STRING2, A2_B2_STRING2)
a1 = i[0].split('_')[0]
a2 = i[1].split('_')[0]
b1 = i[0].split('_')[1] # Get A's, B's, C's
b2 = i[1].split('_')[1]
c1 = i[0].split('_')[2]
c2 = i[1].split('_')[2]
if a1 == a2 and b1 == b2: # If A1 is equal to A2 and B1 is equal to B2
if len(c1) == len(c2): # If length of STRING1 is equal to STRING2
d = hamming(c1, c2) # Get distance of STRING1 vs STRING2
if d <= 2:
tocombine.append(b)
for n in range(len(tocombine[1:])):
keys.remove(tocombine[n])
combined[tocombine[0]].append(groups[tocombine[n]])
final = defaultdict(list)
for i in combined:
final[i] = list(itertools.chain.from_iterable(combined[i]))
However, with these method, I am still missing a few from that do not match to any others.