I need to create a dictionary; the values can be left blank or zero, but the keys need to be all possible combinations of the characters ABCD with length k. For example, for k = 8:

lex = defaultdict(int)     
lex = {
'AAAAAAAA':0,
'AAAAAAAB':0,
'AAAAAABB':0,
...}

So far I have tried something like this. I know it's wrong, but I have no idea how to make it work; I'm new to Python, so please bear with me.

import itertools

mydiction = {}
mylist = []
mylist = itertools.permutations('ACTG', 8)
for keys in mydiction:
    mydiction[keys] = mylist.next()
print(mydiction)
Christos Karapapas

3 Answers
You can do it in one line; what you are looking for is combinations_with_replacement:

from itertools import combinations_with_replacement
mydict = {"".join(key):0 for key in combinations_with_replacement('ACTG', 8)}
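A quick sanity check of the size: combinations_with_replacement yields sorted selections, so for k = 8 over four letters there are C(4+8−1, 8) = 165 keys, far fewer than the 4**8 ordered strings.

```python
from itertools import combinations_with_replacement

# Keys are sorted selections, so there are C(11, 8) = 165 of them.
mydict = {"".join(key): 0 for key in combinations_with_replacement('ACTG', 8)}
print(len(mydict))           # 165
print('AAAAAAAA' in mydict)  # True
```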
thefourtheye

What you're describing isn't permutations, but combinations with replacement. There's a function for that in the itertools module as well.

Note, however, that there are tens of thousands of strings there (4^8 = 65,536 if you mean every ordered string). Trying to put them all in a dict, or even just iterate over them all, is NOT going to produce happy results.

What's your use case? It's possible you just need to recognize combinations, rather than generating them all exhaustively. And each combination is intrinsically associated with a particular 16-bit integer index, so you could instead store and operate on that.
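The integer-index idea can be sketched like this. `to_index` and `from_index` are hypothetical helper names, not part of any library; packing 2 bits per letter gives each length-8 string a unique integer in 0..65535, so a flat list can replace the dict of strings.

```python
# Sketch of the integer-index idea; to_index / from_index are
# hypothetical helpers, not part of any library.
LETTERS = 'ACTG'

def to_index(s):
    # Pack each letter into 2 bits, most significant letter first.
    i = 0
    for ch in s:
        i = (i << 2) | LETTERS.index(ch)
    return i

def from_index(i, k=8):
    # Unpack the 2-bit digits back into letters.
    return ''.join(LETTERS[(i >> 2 * (k - 1 - x)) & 3] for x in range(k))

counts = [0] * 4 ** 8              # flat array instead of a dict of strings
counts[to_index('ACTGACTG')] += 1
```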

Sneftel
    I know that playing with such numbers isn't best practice, and I know there is a better solution to what I'm trying to do; it's just that making the algorithm more elegant is an even more complex task. That's why I need to do this part with this "brute force" method, and after I get my results I'll try to refine it. – Christos Karapapas Nov 15 '13 at 12:40

Although the combinations_with_replacement function works perfectly fine, you will be generating a huge list of strings with a relatively high collision rate (around 20%).

What you are looking to do can be done with base-4 integers. Not only are they faster to process and more memory efficient, but they also have zero collisions (each number is its own hash), which guarantees O(1) look-up time even in the worst case.

def num_to_hash(n, k, literals='ABCD'):
    # Decode n's 2-bit digits into letters, most significant digit first.
    return ''.join(literals[(n >> (k - x) * 2) & 3] for x in range(1, k + 1))

k = 2
# Note: range(4**k), not range(4**k - 1) -- the latter drops the last key ('GG').
d = {num_to_hash(x, k, 'ACTG'): 0 for x in range(4 ** k)}
print(d)

output (keys shown sorted):

{'AA': 0,
 'AC': 0,
 'AG': 0,
 'AT': 0,
 'CA': 0,
 'CC': 0,
 'CG': 0,
 'CT': 0,
 'GA': 0,
 'GC': 0,
 'GG': 0,
 'GT': 0,
 'TA': 0,
 'TC': 0,
 'TG': 0,
 'TT': 0}
Samy Arous
    This doesn't generate combinations with replacement but the Cartesian product. And I'm not sure I buy your collision rate: for me, even with a 20-character key, I have a dictionary size of 852610 and 852529 unique hashes, so there is a negligible collision rate. (I think it'd be silly to worry about it anyway, but I can't follow where your numbers are coming from.) – DSM Nov 15 '13 at 13:09
    It's not hard to verify, actually: create a list of such strings, compute the hash of every element, and turn the hashes into a set. The difference in length between the list and the set equals the number of collisions. Sure, it's not an issue; string hashes are known to perform well. The 20% figure was found using only 8-character strings. A deeper analysis is required, and I'm sure the impact on overall performance is probably minor, but it simply means it's not the best solution :) This is due to the small number of literals. – Samy Arous Nov 15 '13 at 13:23
    But that's where I got the above numbers. For an 8-character string I get no hash collisions at all. If you get 20% collisions for <= 65536 strings, something is very wrong. – DSM Nov 15 '13 at 13:51
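The check debated in these comments can be sketched as below. Note that Python 3 randomizes string hashes per process (PYTHONHASHSEED), so the exact collision count can vary between runs; for 65,536 short strings hashed into a 64-bit space it is typically at or near zero.

```python
from itertools import product

# Count hash collisions among all length-8 strings over ACTG:
# list length minus the number of distinct hash values.
strings = [''.join(p) for p in product('ACTG', repeat=8)]
collisions = len(strings) - len({hash(s) for s in strings})
print(len(strings), collisions)
```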