Problem Category
The problem you're solving is best described as testing for anagram matches.
Solution using Sort
The traditional solution is to sort the target string, sort the candidate string, and test for equality.
>>> def permutations_in_dict(string, words):
target = sorted(string)
return sorted(word for word in words if sorted(word) == target)
>>> permutations_in_dict('act', {'cat', 'rat', 'dog', 'act'})
['act', 'cat']
Solution using Multisets
Another approach is to use collections.Counter() to make a multiset equality test. This is algorithmically superior to the sort solution (O(n)
versus O(n log n)
) but tends to lose unless the size of the strings is large (due to the cost of hashing all the characters).
>>> def permutations_in_dict(string, words):
target = Counter(string)
return sorted(word for word in words if Counter(word) == target)
>>> permutations_in_dict('act', {'cat', 'rat', 'dog', 'act'})
['act', 'cat']
Solution using a Perfect Hash
A unique anagram signature or perfect hash can be constructed by multiplying prime numbers corresponding to each possible character in a string.
The commutative property of multiplication guarantees that the hash value will be invariant for any permutation of a single string. The uniqueness of the hash value is guaranteed by the fundamental theorem of arithmetic (also known as the unique prime factorization theorem).
>>> from operator import mul
>>> primes = [2, 3, 5, 7, 11]
>>> primes += [p for p in range(13, 1620) if all(pow(b, p-1, p) == 1 for b in (5, 11))]
>>> anagram_hash = lambda s: reduce(mul, (primes[ord(c)] for c in s))
>>> def permutations_in_dict(string, words):
target = anagram_hash(string)
return sorted(word for word in words if anagram_hash(word) == target)
>>> permutations_in_dict('act', {'cat', 'rat', 'dog', 'act'})
['act', 'cat']
Solution using Permutations
Searching by permutations on the target string using itertools.permutations() is reasonable when the string is small (generating permutations on a n length string generates n factorial candidates).
The good news is that when n is small and the number of words is large, this approach runs very fast (because set membership testing is O(1)):
>>> from itertools import permutations
>>> def permutations_in_dict(string, words):
perms = set(map(''.join, permutations(string)))
return sorted(word for word in words if word in perms)
>>> permutations_in_dict('act', {'cat', 'rat', 'dog', 'act'})
['act', 'cat']
As the OP surmised, the pure python search loop can be sped-up to c-speed by using set.intersection():
>>> def permutations_in_dict(string, words):
perms = set(map(''.join, permutations(string)))
return sorted(words & perms)
>>> permutations_in_dict('act', {'cat', 'rat', 'dog', 'act'})
['act', 'cat']
Best Solution
Which solution is best depends on the length of string and the length of words. Timings will show which is best for a particular problem.
Here are some comparative timings for the various approaches using two different string sizes:
Timings with string_size=5 and words_size=1000000
-------------------------------------------------
0.01406 match_sort
0.06827 match_multiset
0.02167 match_perfect_hash
0.00224 match_permutations
0.00013 match_permutations_set
Timings with string_size=20 and words_size=1000000
--------------------------------------------------
2.19771 match_sort
8.38644 match_multiset
4.22723 match_perfect_hash
<takes "forever"> match_permutations
<takes "forever"> match_permutations_set
The results indicate that for small strings, the fastest approach searches permutations on the target string using set-intersection.
For larger strings, the fastest approach is the traditional sort-and-compare solution.
Hope you found this little algorithmic study as interesting as I have. The take-aways are:
- Sets, itertools, and collections make short work of problems like this.
- Big-oh running times matter (n-factorial disintegrates for large n).
- Constant overhead matters (sorting beats multisets because of hashing overhead).
- Discrete mathematics is a treasure trove of ideas.
- It is hard to know what is best until you do analysis and run timings :-)
Timing Set-up
FWIW, here is a test set-up I used to run the comparative timings:
from collections import Counter
from itertools import permutations
from string import letters
from random import choice
from operator import mul
from time import time
def match_sort(string, words):
target = sorted(string)
return sorted(word for word in words if sorted(word) == target)
def match_multiset(string, words):
target = Counter(string)
return sorted(word for word in words if Counter(word) == target)
primes = [2, 3, 5, 7, 11]
primes += [p for p in range(13, 1620) if all(pow(b, p-1, p) == 1 for b in (5, 11))]
anagram_hash = lambda s: reduce(mul, (primes[ord(c)] for c in s))
def match_perfect_hash(string, words):
target = anagram_hash(string)
return sorted(word for word in words if anagram_hash(word) == target)
def match_permutations(string, words):
perms = set(map(''.join, permutations(string)))
return sorted(word for word in words if word in perms)
def match_permutations_set(string, words):
perms = set(map(''.join, permutations(string)))
return sorted(words & perms)
string_size = 5
words_size = 1000000
population = letters[: string_size+2]
words = set()
for i in range(words_size):
word = ''.join([choice(population) for i in range(string_size)])
words.add(word)
string = word # Arbitrarily search use the last word as the target
print 'Timings with string_size=%d and words_size=%d' % (string_size, words_size)
for func in (match_sort, match_multiset, match_perfect_hash, match_permutations, match_permutations_set):
start = time()
func(string, words)
end = time()
print '%-10.5f %s' % (end - start, func.__name__)