Filter a Set for Matching String Permutations

Question

I am trying to use itertools.permutations() to generate all permutations of a string and return only those that are members of a set of words.

import itertools

def permutations_in_dict(string, words): 
    '''
    Parameters
    ----------
    string : {str}
    words : {set}

    Returns
    -------
    list : {list} of {str}    

    Example
    -------
    >>> permutations_in_dict('act', {'cat', 'rat', 'dog', 'act'})
    ['act', 'cat']
    '''

My current solution works fine in the terminal, but somehow it doesn't pass the test case...

return list(set([''.join(p) for p in itertools.permutations(string)]) & words)

Any help will be appreciated.

What exactly is the test case? If you are comparing the results to `['act', 'cat']` perhaps you need to ignore the ordering and create a set. — Jacques Kvam, Jul 01 '17 at 06:25
Indeed my output is ['cat','act'] which does not match with ['act','cat']. The ordering of set is random, right? Then how can I ignore/match with it? @JacquesKvam — Meruemu, Jul 01 '17 at 06:32
How does it fail? Time could be a problem, creating all permutations of `string` explodes quite rapidly with the length of the string. Or if order matters then you may need to `sorted(...)` the result. — AChampion, Jul 01 '17 at 06:46
I just posted a comparative analysis of the various approaches. It turns out that for small ``len(string)``, @Meruemu had the fastest approach by using set-intersection to search permutations of the target string. For somewhat larger ``len(string)``, the sort-and-compare approach is best. The Counter/multiset solution is second-best in all normal cases due to the overhead of hashing. However, the Counter/multiset approach would eventually beat sort-and-compare if all the input strings were *very* large. — Raymond Hettinger, Jul 01 '17 at 07:56
Sorting the results is one way to satisfy the test case. However, if sorted output is required, the description should say so. On the other hand, if the order of the results is arbitrary, you can alter the test case to apply set to the result. Either way, you should seek clarification from your client. — John La Rooy, Jul 07 '17 at 21:27

Raymond Hettinger · Answer 1 · 2017-07-01T13:15:07.947

Problem Category

The problem you're solving is best described as testing for anagram matches.

Solution using Sort

The traditional solution is to sort the target string, sort the candidate string, and test for equality.

>>> def permutations_in_dict(string, words):
        target = sorted(string)
        return sorted(word for word in words if sorted(word) == target)

>>> permutations_in_dict('act', {'cat', 'rat', 'dog', 'act'})
['act', 'cat']

Solution using Multisets

Another approach is to use collections.Counter() to make a multiset equality test. This is algorithmically superior to the sort solution (O(n) versus O(n log n)) but tends to lose unless the size of the strings is large (due to the cost of hashing all the characters).

>>> from collections import Counter
>>> def permutations_in_dict(string, words):
        target = Counter(string)
        return sorted(word for word in words if Counter(word) == target)

>>> permutations_in_dict('act', {'cat', 'rat', 'dog', 'act'})
['act', 'cat']
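
To see why Counter equality works as an anagram test, here is a small sketch. The helper name is_anagram is made up for illustration and is not part of the solution above. Two strings compare equal as multisets exactly when every character occurs the same number of times in both.

```python
from collections import Counter

def is_anagram(a, b):
    # Hypothetical helper: equal character multisets mean that one
    # string is a permutation of the other.
    return Counter(a) == Counter(b)

print(is_anagram('act', 'cat'))   # True
print(is_anagram('act', 'rat'))   # False
print(is_anagram('aab', 'abb'))   # False: same letters, different counts
```

The last case is the one a plain set comparison of the letters would get wrong, which is why a multiset is needed.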

Solution using a Perfect Hash

A unique anagram signature or perfect hash can be constructed by multiplying prime numbers corresponding to each possible character in a string.

The commutative property of multiplication guarantees that the hash value will be invariant for any permutation of a single string. The uniqueness of the hash value is guaranteed by the fundamental theorem of arithmetic (also known as the unique prime factorization theorem).

>>> from functools import reduce    # reduce() is a builtin only in Python 2
>>> from operator import mul
>>> primes = [2, 3, 5, 7, 11]
>>> primes += [p for p in range(13, 1620) if all(pow(b, p-1, p) == 1 for b in (5, 11))]
>>> anagram_hash = lambda s: reduce(mul, (primes[ord(c)] for c in s))
>>> def permutations_in_dict(string, words):
        target = anagram_hash(string)
        return sorted(word for word in words if anagram_hash(word) == target)

>>> permutations_in_dict('act', {'cat', 'rat', 'dog', 'act'})
['act', 'cat']
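
For a concrete feel of the prime-product signature, here is a minimal Python 3 sketch. It assumes the input contains only lowercase ASCII letters, and the names SMALL_PRIMES, LETTER_PRIME and signature are invented for this illustration (the session above indexes a longer prime table by ord(c) instead).

```python
from math import prod  # available in Python 3.8+
from string import ascii_lowercase

# First 26 primes, one per lowercase letter (assumption: input is a-z only).
SMALL_PRIMES = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41,
                43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101]
LETTER_PRIME = dict(zip(ascii_lowercase, SMALL_PRIMES))

def signature(s):
    # Product of the primes assigned to each character; the order of the
    # characters does not matter because multiplication is commutative.
    return prod(LETTER_PRIME[c] for c in s)

print(signature('act'), signature('cat'))   # 710 710
print(signature('rat'))                     # 8662
```

Here 'a', 'c' and 't' map to 2, 5 and 71, so both 'act' and 'cat' give 2 * 5 * 71 = 710, while 'rat' gives 61 * 2 * 71 = 8662.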

Solution using Permutations

Searching by permutations on the target string using itertools.permutations() is reasonable when the string is small (generating the permutations of an n-length string produces n factorial candidates).

The good news is that when n is small and the number of words is large, this approach runs very fast (because set membership testing is O(1)):

>>> from itertools import permutations
>>> def permutations_in_dict(string, words):
        perms = set(map(''.join, permutations(string)))
        return sorted(word for word in words if word in perms)

>>> permutations_in_dict('act', {'cat', 'rat', 'dog', 'act'})
['act', 'cat']

As the OP surmised, the pure-Python search loop can be sped up to C speed by using set.intersection():

>>> def permutations_in_dict(string, words):
        perms = set(map(''.join, permutations(string)))
        return sorted(words & perms)

>>> permutations_in_dict('act', {'cat', 'rat', 'dog', 'act'})
['act', 'cat']
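
To put the factorial growth into numbers, the following sketch prints how many tuples itertools.permutations() would yield for a few string lengths. It also shows that repeated letters produce duplicate strings, which is what the set() call in the functions above removes. The variable name raw is only for illustration.

```python
from itertools import permutations
from math import factorial

for n in (5, 10, 20):
    # Number of tuples permutations() yields for a string of length n.
    print(n, factorial(n))
# 5 120
# 10 3628800
# 20 2432902008176640000

# Repeated letters yield duplicate strings; set() collapses them.
raw = [''.join(p) for p in permutations('aab')]
print(len(raw), sorted(set(raw)))   # 6 ['aab', 'aba', 'baa']
```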

Best Solution

Which solution is best depends on the length of string and the length of words. Timings will show which is best for a particular problem.

Here are some comparative timings for the various approaches using two different string sizes:

Timings with string_size=5 and words_size=1000000
-------------------------------------------------
0.01406    match_sort
0.06827    match_multiset
0.02167    match_perfect_hash
0.00224    match_permutations
0.00013    match_permutations_set

Timings with string_size=20 and words_size=1000000
--------------------------------------------------
2.19771    match_sort
8.38644    match_multiset
4.22723    match_perfect_hash
<takes "forever"> match_permutations
<takes "forever"> match_permutations_set

The results indicate that for small strings, the fastest approach searches permutations on the target string using set-intersection.

For larger strings, the fastest approach is the traditional sort-and-compare solution.

Hope you found this little algorithmic study as interesting as I have. The take-aways are:

Sets, itertools, and collections make short work of problems like this.
Big-oh running times matter (n-factorial disintegrates for large n).
Constant overhead matters (sorting beats multisets because of hashing overhead).
Discrete mathematics is a treasure trove of ideas.
It is hard to know what is best until you do analysis and run timings :-)

Timing Set-up

FWIW, here is a test set-up I used to run the comparative timings:

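from collections import Counter
from functools import reduce                   # reduce() is not a builtin in Python 3
from itertools import permutations
from string import ascii_letters as letters    # string.letters was removed in Python 3
from random import choice
from operator import mul
from time import time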

def match_sort(string, words):
    target = sorted(string)
    return sorted(word for word in words if sorted(word) == target)

def match_multiset(string, words):
    target = Counter(string)
    return sorted(word for word in words if Counter(word) == target)

primes = [2, 3, 5, 7, 11]
primes += [p for p in range(13, 1620) if all(pow(b, p-1, p) == 1 for b in (5, 11))]
anagram_hash = lambda s: reduce(mul, (primes[ord(c)] for c in s))

def match_perfect_hash(string, words):
    target = anagram_hash(string)
    return sorted(word for word in words if anagram_hash(word) == target)

def match_permutations(string, words):
    perms = set(map(''.join, permutations(string)))
    return sorted(word for word in words if word in perms)

def match_permutations_set(string, words):
    perms = set(map(''.join, permutations(string)))
    return sorted(words & perms)

string_size = 5
words_size = 1000000

population = letters[: string_size+2]
words = set()
for i in range(words_size):
    word = ''.join([choice(population) for i in range(string_size)])
    words.add(word)
string = word                # Arbitrarily use the last word as the target

print('Timings with string_size=%d and words_size=%d' % (string_size, words_size))
for func in (match_sort, match_multiset, match_perfect_hash, match_permutations, match_permutations_set):
    start = time()
    func(string, words)
    end = time()
    print('%-10.5f %s' % (end - start, func.__name__))

That simple traditional sort-and-compare solution is not fastest anymore, others are up to [6 times faster](https://stackoverflow.com/a/72906786/12671057) in your (slightly modified) benchmarks. — Kelly Bundy, Jul 08 '22 at 05:30

AChampion · Accepted Answer · 2017-07-01T06:54:19.100

You can simply use collections.Counter() to compare the words to the string without creating all permutations (their number explodes with the length of the string):

from collections import Counter

def permutations_in_dict(string, words):
    c = Counter(string)
    return [w for w in words if c == Counter(w)]

>>> permutations_in_dict('act', {'cat', 'rat', 'dog', 'act'})
['cat', 'act']

Note: sets are unordered so if you need a specific order you may need to sort the result, e.g. return sorted(...)
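
As a short illustration of the two ways to make the check independent of ordering, assuming the test case compares against ['act', 'cat'] (the variable result stands in for whatever the function returned):

```python
expected = ['act', 'cat']
result = ['cat', 'act']          # one possible order when iterating over a set

print(result == expected)              # False: lists compare element by element
print(sorted(result) == expected)      # True: sorting gives a deterministic order
print(set(result) == set(expected))    # True: sets ignore order
```

Sorting inside the function fixes the output itself, while comparing sets only changes the test, so which one is appropriate depends on what the test case actually requires.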

score 3 · Answer 3 · answered Jul 01 '17 at 06:32

Apparently you're expecting output to be sorted alphabetically, so this should do:

return sorted(set(''.join(p) for p in itertools.permutations(string)) & words)

Błotosmętek

score 1 · Answer 4 · answered Jul 01 '17 at 06:33

Try this solution

list(map("".join, itertools.permutations('act')))
['act', 'atc', 'cat', 'cta', 'tac', 'tca']

We can call it listA

listA = list(map("".join, itertools.permutations('act')))

Your list is listB

listB = ['cat', 'rat', 'dog', 'act']

Then use set intersection

list(set(listA) & set(listB))
['cat', 'act']

MishaVacic

Why go through the added step of converting everything to lists? You might as well use set literals (`{'cat', 'rat', 'dog', 'act'}`) and leave out the `list(...)` for `listA` – Synthetica Sep 11 '17 at 11:44

Kelly Bundy · Answer 5 · 2022-07-08T06:00:07.023

We can be a lot faster, at the expense of longer code. Reusing Raymond's benchmark and solutions (mine are prefixed with Kelly_):

Timings with string_size=20 and words_size=100000
154 ms ±  3 ms  match_sort
 31 ms ±  3 ms  Kelly_match_sort
291 ms ± 12 ms  match_multiset
 26 ms ±  0 ms  Kelly_match_counts
424 ms ± 15 ms  match_perfect_hash
207 ms ±  7 ms  Kelly_match_perfect_hash
164 ms ±  9 ms  Kelly_match_perfect_hash2

Timings with string_size=5 and words_size=100000
  7 ms ±  0 ms  match_sort
  3 ms ±  0 ms  Kelly_match_sort
 26 ms ±  0 ms  match_multiset
  3 ms ±  0 ms  Kelly_match_counts
 21 ms ±  0 ms  match_perfect_hash
 10 ms ±  0 ms  Kelly_match_perfect_hash
  9 ms ±  0 ms  Kelly_match_perfect_hash2

My Kelly_match_sort is like Raymond's match_sort, but for each word, I first check whether its count of the search string's most common letter matches. Only if it does, I then also do the sorting check. In the above two benchmarks, this pre-check already rules out about 94% and 86% of the words, respectively.

My Kelly_match_counts is similar to match_multiset in that it compares letter counts. But instead of using Counter(word) to count all letters, I count them individually with word.count(), going from the most to the least common letter in the search string. As soon as I find a mismatch, I reject the word and move on to the next word. As mentioned above, very often that already happens at the first letter.

My Kelly_match_perfect_hash is like match_perfect_hash, but uses math.prod instead of reduce with mul, and maps letters to primes directly, using a dictionary (instead of going through ord to index a list). And I use map with the dictionary's get method instead of a generator expression.

The Kelly_match_perfect_hash2 version maps only letters, so it doesn't waste the smallest primes on characters that don't even appear in words.
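
A rough way to see the effect of mapping only letters is to compare the size of the two products for the same word. The sketch below is only an illustration: first_primes is a made-up helper using plain trial division (not the Fermat-style test from the benchmark code), and by_code_point and by_letter are hypothetical names for the two mappings.

```python
from math import prod
from string import ascii_letters

def first_primes(count):
    # Simple trial division; fine for the few hundred primes needed here.
    found = []
    candidate = 2
    while len(found) < count:
        if all(candidate % p for p in found):
            found.append(candidate)
        candidate += 1
    return found

primes = first_primes(128)
by_code_point = {chr(i): p for i, p in enumerate(primes)}   # like primes[ord(c)]
by_letter = dict(zip(ascii_letters, primes))                # letters only

word = 'permutation'
# Bit lengths of the two signatures; the letters-only mapping gives the smaller one.
print(prod(by_code_point[c] for c in word).bit_length())
print(prod(by_letter[c] for c in word).bit_length())
```

With the code-point mapping, 'a' sits at index 97 of the prime table, whereas the letters-only mapping gives it the first prime, so the integers being multiplied and compared are much smaller.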

Full code:

from collections import Counter
from itertools import permutations
from string import ascii_letters as letters
from random import choice
from operator import mul
from time import time
from functools import reduce
from math import prod
from statistics import mean, stdev

def match_sort(string, words):
    target = sorted(string)
    return sorted(word for word in words if sorted(word) == target)

def match_multiset(string, words):
    target = Counter(string)
    return sorted(word for word in words if Counter(word) == target)

primes = [2, 3, 5, 7, 11]
primes += [p for p in range(13, 1620) if all(pow(b, p-1, p) == 1 for b in (5, 11))]
anagram_hash = lambda s: reduce(mul, (primes[ord(c)] for c in s))

def match_perfect_hash(string, words):
    target = anagram_hash(string)
    return sorted(word for word in words if anagram_hash(word) == target)


def Kelly_match_sort(string, words):
    if not string:
        return [word for word in words if not word]   # sets have no .count(); works for sets and lists
    K = max(string, key=string.count)
    V = string.count(K)
    target = sorted(string)
    return sorted(
        word for word in words
        if word.count(K) == V
        if sorted(word) == target
    )


def Kelly_match_counts(string, words):
    if not string:
        return [word for word in words if not word]   # sets have no .count(); works for sets and lists
    (K, V), *kvs = Counter(string).most_common()
    matches = []
    for word in words:
        if word.count(K) == V:
            for k, v in kvs:
                if word.count(k) != v:
                    break
            else:
                if len(word) == len(string):
                    matches.append(word)
    matches.sort()
    return matches


primes2 = {chr(i): p for i, p in enumerate(primes)}.get
anagram_hash2 = lambda s: prod(map(primes2, s))

def Kelly_match_perfect_hash(string, words):
    target = anagram_hash2(string)
    return sorted(word for word in words if anagram_hash2(word) == target)


primes3 = dict(zip(letters, primes)).get
anagram_hash3 = lambda s: prod(map(primes3, s))

def Kelly_match_perfect_hash2(string, words):
    target = anagram_hash3(string)
    return sorted(word for word in words if anagram_hash3(word) == target)


funcs = [
    match_sort,
    Kelly_match_sort,
    match_multiset,
    Kelly_match_counts,
    match_perfect_hash,
    Kelly_match_perfect_hash,
    Kelly_match_perfect_hash2,
]

string_size = 20
words_size = 100000

print('Timings with string_size=%d and words_size=%d' % (string_size, words_size))

times = {func: [] for func in funcs}
for _ in range(10):
    population = letters[: string_size+2]
    words = set()
    for i in range(words_size):
        word = ''.join([choice(population) for i in range(string_size)])
        words.add(word)
    string = word                # Arbitrarily use the last word as the target

    for func in funcs:
        start = time()
        func(string, words)
        end = time()
        times[func].append(end - start)

for func in funcs:
    ts = [t * 1e3 for t in times[func]]
    print('%3d ms ± %2d ms ' % (mean(ts), stdev(ts)), func.__name__)

score -1 · Answer 6 · answered Jan 31 '18 at 04:14

Why even bother with permutations? This is a much simpler problem if you look at the words as dictionaries of letters. I'm sure that there's a comprehension to do it better than this, but:

    letters = dict()
    for i in word:
      letters[i] = letters.get(i, 0) + 1

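Do this for the target string, then for each word in the set, and compare the two dictionaries. A word is a permutation of the target only if every key has the same value in both (a "greater than or equal" check would also accept words that merely contain the target's letters). If they match, add the word to your output.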

Added bonus: this should be easy to parallelize if your list of words is exceedingly long.
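
A minimal sketch of this dictionary-counting idea, assuming an exact permutation match is wanted, so the counts are compared for equality. The helper name letter_counts is made up for this illustration.

```python
def letter_counts(word):
    # Build a {letter: occurrences} dictionary with the same loop as above.
    counts = {}
    for ch in word:
        counts[ch] = counts.get(ch, 0) + 1
    return counts

def permutations_in_dict(string, words):
    target = letter_counts(string)
    # Equal dictionaries mean equal letter counts, i.e. the word is a
    # permutation of the target string.
    return sorted(w for w in words if letter_counts(w) == target)

print(permutations_in_dict('act', {'cat', 'rat', 'dog', 'act'}))   # ['act', 'cat']
```

This is essentially what collections.Counter does in the other answers, written out by hand.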