I need to make a program that takes a file with a dictionary and an arbitrary string as an input and then outputs all combinations of words from that dictionary that make up anagrams of the given string.
For example, using the 100 most popular words of the English language and the string "i not work"
, I should get something like [' on it work', ' into work', ' not i work', ' know or it', ' work it no', ' to work in']
, which I do.
The problem is that my program is far too inefficient: with 100 words in the dictionary the practical limit is 7 characters for the string length, everything after that takes far too long. I tried looking for various algorithms related to the matter to no avail.
Here's how I search for anagrams:
def sortstring(string):
return ''.join(sorted(string))
def simplify(all_strings):
possible_strings = defaultdict(list)
for string in all_strings:
possible_strings[sortstring(string).strip()].append(string)
return possible_strings
def generate(database, length,curstring="", curdata=set()):
if len(curstring.replace(" ", "")) > length:
return set()
if len((curstring).replace(" ", "")) == length:
return curdata.union(set([curstring]))
for i in database:
if len((curstring+i).replace(" ", "")) <= length:
curdata = curdata.union(generate(database.difference(set([i])),
length, curstring+" "+i, curdata))
database = database.difference(set([i]))
return curdata
def analyse(database, input_string):
cletters = countstring(input_string)
strings = simplify(generate(database, cletters))
data = list()
sorted_string = sortstring(input_string).strip()
if sorted_string in strings.keys():
data = strings[sorted_string]
return len(strings.values()), data
def countstring(string):
a = countletters(string)
return sum(a.values())
def countletters(string):
result = {}
for i in ascii_lowercase:
result[i] = string.count(i)
return result
Can anyone suggest a way to improve on this? Though I suppose that the algorithm I used should be completely ditched given that the complexity seems far too high because of how slow it is. Just in case: the program should be efficient enough to support dictionaries of tens of thousands of words and strings up to tens of characters. That's far better than what I did.