
I need to make a program that takes a file with a dictionary and an arbitrary string as input, then outputs all combinations of words from that dictionary that make up anagrams of the given string. For example, using the 100 most popular words of the English language and the string "i not work", I should get something like ['on it work', 'into work', 'not i work', 'know or it', 'work it no', 'to work in'], which I do.

The problem is that my program is far too inefficient: with 100 words in the dictionary, the practical limit is a string length of about 7 characters; everything beyond that takes far too long. I tried looking for various algorithms related to the problem, to no avail.

Here's how I search for anagrams:

from collections import defaultdict
from string import ascii_lowercase

def sortstring(string):
    return ''.join(sorted(string))

def simplify(all_strings):
    # Group candidate strings by their sorted letters,
    # so that all anagrams of one another share a single key.
    possible_strings = defaultdict(list)
    for string in all_strings:
        possible_strings[sortstring(string).strip()].append(string)
    return possible_strings

def generate(database, length, curstring="", curdata=None):
    # Recursively append words until the combined letter count
    # reaches the target length, collecting every combination.
    if curdata is None:
        curdata = set()
    if len(curstring.replace(" ", "")) > length:
        return set()
    if len(curstring.replace(" ", "")) == length:
        return curdata.union({curstring})
    for i in database:
        if len((curstring + i).replace(" ", "")) <= length:
            curdata = curdata.union(generate(database.difference({i}),
                length, curstring + " " + i, curdata))
            database = database.difference({i})
    return curdata

def analyse(database, input_string):
    cletters = countstring(input_string)
    strings = simplify(generate(database, cletters))
    data = list()
    sorted_string = sortstring(input_string).strip()
    if sorted_string in strings:
        data = strings[sorted_string]
    return len(strings), data

def countstring(string):
    # Total number of lowercase letters in the string.
    return sum(countletters(string).values())

def countletters(string):
    # Occurrences of each lowercase ascii letter.
    return {i: string.count(i) for i in ascii_lowercase}
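To make the grouping concrete, here is a small illustration (using the functions above): `sortstring` plus `strip` reduces any phrase to its sorted letters, so two phrases that are anagrams of each other share the same key in the dictionary built by `simplify`.

# Both phrases reduce to the same key, so simplify groups them as anagrams.
print(sortstring("on it work").strip())  # iknoortw
print(sortstring("into work").strip())   # iknoortw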

Can anyone suggest a way to improve on this? I suppose the algorithm I used should be ditched entirely: judging by how slow it is, its complexity is far too high. Just in case: the program should be efficient enough to support dictionaries of tens of thousands of words and strings of up to tens of characters, which is far beyond what my version manages.

Zenadix
Mashallah
  • A few tests (generation time by string length): 2 letters: 0.00085 s; 3 letters: 0.0039 s; 4 letters: 0.018 s; 5 letters: 0.05 s; 6 letters: 0.48 s; 7 letters: 4.2 s. – Mashallah Dec 11 '15 at 11:01
  • In other words, each additional letter multiplies execution time by about 3 to 10 times. And that's with a 100-word dictionary. – Mashallah Dec 11 '15 at 11:05
  • First write a function which identifies all words that can be formed from the letters of the string and then use this function in a back-tracking algorithm. – John Coleman Dec 11 '15 at 14:00
  • The first part is easy, but how would I backtrack it? – Mashallah Dec 11 '15 at 14:16
  • Sort of a tree traversal. Empty string is the root. Words in the dictionary that can be made from the letters are children. When you visit a node, the children of that node are the words that can be made from the remaining letters. Any such word also appears in the lists of possible words one level up -- so you should be able to tell which words are still possible very quickly. If you get to a node where there are remaining letters that can't be formed into any word -- back track. If you get to a node where no letters remain -- the path from root to that node is one of the anagrams you seek. – John Coleman Dec 11 '15 at 14:22
  • What would be the complexity of that algorithm? From the sound of it, I'd have to regenerate the list of possible words on every step, which would take a lot of time with a long dictionary as I'd have to go through all of it again every time. Or did I misunderstand you? – Mashallah Dec 11 '15 at 14:25
  • You don't have to regenerate the list from scratch at each stage -- as you move down the tree you throw possible words away, you don't add new ones. The whole dictionary only needs to be processed once. – John Coleman Dec 11 '15 at 14:28
  • I think I see what you mean. I'll try implementing this algorithm (leaving my original algorithm commented) and comparing the computation speed. – Mashallah Dec 11 '15 at 14:29
  • I implemented it and it resulted in only a slight optimisation. Using a 58000-word dictionary and the string "there is none" for a test, the program takes about 37 seconds to find all anagrams. – Mashallah Dec 11 '15 at 15:29
  • I beautified your code. The lack of whitespace was hurting my oversensitive eyes. – Zenadix Dec 11 '15 at 16:28

2 Answers

I resolved a part of the issue myself by fixing the for-if antipattern in the generator code:

def generate(database, length, letters, curstring="", curdata=None):
    if curdata is None:
        curdata = set()
    if len(curstring.replace(" ", "")) > length:
        return set()
    if len(curstring.replace(" ", "")) == length:
        return curdata.union({curstring})
    # Prune this branch as soon as it uses any letter more often
    # than the input string provides.
    t = countletters(curstring)
    for i in ascii_lowercase:
        if t[i] > letters[i]:
            return set()
    for i in database:
        # Skip words that would exceed the available letter counts.
        t = countletters(curstring + i)
        if any(t[j] > letters[j] for j in ascii_lowercase):
            continue
        if sum(t.values()) <= length:
            curdata = curdata.union(generate(database.difference({i}),
                length, letters, curstring + " " + i, curdata))
            database = database.difference({i})
    return curdata

It is much, much faster now, but it is still slow when the dictionary contains tens of thousands of words and/or the input string is long.
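One further speedup, sketched below (the helpers `precompute_counts` and `can_extend` are hypothetical names, not part of the code above), is to count each word's letters once up front with `collections.Counter`, so the inner loop no longer recounts all of `curstring + i` on every iteration:

from collections import Counter

def precompute_counts(database):
    # Count each word's letters once, up front.
    return {word: Counter(word) for word in database}

def can_extend(used, word_count, letters):
    # True if adding this word keeps every letter within the available budget.
    return all(used[c] + n <= letters[c] for c, n in word_count.items())

# Example: 'work' fits within the letters of "i not work", 'word' does not.
letters = Counter("i not work".replace(" ", ""))
counts = precompute_counts({"work", "word", "not"})
print(can_extend(Counter(), counts["work"], letters))  # True
print(can_extend(Counter(), counts["word"], letters))  # False

With the counts cached, testing a candidate word costs O(len(word)) instead of rescanning the whole accumulated string against all 26 letters.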

Mashallah

Here is a recursive approach implementing the tree approach I suggested in the comments:

def frequencyDict(s):
    s = s.lower()
    d = {}
    for c in s:
        if c.isalpha():
            if c in d:
                d[c] += 1
            else:
                d[c] = 1
    return d

def canMake(w,fdict):
    d = frequencyDict(w)
    return all(d[c] <= fdict.get(c,0) for c in d)

def candidates(wlist,fdict):
    return [w for w in wlist if canMake(w,fdict)]

def anagrams(wlist,fdict):
    if len(wlist) == 0 or len(fdict) == 0:
        return "no anagrams"
    hits = []
    firstWords = candidates(wlist,fdict)
    if len(firstWords) == 0:
        return "no anagrams"
    for w in firstWords:
        #create reduced frequency dict
        d = fdict.copy() 
        for c in w:
            d[c] -= 1
            if d[c] == 0: del d[c]
        #if d is empty, the first word is also the last word
        if len(d) == 0:
            hits.append(w)
        else:
            #create reduced word list
            rlist = [v for v in wlist if canMake(v,d)]
            tails = anagrams(rlist, d)
            if tails != "no anagrams":
                hits.extend(w + " " + t for t in tails)
    if len(hits) == 0:
        return "no anagrams"
    else:
        return hits

def findAnagrams(wlist,s):
    return anagrams(wlist,frequencyDict(s.lower()))

with open("linuxwords.txt") as f:
    words = f.read().split('\n')
words = [w.strip().lower() for w in words if '-' not in w]
test = findAnagrams(words, "Donald Trump")

It takes about 20 seconds to find all 730 anagrams of "Donald Trump" drawn from an old Linux word list. My favorite is "damp nut lord".

John Coleman
  • It seems slower than my current implementation. In a test case with 2645 anagrams, my program took about 3 seconds to find them all. – Mashallah Dec 11 '15 at 17:30
  • @Mashallah I'm sure that there are optimizations. Also, my code finds all *ordered* anagrams ("damp nut lord", "lord nut damp", etc., for all 3! = 6 permutations). An optimization would be to write a function which returns *multisets* such as `{'damp','nut','lord'}` rather than *strings*. – John Coleman Dec 11 '15 at 17:42
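A minimal sketch of that deduplication idea (the helper `unique_anagrams` is a hypothetical name, not from the answer): canonicalize each ordered anagram as the sorted tuple of its words, so every permutation of the same words collapses to a single entry.

def unique_anagrams(anagram_strings):
    # Collapse ordered anagrams ("damp nut lord", "lord nut damp", ...)
    # into a single entry by sorting each anagram's words.
    seen = set()
    unique = []
    for s in anagram_strings:
        key = tuple(sorted(s.split()))
        if key not in seen:
            seen.add(key)
            unique.append(" ".join(key))
    return unique

print(unique_anagrams(["damp nut lord", "lord nut damp", "nut damp lord"]))
# ['damp lord nut']

Using a sorted tuple rather than a set as the key also keeps anagrams that repeat a word distinct from those that use it only once.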