
Given a dictionary of words and an initial character, find the longest possible word in the dictionary by successively adding a character to the word. At every step, the word must be a valid word in the dictionary.

e.g.: a -> at -> cat -> cart -> chart -> ...
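A chain like the one above can be checked mechanically. Here is a minimal sketch of such a check (the helper name is illustrative, and the dictionary is assumed to be a plain Python set of words):

```python
def is_valid_chain(chain, dictionary):
    """Check that every word after the first is in the dictionary and
    extends the previous word by exactly one inserted letter."""
    for prev, cur in zip(chain, chain[1:]):
        if cur not in dictionary:
            return False
        if len(cur) != len(prev) + 1:
            return False
        # cur must reduce to prev by deleting some single letter
        if not any(cur[:i] + cur[i+1:] == prev for i in range(len(cur))):
            return False
    return True
```

(The first element, the starting character, is deliberately not required to be in the dictionary.)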

AlgoMan
  • Interesting problem. What have you tried so far, and where are you stuck? – Michael Petrotta Mar 28 '10 at 18:55
  • Define "dictionary of words". Is it a hash table, a trie or what? If it's a trie, a simple DF search will work. Is it a suffix tree as your tags suggest? – IVlad Mar 28 '10 at 19:00
  • Sounds like a homework problem, poorly specified. – Trey Jackson Mar 28 '10 at 19:02
  • I believe it's actually an Amazon interview question. – user287792 Mar 28 '10 at 19:05
  • I don't see a problem with this question at all. – FogleBird Mar 28 '10 at 19:06
  • @IVlad DF? One feature the dictionary must give you is whether a word is valid or not (hash-table assumption). I thought using a suffix tree or trie for the dictionary would help check at each step whether the word is valid. The data structure is flexible. @Michael I thought of doing it in reverse: sort the dictionary words by length, take the longest word, try deleting a character, and keep checking whether the result is still a valid word. This is more of a brute-force method. – AlgoMan Mar 28 '10 at 19:12
  • If you cached the results, it wouldn't be a bad method at all... – user287792 Mar 28 '10 at 19:25
  • AlgoMan, more context and your own thoughts on the problem would help make this feel less of a "please solve my problem for me" request, and more of a directed question. – Ziv Mar 28 '10 at 19:53
  • @Ziv yeah, I got that. I am very new here. I will try to explain myself more clearly and will post my solution and what I tried so far. – AlgoMan Mar 28 '10 at 20:18
  • If it is a homework question, you should read this: http://meta.stackexchange.com/questions/10811/homework-on-stackoverflow That said, I strongly subscribe to this philosophy: "Conflicts with a school or instructor's policy on outside help are the responsibility of the student asking the question, not the SO community." So let's stop whining about it already. – FogleBird Mar 28 '10 at 20:21
  • I don't think a trie really helps here as the new letter is not always appended – jk. Mar 29 '10 at 10:28
  • @AlgoMan, What's the verdict? – FogleBird Mar 29 '10 at 21:20
  • It's not "difficult to tell what's being asked here" at all. You people are nuts. – FogleBird Mar 30 '10 at 02:11

3 Answers


The brute force approach would be to try adding letters to each available index using a depth-first search.

So, starting with 'a', there are two places you can add a new letter: in front of or behind the 'a', represented by dots below.

.a.

If you add a 't', there are now three positions.

.a.t.

You can try adding all 26 letters to each available position. The dictionary in this case can be a simple hashtable. If you add a 'z' in the middle, you get 'azt' which would not be in the hashtable so you don't continue down that path in the search.

Edit: Nick Johnson's graph made me curious what a graph of all maximal paths would look like. It's a large (1.6 MB) image here:

http://www.michaelfogleman.com/static/images/word_graph.png

Edit: Here's a Python implementation. The brute-force approach actually runs in a reasonable amount of time (a few seconds, depending on the starting letter).

import heapq

letters = 'abcdefghijklmnopqrstuvwxyz'

def search(words, word, path):
    # Depth-first search: yield every chain reachable from `word`
    # by inserting a single letter at any position, as long as each
    # new word is in the dictionary.
    path.append(word)
    yield tuple(path)
    for i in range(len(word) + 1):
        before, after = word[:i], word[i:]
        for c in letters:
            new_word = before + c + after
            if new_word in words:
                yield from search(words, new_word, path)
    path.pop()

def load(path):
    # Load the word list into a set for O(1) membership tests.
    result = set()
    with open(path) as f:
        for line in f:
            result.add(line.strip().lower())
    return result

def find_top(paths, n):
    # Rank chains by length, which equals the final word's length
    # since every step adds exactly one letter.
    gen = ((len(x), x) for x in paths)
    return heapq.nlargest(n, gen)

if __name__ == '__main__':
    words = load('TWL06.txt')
    gen = search(words, 'b', [])
    top = find_top(gen, 10)
    for path in top:
        print(path)

Of course, there will be a lot of ties in the answer. This will print the top N results, measured by length of the final word.

Output for starting letter 'a', using the TWL06 Scrabble dictionary.

(10, ('a', 'ta', 'tap', 'tape', 'taped', 'tamped', 'stamped', 'stampede', 'stampedes', 'stampeders'))
(10, ('a', 'ta', 'tap', 'tape', 'taped', 'tamped', 'stamped', 'stampede', 'stampeder', 'stampeders'))
(10, ('a', 'ta', 'tan', 'tang', 'stang', 'strang', 'strange', 'strangle', 'strangles', 'stranglers'))
(10, ('a', 'ta', 'tan', 'tang', 'stang', 'strang', 'strange', 'strangle', 'strangler', 'stranglers'))
(10, ('a', 'ta', 'tan', 'tang', 'stang', 'strang', 'strange', 'stranges', 'strangles', 'stranglers'))
(10, ('a', 'ta', 'tan', 'tang', 'stang', 'strang', 'strange', 'stranges', 'strangers', 'stranglers'))
(10, ('a', 'ta', 'tan', 'tang', 'stang', 'strang', 'strange', 'stranges', 'strangers', 'estrangers'))
(10, ('a', 'ta', 'tan', 'tang', 'stang', 'strang', 'strange', 'stranges', 'estranges', 'estrangers'))
(10, ('a', 'ta', 'tan', 'tang', 'stang', 'strang', 'strange', 'stranger', 'strangler', 'stranglers'))
(10, ('a', 'ta', 'tan', 'tang', 'stang', 'strang', 'strange', 'stranger', 'strangers', 'stranglers'))

And here are the results for each starting letter. (One exception: the single starting letter itself doesn't have to be in the dictionary; it only needs to extend to some valid 2-letter word.)

(10, ('a', 'ta', 'tap', 'tape', 'taped', 'tamped', 'stamped', 'stampede', 'stampedes', 'stampeders'))
(9, ('b', 'bo', 'bos', 'bods', 'bodes', 'bodies', 'boodies', 'bloodies', 'bloodiest'))
(1, ('c',))
(10, ('d', 'od', 'cod', 'coed', 'coped', 'comped', 'compted', 'competed', 'completed', 'complected'))
(10, ('e', 're', 'rue', 'ruse', 'ruses', 'rouses', 'arouses', 'carouses', 'carousels', 'carrousels'))
(9, ('f', 'fe', 'foe', 'fore', 'forge', 'forges', 'forgoes', 'forgoers', 'foregoers'))
(10, ('g', 'ag', 'tag', 'tang', 'stang', 'strang', 'strange', 'strangle', 'strangles', 'stranglers'))
(9, ('h', 'sh', 'she', 'shes', 'ashes', 'sashes', 'slashes', 'splashes', 'splashers'))
(11, ('i', 'pi', 'pin', 'ping', 'oping', 'coping', 'comping', 'compting', 'competing', 'completing', 'complecting'))
(7, ('j', 'jo', 'joy', 'joky', 'jokey', 'jockey', 'jockeys'))
(9, ('k', 'ki', 'kin', 'akin', 'takin', 'takins', 'takings', 'talkings', 'stalkings'))
(10, ('l', 'la', 'las', 'lass', 'lassi', 'lassis', 'lassies', 'glassies', 'glassines', 'glassiness'))
(10, ('m', 'ma', 'mas', 'mars', 'maras', 'madras', 'madrasa', 'madrassa', 'madrassas', 'madrassahs'))
(11, ('n', 'in', 'pin', 'ping', 'oping', 'coping', 'comping', 'compting', 'competing', 'completing', 'complecting'))
(10, ('o', 'os', 'ose', 'rose', 'rouse', 'rouses', 'arouses', 'carouses', 'carousels', 'carrousels'))
(11, ('p', 'pi', 'pin', 'ping', 'oping', 'coping', 'comping', 'compting', 'competing', 'completing', 'complecting'))
(3, ('q', 'qi', 'qis'))
(10, ('r', 're', 'rue', 'ruse', 'ruses', 'rouses', 'arouses', 'carouses', 'carousels', 'carrousels'))
(10, ('s', 'us', 'use', 'uses', 'ruses', 'rouses', 'arouses', 'carouses', 'carousels', 'carrousels'))
(10, ('t', 'ti', 'tin', 'ting', 'sting', 'sating', 'stating', 'estating', 'restating', 'restarting'))
(10, ('u', 'us', 'use', 'uses', 'ruses', 'rouses', 'arouses', 'carouses', 'carousels', 'carrousels'))
(1, ('v',))
(9, ('w', 'we', 'wae', 'wake', 'wakes', 'wackes', 'wackest', 'wackiest', 'whackiest'))
(8, ('x', 'ax', 'max', 'maxi', 'maxim', 'maxima', 'maximal', 'maximals'))
(8, ('y', 'ye', 'tye', 'stye', 'styed', 'stayed', 'strayed', 'estrayed'))
(8, ('z', 'za', 'zoa', 'zona', 'zonae', 'zonate', 'zonated', 'ozonated'))
FogleBird
  • +1. Cool to attack problems like this always with the brute-force technique. What would be the order of the above solution? – bragboy Mar 28 '10 at 19:29
  • Nice solution! Not the very most efficient since you have to test `26^n` possibilities for a solution of length `n-1`--which is way fewer than the number of words of length `n` if the word is not very short--but it definitely gets the job done. – Rex Kerr Mar 28 '10 at 19:48
  • Someone minused me without comment. Kinda funny considering I have a complete working solution. Way to go SO! – FogleBird Mar 28 '10 at 19:49
  • It's not 26^n. Most intermediate "nodes" in the search tree fail, pruning the search space significantly. If that wasn't the case this wouldn't work. – FogleBird Mar 28 '10 at 19:51
  • I believe the order is O(N^M), where N is the average number of children in the search tree (probably around 4 for an English dictionary) and M is the average length of the words in the dictionary. – FogleBird Mar 28 '10 at 20:19
  • I just checked with my program. Depending on the starting letter, there are only 14,858 leaves in the search tree on average. Max 80,116 (for the letter 'a'). – FogleBird Mar 28 '10 at 20:25
  • @Fogle: My mistake; it's only `26*(n+1)*N(n)` for level `n`, where `N(n)` is the number of words that pass at length `n`. Still not the most efficient possible, but definitely good enough. – Rex Kerr Mar 28 '10 at 20:40
  • Impressive graph. I'd still like to see the graph with just the minimal set of end nodes, though. :) – Nick Johnson Mar 29 '10 at 14:00
  • What do you mean by "minimal set of end nodes"? – FogleBird Mar 29 '10 at 14:09
  • A node is an 'end node' if it has no children. Given that some nodes are end nodes for several letters, there must be a selection of a set of end nodes such that every letter has a (maximal length) end node, but the set of all end nodes is smallest possible. – Nick Johnson Mar 29 '10 at 14:30
  • Ah, I see. That sounds tricky. I'll have to think about it more. – FogleBird Mar 29 '10 at 14:39
  • Link to the image is dead. – Ben Fulton Jul 17 '14 at 03:04

If you want to do this once, I'd do the following (generalized to the problem of starting with a full word):

Take your entire dictionary and throw away any word whose letters are not a superset of the letters in your target word (say the target has length m). Then bin the remaining words by length. For each word of length m+1, try dropping each letter and see if that yields your target word; if not, toss it. Then check each word of length m+2 against the valid set of length m+1, dropping any that can't be reduced. Keep going until you find an empty set; the last thing(s) you found will be the longest.
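The layer-by-layer pruning above can be sketched in Python. This is a minimal version under my own assumptions (the function name is illustrative, and the dictionary is any iterable of lowercase words):

```python
from collections import Counter, defaultdict

def longest_words_from(start, dictionary):
    """Return the longest words reachable from `start` by repeatedly
    inserting one letter, keeping every intermediate word valid."""
    need = Counter(start)
    # Keep only words whose letters are a superset of start's letters,
    # binned by length.
    by_len = defaultdict(set)
    for w in dictionary:
        if not need - Counter(w):
            by_len[len(w)].add(w)

    valid = {start}          # valid words of the current length
    length = len(start)
    while True:
        # A word of length n+1 survives if deleting some one letter
        # of it yields a valid word of length n.
        nxt = {w for w in by_len[length + 1]
               if any(w[:i] + w[i+1:] in valid for i in range(len(w)))}
        if not nxt:
            return valid     # the last non-empty layer is the longest
        valid = nxt
        length += 1
```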

If you want to make this fast to look up, I'd build a suffix-tree-like data structure.

Group all words by length. For each word of length 2, place each of its two characters in a "subword" set, and add that word to each of the characters' "superword" sets. Now you've got a link between all valid words of length 2 and all characters. Do the same with words of length 3 and valid words of length 2. Now you can start anywhere in this hierarchy and do a breadth-first search to find the deepest branch.
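The subword/superword links described above can be sketched like this (names are mine, not from the answer, and the search below is a depth-first variant of the suggested traversal; cycles are impossible because word length strictly increases along each link):

```python
from collections import defaultdict

def build_superword_graph(words):
    """For each word (or single character), record the valid words
    obtainable from it by inserting exactly one letter."""
    word_set = set(words)
    superwords = defaultdict(set)
    for w in word_set:
        for i in range(len(w)):
            sub = w[:i] + w[i+1:]
            # Link sub -> w if sub is itself valid, or is a bare character.
            if sub in word_set or len(sub) == 1:
                superwords[sub].add(w)
    return superwords

def deepest_chain(start, superwords):
    """Search from `start` and return one longest chain."""
    best = [start]
    stack = [[start]]
    while stack:
        path = stack.pop()
        if len(path) > len(best):
            best = path
        for nxt in superwords.get(path[-1], ()):
            stack.append(path + [nxt])
    return best
```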


Edit: the speed of this solution will depend greatly on the structure of the language, but if we build everything using sets with log(n) performance for all operations (i.e. red-black trees or the like), and we have N(m) words of length m, then forming the links between words of length m+1 and m will take approximately (m+1)*m*N(m+1)*log(N(m)) time (taking into account that string compares take time linear in the length of the string). Since we have to do this for all word lengths, the runtime for building the full data structure will be something on the order of

(typical word length)^3 * (dictionary length) * log (dictionary length / word length)

(The initial binning into words of a certain length will take linear time so can be neglected; the actual formula for runtime is complicated because it depends on the distribution of word lengths; for the case where you're doing it from a single word it's even more complicated because it depends on the expected number of longer words that have shorter subwords.)

Rex Kerr

Assuming you need to do this repeatedly (or you want the answer for every one of the 26 letters), do it backwards:

  1. Load a dictionary, and sort it by length, descending
  2. Establish a mapping, initially empty, between words and (extension, max_len) tuples.
  3. For each word in the sorted list:
    1. If it's already in the mapping, retrieve the max len.
    2. If it's not, set the max len to the word length.
    3. Examine each word produced by deleting a character. If that word is not in the mapping, or our max_len exceeds the max_len of the word already in the mapping, update the mapping with the current word and max_len

Then, to get the chain for a given prefix, simply start with that prefix and repeatedly look it and its extensions up in the mapping.

Here's the sample Python code:

words = [x.strip().lower() for x in open('/usr/share/dict/words')]
words.sort(key=len, reverse=True)
word_map = {}  # Maps words to (extension, max_len) tuples

for word in words:
  if word in word_map:
    max_len = word_map[word][1]
  else:
    max_len = len(word)
  # Try deleting each character; record this word as the extension
  # of the result if it leads to a longer final word.
  for i in range(len(word)):
    new_word = word[:i] + word[i+1:]
    if new_word not in word_map or word_map[new_word][1] < max_len:
      word_map[new_word] = (word, max_len)

# Get a chain for each letter
for term in "abcdefghijklmnopqrstuvwxyz":
  chain = [term]
  while term in word_map:
    term = word_map[term][0]
    chain.append(term)
  print(chain)

And its output for each letter of the alphabet:

['a', 'ah', 'bah', 'bach', 'brach', 'branch', 'branchi', 'branchia', 'branchiae', 'branchiate', 'abranchiate']
['b', 'ba', 'bac', 'bach', 'brach', 'branch', 'branchi', 'branchia', 'branchiae', 'branchiate', 'abranchiate']
['c', 'ca', 'cap', 'camp', 'campo', 'campho', 'camphor', 'camphory', 'camphoryl', 'camphoroyl']
['d', 'ad', 'cad', 'card', 'carid', 'carida', 'caridea', 'acaridea', 'acaridean']
['e', 'er', 'ser', 'sere', 'secre', 'secret', 'secreto', 'secretor', 'secretory', 'asecretory']
['f', 'fo', 'fot', 'frot', 'front', 'afront', 'affront', 'affronte', 'affronted']
['g', 'og', 'log', 'logy', 'ology', 'oology', 'noology', 'nosology', 'nostology', 'gnostology']
['h', 'ah', 'bah', 'bach', 'brach', 'branch', 'branchi', 'branchia', 'branchiae', 'branchiate', 'abranchiate']
['i', 'ai', 'lai', 'lain', 'latin', 'lation', 'elation', 'delation', 'dealation', 'dealbation']
['j', 'ju', 'jug', 'juga', 'jugal', 'jugale']
['k', 'ak', 'sak', 'sake', 'stake', 'strake', 'straked', 'streaked']
['l', 'la', 'lai', 'lain', 'latin', 'lation', 'elation', 'delation', 'dealation', 'dealbation']
['m', 'am', 'cam', 'camp', 'campo', 'campho', 'camphor', 'camphory', 'camphoryl', 'camphoroyl']
['n', 'an', 'lan', 'lain', 'latin', 'lation', 'elation', 'delation', 'dealation', 'dealbation']
['o', 'lo', 'loy', 'logy', 'ology', 'oology', 'noology', 'nosology', 'nostology', 'gnostology']
['p', 'pi', 'pig', 'prig', 'sprig', 'spring', 'springy', 'springly', 'sparingly', 'sparringly']
['q']
['r', 'ra', 'rah', 'rach', 'brach', 'branch', 'branchi', 'branchia', 'branchiae', 'branchiate', 'abranchiate']
['s', 'si', 'sig', 'spig', 'sprig', 'spring', 'springy', 'springly', 'sparingly', 'sparringly']
['t', 'ut', 'gut', 'gutt', 'gutte', 'guttle', 'guttule', 'guttulae', 'guttulate', 'eguttulate']
['u', 'ut', 'gut', 'gutt', 'gutte', 'guttle', 'guttule', 'guttulae', 'guttulate', 'eguttulate']
['v', 'vu', 'vum', 'ovum']
['w', 'ow', 'low', 'alow', 'allow', 'hallow', 'shallow', 'shallowy', 'shallowly']
['x', 'ox', 'cox', 'coxa', 'coxal', 'coaxal', 'coaxial', 'conaxial']
['y', 'ly', 'loy', 'logy', 'ology', 'oology', 'noology', 'nosology', 'nostology', 'gnostology']
['z', 'za', 'zar', 'izar', 'izard', 'izzard', 'gizzard']

Edit: Given the degree to which branches merge towards the end, I thought it would be interesting to draw a graph to demonstrate this:

Graph

An interesting extension of this challenge: It's likely there are several equilength final words for some letters. Which set of chains minimizes the number of final nodes (eg, merges the most letters)?

Nick Johnson
  • Nice. About 6-8x faster than mine. But this only yields one path for each starting letter, while mine yields all 380,000 possible paths (for all 26 letters combined). So ultimately it depends on what the OP would need the algorithm for. (P.S. 'abranchiate' isn't in my dictionary!) – FogleBird Mar 29 '10 at 12:17
  • Quite true. You could have it yield all paths, or just all max-length paths, by storing a list of extensions against each term, instead of just the longest one found so far. As far as the dictionary goes, I'm just using the one in /usr/share/dict/words on OSX 10.5. :) – Nick Johnson Mar 29 '10 at 13:06
  • +1 for the graph idea. Check out my answer for a graph of all maximal paths. :) – FogleBird Mar 29 '10 at 13:59