8

I have a large database for solving crossword puzzles, consisting of words and their descriptions. My application allows searching for words of a specific length with given characters at specific positions (currently done the hard way: go through all the words and check each one), plus a search by description if necessary.

For instance: find the word _ _ A _ _ B (a 6-letter word with the third character A and the last character B).

I would like to index the words in such a way that searching would be really fast. My first idea was to use a balanced tree structure, any other suggestions?
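
Roughly, the current "hard way" search looks like this (a simplified sketch, not the actual code):

def search_bruteforce(words, pattern):
    """pattern: e.g. '__A__B', where '_' matches any character."""
    return [w for w in words
            if len(w) == len(pattern)
            and all(p == '_' or p == c for p, c in zip(pattern, w))]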

Drejc
  • 14,196
  • 16
  • 71
  • 106
  • I don't know if it's fast execution-wise, but code-wise: I have a puzzle generator that fills in a blank puzzle. To do this I just stored the words in length-specific ADO data tables and then created a view to get the matches. In your case, you already have the filter "__A__B" to create the view. The best part of this is I get to use my favorite code for a key part of my application: someone else's. – No Refunds No Returns Feb 18 '10 at 14:01
  • The application was actually written for my father, who is a dedicated crossword puzzler. He also manually checks and enters new words and descriptions. For his purposes the solution is OK. But I'm thinking of getting his data out in some web-based form, so the search part must be improved. – Drejc Feb 18 '10 at 14:35

5 Answers

9

Okay, I am going to propose something weird. Coming from C++, I have been using Boost for a long time, and there I came across the MultiIndex library.

The idea of this library is to create one collection, but to have many different ways to query it. It could, in fact, model a database.

So, let's put our words in a table, and put the necessary indexes in place:

word                     |length|c0|c1|c2| ... |c26|
-------------------------|------|--|--|--| ... |---|
Singapour                |9     |S |i |n | ... |0  |

Now the query will look like:

Select word From table Where length=9 And c2='n' And c7='u';

Easy enough, isn't it?

For maximum efficiency, the table should be partitioned on the length, and the indexes (one per cX column) should be local to the partition.
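
For illustration, here is a minimal sketch of that layout, with one SQLite table per length standing in for a partition (table and index names are mine):

import sqlite3

conn = sqlite3.connect(':memory:')
# One table per word length ("partition"), one column per position.
conn.execute('CREATE TABLE words9 (word TEXT, c0, c1, c2, c3, c4, c5, c6, c7, c8)')
for i in range(9):
    conn.execute('CREATE INDEX idx9_c%d ON words9 (c%d)' % (i, i))

word = 'Singapour'
conn.execute('INSERT INTO words9 VALUES (?,?,?,?,?,?,?,?,?,?)', (word, *word))

rows = conn.execute("SELECT word FROM words9 WHERE c2 = 'n' AND c7 = 'u'")
print([r[0] for r in rows])   # ['Singapour']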

For an in-memory solution, you would have one container per word length, containing as many indexes as there are character positions, each index being a hash table pointing to a set of words (intersecting the smallest sets first makes the merge cheap).

Here is a Python description:

import collections

class Dictionary:
    def __init__(self, length):
        self.length = length
        self.words = set()
        # Maps a (position, character) pair to the set of words that
        # have that character at that position.
        self.indexes = collections.defaultdict(set)

    def add(self, word):
        if len(word) != self.length:
            raise ValueError('%s is not %d characters long' % (word, self.length))

        if word in self.words:
            raise ValueError(word + ' is already in the dictionary')

        self.words.add(word)

        for i in range(self.length):
            self.indexes[(i, word[i])].add(word)

    def search(self, constraints):
        """constraints: list of (position, character) tuples"""
        if not constraints:
            return set(self.words)

        sets = [self.indexes[c] for c in constraints]
        # Start from the smallest set so intermediate results stay small.
        sets.sort(key=len)
        return set.intersection(*sets)

I have deliberately required the length argument, to keep the hash tables small and thus make the search faster. Also, the sets are sorted by size so that computing the intersection is cheaper :)

Go ahead and test it against other solutions if you wish :)
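
For instance, a quick usage sketch (with made-up words):

d = Dictionary(6)
for w in ('SCARAB', 'SCRIBE', 'ABACUS'):
    d.add(w)

# Pattern _ _ A _ _ B: 'A' at index 2, 'B' at index 5.
print(d.search([(2, 'A'), (5, 'B')]))   # {'SCARAB'}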

Matthieu M.
  • 287,565
  • 48
  • 449
  • 722
4

This question, "Good algorithm and data structure for looking up words with missing letters?", started out exactly like the one you're asking, but then it was edited to something rather different and easier. Still, you can find some ideas there.

In short, everyone recommends loading the whole dictionary into memory and dividing the words into groups based on their length. From there, you can go many different directions. The more memory you are willing to use up, the faster you can go.

One nice suggestion is to keep a hash table of lists of words of a given length that have a given letter in a given position. You can build it like this (in Python):

import collections

# Build a whole lot of sorted word lists.
# (all_words is your full dictionary, e.g. a list of strings.)
wordlists = collections.defaultdict(list)
for word in sorted(all_words):
    for position, letter in enumerate(word):
        wordlists[len(word), position, letter].append(word)

Now if you need a 6-letter word ending in B, you can just ask for wordlists[6, 5, 'B'] and you've got the complete list. When you know more than one letter, as in ..A..B, you can pick whichever list is shortest and test each word against the desired pattern. My computer's dictionary only has 21 six-letter words ending with B, of which only SCARAB matches.
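
Given that table, a sketch of the lookup (my own helper, assuming at least one letter of the pattern is known):

def find(pattern, wordlists):
    """pattern: e.g. '..A..B', where '.' means any letter."""
    constraints = [(i, c) for i, c in enumerate(pattern) if c != '.']
    # Start from the shortest candidate list...
    candidates = min((wordlists[len(pattern), i, c] for i, c in constraints),
                     key=len)
    # ...and test each candidate against the full pattern.
    return [w for w in candidates
            if all(w[i] == c for i, c in constraints)]

print(find('..A..B', wordlists))   # e.g. ['SCARAB']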

Jason Orendorff
  • 42,793
  • 6
  • 62
  • 96
  • Reading the linked question/answers, a regex might also provide a solution, in case the words are separated by length in advance. It's hard to tell which is the right way to go without testing and some measuring. – Drejc Feb 19 '10 at 07:50
2

Since you use a database, create a Suffixes table.
For example:

  Suffix          |   WordID   | SN
  ----------------+------------+----   
  StackOverflow           10      1
  tackOverflow            10      2
  ackOverflow             10      3
  ckOverflow              10      4
  kOverflow               10      5
  ...

With that table it's easy to get all words that contain a particular char in a specific position,
like this:

SELECT WordID FROM suffixes
WHERE suffix >= 't' AND suffix < 'u' AND SN = 2

This gets all words that contain 't' at position 2.

Update: if you want to save space, and sacrifice a bit of speed, you can use a suffix array.

You can store all the words in one long string (array) with a separator between them, i.e. $, and build a suffix array holding pointers into that string. Now, given a char c, you can find all instances of words which contain it rather fast. Still, you'll have to check whether it's in the right position (by checking how far it is from the preceding $).

With the above technique, the search will probably be around 10x faster than scanning all the words as in your original program.
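
A minimal sketch of that idea (naive suffix array construction, example data of my own):

import bisect

words = ['SCARAB', 'STACK', 'OVERFLOW']        # example data
text = '$' + '$'.join(words) + '$'

# Suffix array: start offsets of every suffix, sorted lexicographically.
# (Naive construction; fine for a sketch.)
sa = sorted(range(len(text)), key=lambda i: text[i:])
first_chars = [text[i] for i in sa]            # sorted, so we can bisect

def positions_of(c):
    """All offsets in text where character c occurs."""
    lo = bisect.bisect_left(first_chars, c)
    hi = bisect.bisect_right(first_chars, c)
    return [sa[k] for k in range(lo, hi)]

def char_position(offset):
    """Position of the char within its word: distance from the previous $."""
    return offset - text.rfind('$', 0, offset) - 1

print([char_position(o) for o in positions_of('A')])   # [4, 2, 2]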

Update 2: I had used the database approach in one of my utilities, where I needed to locate suffixes such as "ne", and I forgot to adjust (optimize) it for this specific problem.

You can just store a single char as a suffix:

  Suffix   |   WordID   | SN
  ---------+------------+----   
  S                10      1
  t                10      2
  a                10      3
  c                10      4
  k                10      5
  ...

which saves a lot of space. Now, the query becomes

SELECT WordID FROM suffixes
WHERE suffix = 't' AND SN = 2
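
Following the SQLite suggestion from the comments, here is a minimal in-memory sketch of this single-character layout (table and column names are mine):

import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE words (id INTEGER PRIMARY KEY, word TEXT)')
conn.execute('CREATE TABLE chars (word_id INTEGER, sn INTEGER, ch TEXT)')
conn.execute('CREATE INDEX idx_chars ON chars (ch, sn)')

for word in ('SCARAB', 'SCRIBE', 'ABACUS'):
    cur = conn.execute('INSERT INTO words (word) VALUES (?)', (word,))
    conn.executemany('INSERT INTO chars VALUES (?, ?, ?)',
                     [(cur.lastrowid, i + 1, c) for i, c in enumerate(word)])

# 6-letter words with 'A' at position 3 and 'B' at position 6 (1-based):
rows = conn.execute("""
    SELECT w.word FROM words w
    JOIN chars c1 ON c1.word_id = w.id AND c1.sn = 3 AND c1.ch = 'A'
    JOIN chars c2 ON c2.word_id = w.id AND c2.sn = 6 AND c2.ch = 'B'
    WHERE length(w.word) = 6""")
print([r[0] for r in rows])   # ['SCARAB']
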
Nick Dandoulakis
  • 42,588
  • 16
  • 104
  • 136
  • @KennyTM, yes suffix, that's what I said :-) – Nick Dandoulakis Feb 18 '10 at 13:56
  • Would this require adding all suffixes into the database? If so, the number of database entries would go way up, by a factor of 10-15. Search would then probably be slower. Any suggestions on how to store and search such data, besides the database solution? – Drejc Feb 18 '10 at 14:06
  • @Drejc, yes, the suffixes will be stored in the database. IMO the factor will be 6-7. The `Suffix` should be a key, thus it'll have an index (even more space :); in that case the search will be fast, O(log n) if I'm not mistaken. – Nick Dandoulakis Feb 18 '10 at 14:13
  • 2
    @Drejc, If you want to store the data in memory AND still want to exploit the power and ease of a query language, I recommend SQLite, which has the option to create an in-memory database. – Nick Dandoulakis Feb 18 '10 at 14:16
  • This is probably the way to go. Currently I'm looking on a set of around 150.000 words * 7 = 1.500.000 (but this will probably go up) – Drejc Feb 18 '10 at 14:27
  • @Drejc, 150.000 words * 7 = 1.050.000 :) Anyway, see my update. – Nick Dandoulakis Feb 18 '10 at 15:01
  • Oops, typo ... yes 1.050.000 ;) I will certainly give it a try and test one against the other. – Drejc Feb 18 '10 at 15:08
  • @Matthieu, yeah :) what's the proper English word, an infix? – Nick Dandoulakis Feb 19 '10 at 14:09
  • @Nick: being French... I am certainly not the one you ought to ask about vocabulary issues. I have enough difficulties as it is to make myself understood ;) – Matthieu M. Feb 20 '10 at 12:51
1

You can use a Suffix Tree, or a Trie.

Skurmedel
  • 21,515
  • 5
  • 53
  • 66
  • This is probably the correct way to go. Any suggestions on how to store this data in order to have a fast search? Not necessarily a database - it can be a flat file or similar. – Drejc Feb 18 '10 at 14:04
  • 1
    I came to the conclusion that your problem is a much harder problem than my small brain can solve easily :) The Suffix Tree works with individual strings, so I'm a bit miffed when it comes to using it for searching multiple words. It will certainly be very quick when searching an individual string for substrings but I don't quite see how to apply it to all words like Nick's solution does. Thus, maybe my answer is a non-answer; hopefully it helps somewhat. – Skurmedel Feb 18 '10 at 14:16
1

You could store your information in a trie of some sort (perhaps a ternary search tree). An algorithm for partial-match searching using a trie is described in section 6 of this paper by Sedgewick and Bentley. You of course want different tries for the various lengths of words. The paper says that the partial-match search requires O(n^((k-s)/k)) time when s letters are specified in a trie of n words of length k.
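
Not the paper's ternary-search-tree algorithm verbatim, but here is a minimal sketch of partial matching on a plain trie (all names are mine):

def make_trie(words):
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node['$'] = w                      # terminal marker stores the word
    return root

def partial_match(node, pattern, i=0):
    """Yield words matching pattern, where '.' matches any letter."""
    if i == len(pattern):
        if '$' in node:
            yield node['$']
        return
    if pattern[i] == '.':
        branches = [c for c in node if c != '$']    # explore every branch
    else:
        branches = [pattern[i]] if pattern[i] in node else []
    for c in branches:
        yield from partial_match(node[c], pattern, i + 1)

trie = make_trie(['SCARAB', 'SCRIBE', 'ABACUS'])
print(list(partial_match(trie, '..A..B')))   # ['SCARAB']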

Justin Peel
  • 46,722
  • 6
  • 58
  • 80