Simple word autocomplete just displays a list of words that match the characters already typed. But I would like to order the words in the autocomplete list by their probability of occurring, given the words that were typed before, relying on a statistical model of a text corpus. What algorithms and data structures do I need for this? Can you give me links to good tutorials?
2 Answers
You don't need probability for autocompletion. Instead, build a prefix tree (aka a trie) with the words in the corpus as keys and their frequencies as values. When you encounter a partial string, walk the trie as far as you can, then generate all the suffixes from the point you've reached and sort them by frequency.
When a user enters a previously unseen string, just add it to the trie with frequency one; when a user enters a string that you had seen (perhaps by selecting it from the candidate list), increment its frequency.
[Note that you can't do the simple increment with a probability model; in the worst case, you'd have to recompute all the probabilities in the model.]
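As a rough illustration of the trie described above, here is a minimal Python sketch. The class and method names (`TrieNode`, `FrequencyTrie`, `add`, `complete`) and the `limit` parameter are made up for this example; it assumes plain per-word counts and does no pruning or caching, so it collects every suffix below the prefix node before sorting.

```python
class TrieNode:
    def __init__(self):
        self.children = {}  # maps a character to the child TrieNode
        self.count = 0      # how many times the word ending here was seen


class FrequencyTrie:
    """Hypothetical helper: a prefix tree with word frequencies as values."""

    def __init__(self):
        self.root = TrieNode()

    def add(self, word, count=1):
        # An unseen word ends up with frequency `count`;
        # a known word simply has its frequency incremented.
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.count += count

    def complete(self, prefix, limit=10):
        # Walk the trie as far as the prefix goes.
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return []
        # Generate all suffixes below that point (iterative depth-first walk).
        found = []
        stack = [(node, prefix)]
        while stack:
            current, word = stack.pop()
            if current.count > 0:
                found.append((word, current.count))
            for ch, child in current.children.items():
                stack.append((child, word + ch))
        # Most frequent first; ties broken alphabetically.
        found.sort(key=lambda pair: (-pair[1], pair[0]))
        return [word for word, _ in found[:limit]]
```

Usage would look something like `trie.add("there")` each time the user enters or selects a word, and `trie.complete("th")` to get the candidate list for the partial string.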
If you want to delve deeper into this kind of algorithm, I highly suggest you read (the first chapters of) Speech and Language Processing by Jurafsky and Martin. It treats discrete probability for language processing in quite some detail.

Although this is a straightforward approach, it doesn't take into account information from n-gram language models of the corpus, i.e. word history. – swami Aug 05 '13 at 13:18
@swami: that's right, but is that a problem? Frequencies can be weighted if required, perhaps using an exponential scheme, so that the user's typing will outweigh the corpus or vice versa. – Fred Foo Aug 05 '13 at 13:40
> "then generate all the suffixes from the point you've reached and sort them by frequency". I think there must be some important optimisations here. It is impractical to traverse an entire subtrie given just the first one (or two, or even three) characters. I suppose there should be a pre-computation phase... – DimanNe Jul 07 '20 at 18:04
Peter Norvig wrote an article, How to Write a Spelling Corrector, that explains how Google's "Did you mean...?" feature works and how it uses Bayesian inference to be effective. It is a very good read, and should be adaptable to an autocomplete feature.

Using Bayes' law will be overkill for autocompletion, though, since the ; just giving the most common autocompletion of a partial string is often good enough. – Fred Foo Jul 12 '12 at 09:51
@larsmans True, it may be overkill. But finding words matching the autocomplete text and sorting them according to probability just seems _so_ simple ;) – Wernsey Jul 12 '12 at 09:57
#4 of the improvements points me in the right direction: using n-grams for predictions. – chiborg Jul 12 '12 at 11:13
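
For what it's worth, the n-gram idea mentioned in the comments can be sketched with a simple bigram model in Python. Everything here (`BigramCompleter`, `complete`, the `limit` parameter) is a hypothetical name for illustration, not something taken from Norvig's article. The sketch assumes the corpus is already tokenized into a list of words, ranks candidates by how often they followed the previous word, and falls back to plain corpus frequency as a tie-breaker; a real model would want proper smoothing.

```python
from collections import Counter, defaultdict


class BigramCompleter:
    """Hypothetical helper: rank completions by bigram counts from a corpus."""

    def __init__(self, tokens):
        # unigrams[w]    = number of times w occurs in the corpus
        # bigrams[p][w]  = number of times w directly follows p
        self.unigrams = Counter(tokens)
        self.bigrams = defaultdict(Counter)
        for prev, word in zip(tokens, tokens[1:]):
            self.bigrams[prev][word] += 1

    def complete(self, prev_word, prefix, limit=5):
        followers = self.bigrams.get(prev_word, Counter())
        candidates = [w for w in self.unigrams if w.startswith(prefix)]
        # For a fixed prev_word, sorting by the bigram count is the same as
        # sorting by the estimate P(w | prev_word) = count(prev, w) / count(prev),
        # because the denominator is identical for every candidate.
        # Words never seen after prev_word are ordered by overall frequency.
        candidates.sort(key=lambda w: (-followers[w], -self.unigrams[w], w))
        return candidates[:limit]
```

With a model built from the tokens of "i like green tea i like green grass the grass is green", calling `complete("the", "g")` would put "grass" ahead of "green", even though "green" is the more frequent word overall. The linear scan over the vocabulary is only there to keep the sketch short; the candidates could just as well come from a trie.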