13

Simple word autocomplete just displays a list of words that match the characters already typed. I would instead like to order the words in the autocomplete list by their probability of occurring, given the words that were typed before, relying on a statistical model of a text corpus. What algorithms and data structures do I need for this? Can you give me links to good tutorials?

Fred Foo
chiborg

2 Answers

14

You don't need probability for autocompletion. Instead, build a prefix tree (aka a trie) with the words in the corpus as keys and their frequencies as values. When you encounter a partial string, walk the trie as far as you can, then generate all the suffixes from the point you've reached and sort them by frequency.

When a user enters a previously unseen string, just add it to the trie with frequency one; when a user enters a string that you had seen (perhaps by selecting it from the candidate list), increment its frequency.

[Note that you can't do the simple increment with a probability model; in the worst case, you'd have to recompute all the probabilities in the model.]
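As a rough illustration of the trie approach described above, here is a minimal Python sketch. The names `TrieNode`, `FrequencyTrie`, `add` and `complete` are made up for this example, and the tiny corpus is invented; it stores a raw count at the node where each word ends and sorts the completions by that count.

```python
class TrieNode:
    def __init__(self):
        self.children = {}  # maps a character to the child TrieNode
        self.count = 0      # frequency of the word ending here (0 = no word)


class FrequencyTrie:
    """Hypothetical helper class: a trie with word frequencies as values."""

    def __init__(self):
        self.root = TrieNode()

    def add(self, word, count=1):
        # Insert an unseen word, or bump the frequency of a known one.
        node = self.root
        for ch in word:
            if ch not in node.children:
                node.children[ch] = TrieNode()
            node = node.children[ch]
        node.count += count

    def complete(self, prefix, limit=10):
        # Walk the trie along the prefix; give up if it is not present.
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return []
        # Collect every word below the node we reached.
        found = []
        stack = [(node, prefix)]
        while stack:
            current, word = stack.pop()
            if current.count > 0:
                found.append((word, current.count))
            for ch, child in current.children.items():
                stack.append((child, word + ch))
        # Most frequent first; ties are broken alphabetically.
        found.sort(key=lambda item: (-item[1], item[0]))
        return [word for word, _ in found[:limit]]


trie = FrequencyTrie()
for w in "the cat sat on the mat then the theory held".split():
    trie.add(w)

suggestions = trie.complete("th")
print(suggestions)
```

Calling `trie.add(word)` again when the user picks a suggestion is the "simple increment" mentioned above. Note that `complete` visits the whole subtree under the prefix, which can be slow for very short prefixes on a large vocabulary.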

If you want to delve deeper into these kinds of algorithms, I highly suggest you read (the first chapters of) Speech and Language Processing by Jurafsky and Martin. It treats discrete probability for language processing in quite some detail.

Fred Foo
  • Although a straightforward approach, this solution doesn't take into account information from n-gram language models of the corpus. i.e. word history – swami Aug 05 '13 at 13:18
  • @swami: that's right, but is that a problem? Frequencies can be weighted if required, perhaps using an exponential scheme, so that the user's typing will outweigh the corpus or vice versa. – Fred Foo Aug 05 '13 at 13:40
  • 1
    > "then generate all the suffixes from the point you've reached and sort them by frequency". I think there must be some important optimisations. It is impossible to traverse an entire trie given just one (or 2, or even 3) first characters. I suppose there should be a pre-computation phase... – DimanNe Jul 07 '20 at 18:04
6

Peter Norvig wrote an article, How to Write a Spelling Corrector, that explains how Google's "Did you mean...?" feature works and how it uses Bayesian inference to be effective. It is a very good read, and the approach should be adaptable to an autocomplete feature.

Wernsey
  • Using Bayes' law will be overkill for autocompletion, though; just giving the most common autocompletion of a partial string is often good enough. – Fred Foo Jul 12 '12 at 09:51
  • 1
    @larsmans True, it may be overkill. But finding words matching the autocomplete text and sorting them according to probability just seems _so_ simple ;) – Wernsey Jul 12 '12 at 09:57
  • sorting them by frequency is much simpler. – Fred Foo Jul 12 '12 at 10:01
  • 2
    #4 of the improvements points me in the right direction: Using n-grams for predictions. – chiborg Jul 12 '12 at 11:13
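
To make the n-gram idea from the question and comments concrete, here is a small Python sketch that ranks completions by how often each candidate followed the previous word in a corpus. It is only an assumption about how such a ranking could look: the function names `train` and `suggest` and the toy corpus are invented for illustration, and it uses raw bigram counts with a fallback to overall word frequency rather than properly smoothed probabilities.

```python
from collections import Counter, defaultdict


def train(tokens):
    # Count single words and (previous word -> next word) pairs.
    unigrams = Counter(tokens)
    bigrams = defaultdict(Counter)
    for prev, word in zip(tokens, tokens[1:]):
        bigrams[prev][word] += 1
    return unigrams, bigrams


def suggest(prefix, previous, unigrams, bigrams, limit=5):
    # Counts of words that followed `previous` in the corpus (may be empty).
    following = bigrams.get(previous, Counter())
    candidates = [w for w in unigrams if w.startswith(prefix)]
    # Rank by bigram count first, then overall frequency, then alphabetically.
    candidates.sort(key=lambda w: (-following[w], -unigrams[w], w))
    return candidates[:limit]


corpus = "the cat sat on the mat and the man saw a man on the mat".split()
unigrams, bigrams = train(corpus)

after_the = suggest("ma", "the", unigrams, bigrams)
after_a = suggest("ma", "a", unigrams, bigrams)
print(after_the, after_a)
```

With this toy corpus the same prefix "ma" is ordered differently after "the" than after "a", which is the effect of taking word history into account. A real system would more likely use longer n-grams with smoothing or backoff, as covered in the Jurafsky and Martin book mentioned above.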