Algorithms/theory behind predictive autocomplete?

Question

Simple word autocomplete just displays a list of words that match the characters already typed. I would like to order the words in the autocomplete list by their probability of occurring, given the words that were typed before, relying on a statistical model of a text corpus. What algorithms and data structures do I need for this? Can you give me links to good tutorials?

Fred Foo · Answer 1 · 2012-07-12T10:35:01.580

14

You don't need probability for autocompletion. Instead, build a prefix tree (aka a trie) with the words in the corpus as keys and their frequencies as values. When you encounter a partial string, walk the trie as far as you can, then generate all the suffixes from the point you've reached and sort them by frequency.

When a user enters a previously unseen string, just add it to the trie with frequency one; when a user enters a string that you have seen before (perhaps by selecting it from the candidate list), increment its frequency.

[Note that you can't do the simple increment with a probability model; in the worst case, you'd have to recompute all the probabilities in the model.]
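The trie-with-frequencies idea above could be sketched roughly as follows. This is a minimal illustration, not a tuned implementation: the names `TrieNode`, `FrequencyTrie`, `add`, and `complete` are made up for this example, and ties in frequency are broken alphabetically as an arbitrary choice.

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # maps a character to the child TrieNode
        self.count = 0       # how often the word ending at this node was seen


class FrequencyTrie:
    """Hypothetical helper: a prefix tree with words as keys, frequencies as values."""

    def __init__(self):
        self.root = TrieNode()

    def add(self, word, count=1):
        """Insert a new word, or increment the frequency of a known one."""
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.count += count

    def complete(self, prefix, limit=5):
        """Return up to `limit` stored words starting with `prefix`, most frequent first."""
        # Walk the trie as far as the prefix goes.
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return []
        # Generate every suffix below that point (iterative depth-first search).
        found = []
        stack = [(node, prefix)]
        while stack:
            current, word = stack.pop()
            if current.count > 0:
                found.append((word, current.count))
            for ch, child in current.children.items():
                stack.append((child, word + ch))
        # Sort by descending frequency, then alphabetically (assumed tie-break).
        found.sort(key=lambda item: (-item[1], item[0]))
        return [word for word, _ in found[:limit]]


trie = FrequencyTrie()
for w in "the cat sat on the mat then the hen ran".split():
    trie.add(w)          # corpus words
trie.add("thesis")       # previously unseen user input: frequency one
trie.add("then")         # input seen before: frequency is incremented
print(trie.complete("th"))
```

Updating after user input is a single `add` call, which is the cheap increment the answer describes.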

If you want to delve deeper into this kind of algorithm, I highly suggest you read (the first chapters of) Speech and Language Processing by Jurafsky and Martin. It treats discrete probability for language processing in quite some detail.

edited Jul 12 '12 at 10:35

answered Jul 12 '12 at 10:03

Fred Foo


Although a straightforward approach, this solution doesn't take into account information from n-gram language models of the corpus, i.e. word history. – swami Aug 05 '13 at 13:18
@swami: that's right, but is that a problem? Frequencies can be weighted if required, perhaps using an exponential scheme, so that the user's typing will outweigh the corpus or vice versa. – Fred Foo Aug 05 '13 at 13:40
1

> "then generate all the suffixes from the point you've reached and sort them by frequency". I think there must be some important optimisations. It is impossible to traverse an entire trie given just one (or 2, or even 3) first characters. I suppose there should be a pre-computation phase... – DimanNe Jul 07 '20 at 18:04
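One possible form of such a pre-computation phase, sketched under assumptions: instead of walking a subtree at query time, store the k best completions for every prefix up front. The function name `build_top_k_index` is invented for this example, and a flat dictionary stands in for caching the same lists on trie nodes. It trades memory and update cost for constant-time lookups.

```python
import heapq
from collections import defaultdict


def build_top_k_index(word_counts, k=3):
    """Map every prefix to its k most frequent completions (hypothetical helper)."""
    by_prefix = defaultdict(list)
    for word, count in word_counts.items():
        # Register the word under each of its non-empty prefixes.
        for end in range(1, len(word) + 1):
            by_prefix[word[:end]].append((count, word))
    # Keep only the k best entries per prefix: highest count first, then alphabetical.
    return {
        prefix: [w for _, w in heapq.nsmallest(k, entries, key=lambda e: (-e[0], e[1]))]
        for prefix, entries in by_prefix.items()
    }


counts = {"the": 3, "then": 2, "thesis": 1, "tea": 4}
index = build_top_k_index(counts, k=2)
print(index["t"])            # lookup is a single dictionary access
print(index.get("x", []))    # unknown prefix
```

With this layout a frequency update has to refresh the cached list of every prefix of the changed word, so it fits best when lookups far outnumber updates.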

score 6 · Accepted Answer · answered Jul 12 '12 at 09:48


Peter Norvig wrote an article, How to Write a Spelling Corrector, that explains how Google's "Did you mean...?" feature works and how it uses Bayesian inference to be effective. It is a very good read, and the approach should be adaptable to an autocomplete feature.
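For reference, the Bayesian core of that article can be written as below, where w is the word as typed and c is a candidate correction. Reading w as the typed prefix and c as a candidate completion is an assumption made here to carry the idea over to autocomplete, not something the article spells out.

```latex
\hat{c} = \arg\max_{c} P(c \mid w)
        = \arg\max_{c} \frac{P(w \mid c)\, P(c)}{P(w)}
        = \arg\max_{c} P(w \mid c)\, P(c)
```

The denominator P(w) is the same for every candidate c, so it drops out of the maximisation. P(c) is the language model (how common the candidate is) and P(w | c) is the error model (how likely the typed text is, given that c was intended).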


Wernsey


Using Bayes' law will be overkill for autocompletion, though; just giving the most common autocompletion of a partial string is often good enough. – Fred Foo Jul 12 '12 at 09:51
1

@larsmans True, it may be overkill. But finding words matching the autocomplete text and sorting them according to probability just seems _so_ simple ;) – Wernsey Jul 12 '12 at 09:57
sorting them by frequency is much simpler. – Fred Foo Jul 12 '12 at 10:01
2

#4 of the improvements points me in the right direction: Using n-grams for predictions. – chiborg Jul 12 '12 at 11:13
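A rough sketch of what n-gram based prediction could look like with bigrams (n = 2): count which words follow which, then rank the candidates matching the typed prefix by the estimated probability P(word | previous word). The function names `train_bigrams` and `predict` and the toy corpus are invented for this example. The estimate is a plain unsmoothed relative frequency, so a realistic system would probably want smoothing or a fallback to single-word frequencies when the previous word is unseen.

```python
from collections import Counter, defaultdict


def train_bigrams(tokens):
    """Count how often each word follows each preceding word."""
    following = defaultdict(Counter)
    for prev, word in zip(tokens, tokens[1:]):
        following[prev][word] += 1
    return following


def predict(following, prev_word, prefix, limit=3):
    """Rank words starting with `prefix` by the estimate P(word | prev_word)."""
    counts = following.get(prev_word, Counter())
    total = sum(counts.values())
    # Relative frequency among all words seen after prev_word (no smoothing).
    ranked = sorted(
        ((word, n / total) for word, n in counts.items() if word.startswith(prefix)),
        key=lambda item: (-item[1], item[0]),
    )
    return ranked[:limit]


tokens = "i like tea i like toast i like tea i love trains".split()
model = train_bigrams(tokens)
for word, prob in predict(model, "like", "t"):
    print(word, round(prob, 2))
```

Here the user has typed "like" followed by "t", so only words that followed "like" in the corpus are candidates, which is the word-history information a plain frequency trie ignores.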
