FTS uses dictionaries to normalize the text. From the manual:
12.6. Dictionaries
Dictionaries are used to eliminate words that should not be considered in a search (stop words), and to normalize words so that different derived forms of the same word will match. A successfully normalized word is called a lexeme.
So dictionaries are used to throw out things that are too common or meaningless to consider in a search (stop words) and to normalize everything else so city and cities, for example, will match even though they're different words.
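As a quick sketch of that claim (assuming the default english configuration): both forms should reduce to the same lexeme, citi, so this comparison should return t:

=> select to_tsvector('english', 'city') @@ to_tsquery('english', 'cities');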
Let us look at some output from ts_debug and see what's going on with the dictionaries:
=> select * from ts_debug('english', 'mortgage');
alias | description | token | dictionaries | dictionary | lexemes
-----------+-----------------+----------+----------------+--------------+-----------
asciiword | Word, all ASCII | mortgage | {english_stem} | english_stem | {mortgag}
=> select * from ts_debug('simple', 'mortgage');
alias | description | token | dictionaries | dictionary | lexemes
-----------+-----------------+----------+--------------+------------+------------
asciiword | Word, all ASCII | mortgage | {simple} | simple | {mortgage}
Notice that the simple configuration uses the simple dictionary, whereas the english configuration uses the english_stem dictionary.
The simple dictionary, per the manual:
operates by converting the input token to lower case and checking it against a file of stop words. If it is found in the file then an empty array is returned, causing the token to be discarded. If not, the lower-cased form of the word is returned as the normalized lexeme.
The simple dictionary just throws out stop words, downcases, and that's about it. We can see its simplicity ourselves:
=> select to_tsquery('simple', 'Mortgage'), to_tsquery('simple', 'Mortgages');
to_tsquery | to_tsquery
------------+-------------
'mortgage' | 'mortgages'
The simple dictionary is too simple to even handle simple plurals.
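If you want to poke at a dictionary directly, rather than going through a configuration, ts_lexize will do it; as a sketch, the simple dictionary should just hand back the lower-cased token, {mortgages}:

=> select ts_lexize('simple', 'Mortgages');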
So what is this english_stem dictionary all about? The "stem" suffix is a giveaway: this dictionary applies a stemming algorithm to words to convert (for example) city and cities to the same thing. From the fine manual:
12.6.6. Snowball Dictionary
The Snowball dictionary template is based on a project by Martin Porter, inventor of the popular Porter's stemming algorithm for the English language. [...] Each algorithm understands how to reduce common variant forms of words to a base, or stem, spelling within its language.
And just below that we see the english_stem dictionary:
CREATE TEXT SEARCH DICTIONARY english_stem (
    TEMPLATE = snowball,
    Language = english,
    StopWords = english
);
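That StopWords = english line means the dictionary also discards stop words, which ts_lexize can show directly; as a sketch, a stop word like the should come back as an empty array (discarded), while cities should come back as its stem, {citi}:

=> select ts_lexize('english_stem', 'the');
=> select ts_lexize('english_stem', 'cities');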
So the english_stem dictionary stems words, and we can see that happen:
=> select to_tsquery('english', 'Mortgage'), to_tsquery('english', 'Mortgages');
to_tsquery | to_tsquery
------------+------------
'mortgag' | 'mortgag'
Executive Summary: 'simple' implies simple-minded literal matching, while 'english' applies stemming to (hopefully) produce better matching. The stemming turns mortgage into mortgag and that gives you your match.
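Putting it together (a sketch, assuming the default configurations): the same document and query that fail to match under simple should match under english, since both sides get stemmed to mortgag, so simple_match should be f and english_match should be t:

=> select to_tsvector('simple', 'Mortgages') @@ to_tsquery('simple', 'mortgage') as simple_match,
          to_tsvector('english', 'Mortgages') @@ to_tsquery('english', 'mortgage') as english_match;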