Stemming Words in MySQL For e.g. the user might search for "testing", "tested" or "tests". All these words are related to each other because the base word "test" is common in all of them. Is there a way to get such result or function?
-
1(1) I removed the irrelevant database tags. (2) Stemming is not an appropriate operation for a relational database. – Gordon Linoff Jan 08 '18 at 05:55
1 Answers
MySQL Full-Text Search
Historically, full-text searches were supported in MyISAM engines. After version 5.6, MySQL also supported full-text searches in InnoDB storage engines. This has been great news, since it enables developers to benefit from InnoDB’s referential integrity, ability to perform transactions, and row level locks.
There are basically two approaches to full-text searches in MySQL: natural language and boolean mode. (A third option augments the natural language search with a second expansion query.)
The main difference between the natural and boolean modes is that the boolean allows certain operators as part of the search. For instance, boolean operators can be used if a word has greater relevance than others in the query or if a specific word should be present in the results, etc. It’s worth noticing that in both cases, results can be sorted by the relevance computed by MySQL during the search.
The best fit for our problem was to use InnoDb full-text searches in boolean mode. Why?
- We had little time to implement the search function.
- At this point, we had no big data to crunch nor a massive load to require something like Elasticsearch or Sphinx.
- We used shared hosting that doesn’t support Elasticsearch or Sphinx and the hardware was pretty limited at this stage.
- While we wanted word stemming in our search function, it wasn’t a deal breaker: we could implement it (within constraints) by way of some simple PHP coding and data denormalization
- Full-text searches in boolean mode can search words with wildcards (for the word stemming) and sort the results based on relevance.
In the Normalized Vertabelo Model
Let’s see how a simple search would work. We’ll create a sample table first:
CREATE TABLE artists (
id int(11) NOT NULL AUTO_INCREMENT, name varchar(255) NOT NULL,bio text NOT NULL, CONSTRAINT artists_pk PRIMARY KEY (id)
)ENGINE InnoDB;
CREATE FULLTEXT INDEX artists_idx_1 ON artists (name);
In natural language mode
You can insert some sample data and start testing. (It would be good to add it to your sample dataset.) For instance, we’ll try searching for Michael Jackson:
SELECT
*
FROM
artists
WHERE
MATCH (artists.name) AGAINST ('Michael Jackson' IN NATURAL LANGUAGE MODE)
This query will find records that match the search terms and will sort matching records by relevance; the better the match, the more relevant it is and the higher the result will appear in the list. In boolean mode
We can perform the same search in boolean mode. If we don’t apply any operators to our query, the only difference will be that results are not sorted by relevance:
SELECT
*
FROM
artists
WHERE
MATCH (artists.name) AGAINST ('Michael Jackson' IN BOOLEAN MODE)
The wildcard operator in boolean mode
Since we want to search stemmed and partial words, we will need the wildcard operator (*). This operator can be used in boolean mode searches, which is why we chose that mode.
So, let’s unleash the power of boolean search and try searching for part of the artist’s name. We’ll use the wildcard operator to match any artist whose name starts with ‘Mich’:
SELECT
*
FROM
artists
WHERE
MATCH (name) AGAINST ('Mich*' IN BOOLEAN MODE)