How to implement a good search system for a set of html pages generated with Mkdocs?

Question

I'm using MkDocs to create articles (sets of static HTML pages). The problem with these docs is that the search system created by MkDocs is very basic: it retrieves articles almost at random, based on the mere presence of individual words in an article's text. Coherent phrase matching is not possible, and neither is strict "A B C" match searching.

Some examples of how badly the search works presently:
When you enter "do not select auto-filling", the search will not bring up the 3 articles which actually contain the phrase "Do not select "Auto-filling" by default", but instead bring up all articles containing do, in, not, select, auto, filling + their variations.

When you enter a short word, for example "while", in the search field, no results are retrieved, even though the word while is present in a dozen articles. Another example: when you enter "selector window", articles containing the phrase "time selector window" are not brought to the top of the search results; instead, all articles containing the word "window" are retrieved.

Could anyone Mkdocs-savvy advise with this, please?

What's in my mkdocs.yml:

markdown_extensions:  
    - smarty  
    - toc:  
        permalink: True  
        separator: "_"  
    - sane_lists  
    - tables  
    - meta  
    - fenced_code  
    - admonition  
    - footnotes  
plugins:  
    - search  
extra:  
    version: 1.0  
    search:  
      tokenizer: '[\\s\\-\\.]+'

^ This search tokenizer is completely ignored for some reason. If it's removed, search works just as badly :)

What am I missing?

Essentially you would need to recreate your own implementation of a JS search library. Specifically a clone of [lunr.js](https://github.com/olivernn/lunr.js) which works to your liking. How to create an entire search library is not an on-topic question here. That said, I see a few minor adjustments you can make which might help the existing search solution work better for you, which I will address in an answer. — Waylan, Jan 21 '20 at 19:25

score 2 · Answer 1 · answered Jan 21 '20 at 20:13

First of all, as your mkdocs.yml file does not specify a theme, it is assumed that you are using the default theme, which uses the default search implementation. Note that some other themes (especially material) implement their own search solution which is different than the default. This answer does not apply to those themes.

The search tokenizer setting is being ignored because you are defining it incorrectly. As documented, the setting is named separator, not tokenizer, and it needs to be defined as a sub-section of the search plugin rather than under extra. Like this:

plugins:
    - search:
        separator: '[\s\-\.]+'

Regarding the search terms, note that MkDocs uses lunr.js as its search engine. Lunr.js documents how the end user can modify the search in various ways.

By the way, your search for auto-filling will not match as you expect because the hyphen (-) is a separator character. In other words, when the search index is created, the hyphen is treated the same as a space and the words auto and filling are indexed as two separate words. If you don't want that behavior, you need to remove the hyphen from your setting. But that is probably not what you want.
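To make the effect of the separator concrete, here is a minimal Python sketch of splitting text on that pattern. It is an illustration only, not the code MkDocs or lunr.js actually runs; the `tokenize` helper and the lowercasing step are assumptions made for the example.

```python
import re

# Separator pattern from the corrected mkdocs.yml: whitespace, hyphen, or period.
WITH_HYPHEN = r"[\s\-\.]+"
# Hypothetical alternative with the hyphen removed from the character class.
WITHOUT_HYPHEN = r"[\s\.]+"


def tokenize(text, separator=WITH_HYPHEN):
    """Split text on the separator pattern, lowercasing and dropping empty pieces."""
    return [tok for tok in re.split(separator, text.lower()) if tok]


# With the hyphen as a separator, "auto-filling" becomes two terms.
print(tokenize("do not select auto-filling"))
# Without it, "auto-filling" stays a single term.
print(tokenize("do not select auto-filling", WITHOUT_HYPHEN))
```

With the hyphen in the pattern, a query for auto-filling is really a query for the two terms auto and filling.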

The default is to use an OR search. If any one of the terms (each term being separated by any one of the separator characters) exists within a document, then that document is returned as a search result. If multiple terms exist within a document, then that document is ranked higher. However, an OR search does not consider the terms in relation to each other within the document.

You might find an AND search to be more effective. Simply prepend a + to each term (+do +not +select +auto +filling) and then you will only get results which contain all of the terms. Notice that I also left the hyphen out of the search terms as it is a separator as explained above.
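The difference between the two modes can be sketched with a toy index in Python. The page names, the `DOCS` mapping, and both functions are hypothetical, and ranking by a simple count of matching terms is a simplification; lunr.js uses its own scoring.

```python
# Hypothetical miniature corpus: page name -> set of indexed terms.
DOCS = {
    "a.html": {"select", "auto", "filling", "default"},
    "b.html": {"select", "window"},
    "c.html": {"time", "selector", "window"},
}


def or_search(terms):
    """Return pages containing at least one term, most matching terms first."""
    wanted = set(terms)
    hits = {name: len(words & wanted) for name, words in DOCS.items()}
    return sorted((n for n in hits if hits[n] > 0), key=lambda n: (-hits[n], n))


def and_search(terms):
    """Return only pages containing every term (the effect of + on each term)."""
    wanted = set(terms)
    return sorted(name for name, words in DOCS.items() if wanted <= words)


query = ["select", "auto", "filling"]
print(or_search(query))   # any page sharing a term with the query
print(and_search(query))  # only pages containing all three terms
```

Neither function looks at where the terms occur in a page, which is why neither can favor an exact phrase.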

However, while that will only return results which contain all of the terms, it does not favor results which contain the terms grouped together in that specific order. A common solution which search engines employ is to require terms enclosed in quotes to match the specific order. However, as per olivernn/lunr.js#62, lunr.js does not support that feature at this time.
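For reference, this is roughly what a quoted phrase match has to check: that the query terms appear consecutively and in order within a document's token list. The `contains_phrase` function below is a hypothetical Python sketch of that idea, not something lunr.js or MkDocs provides.

```python
def contains_phrase(doc_tokens, phrase_tokens):
    """True if phrase_tokens appear consecutively, in order, inside doc_tokens."""
    n = len(phrase_tokens)
    if n == 0:
        return True
    # Slide a window of length n over the document and compare each slice.
    return any(
        doc_tokens[i:i + n] == phrase_tokens
        for i in range(len(doc_tokens) - n + 1)
    )


print(contains_phrase(["time", "selector", "window"], ["selector", "window"]))
print(contains_phrase(["selector", "of", "the", "window"], ["selector", "window"]))
```

A check like this needs the original word order of each page, which a plain term index does not keep.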

Additionally, the search engine ignores stop words. Specifically, some words are so common that they are ignored completely by the search engine. For example, words like the or a occur multiple times in every English language document. Therefore, the search engine ignores them.
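Stop-word filtering can be sketched in Python as follows. The `STOP_WORDS` set here is a short illustrative list, not the actual list shipped with lunr.js; it assumes that do, not, and while are treated as stop words, which would be consistent with a search for while returning nothing.

```python
# Illustrative stop-word list only (an assumption for this sketch);
# the real list used by the search engine is longer.
STOP_WORDS = {"a", "the", "do", "not", "by", "while"}


def remove_stop_words(tokens):
    """Drop every token that appears in the stop-word list."""
    return [tok for tok in tokens if tok not in STOP_WORDS]


print(remove_stop_words(["do", "not", "select", "auto", "filling"]))
print(remove_stop_words(["while"]))  # nothing left to search for
```

If every term in a query is a stop word, there is nothing left to look up in the index.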

Then there is the issue of stemming, which is explained in lunr.js' documentation:

Stemming is the process of reducing inflected or derived words to their base or stem form. For example, the stem of “searching”, “searched” and “searchable” should be “search”. This has two benefits: firstly the number of tokens in the search index, and therefore its size, is significantly reduced, and in addition, it increases the recall when performing a search. A document containing the word “searching” is likely to be relevant to a query for “search”.

Given the above, you will probably find that the search for select auto fill will most likely return the exact same results as do not select auto-filling. However, using +filling should help as it forces an exact match for the term filling rather than the stem word fill.
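A deliberately naive Python stemmer illustrates why the two queries can end up looking alike. The `toy_stem` function and its suffix list are assumptions made for this example; lunr.js uses a proper stemming algorithm with many more rules.

```python
# Toy suffix list, chosen only to cover the words discussed above.
SUFFIXES = ("able", "ing", "ed")


def toy_stem(word):
    """Strip one known suffix if at least three characters would remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


print([toy_stem(w) for w in ["searching", "searched", "searchable"]])
print([toy_stem(w) for w in ["select", "auto", "filling"]])
print([toy_stem(w) for w in ["select", "auto", "fill"]])
```

Under this toy rule, filling and fill reduce to the same stem, so the last two term lists become identical.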

Finally, you ask...

How to implement a good search system

Note that such a question is too broad and off-topic here. However, the lunr.js documentation linked to above provides a nice summary of many of the basic concepts used by most search engines. While you would likely make some different choices in your implementation (as would I), the basic concepts should give you a starting point for terms to search in your research if you really are interested in creating an entire search engine of your own.

Thank you so much, Waylan. Indeed, I use the material theme... how to make search work in it? — CyberHead, Jan 22 '20 at 13:13
As I understand it, `material` is in the process of completely rewriting their search implementation, which is a heavily modified lunr.js. It may be that their new release will address your concerns. Or maybe not. I would suggest reaching out to the developers of the theme to express your concerns now before they do their release (unless they already have, I haven't followed them that closely). — Waylan, Jan 22 '20 at 18:10
