
I am trying to implement a semantic search engine with a deep NLP pipeline using Whoosh. Currently I only have a stemming analyzer, but I need to add lemmatization and POS tagging to my analyzers.

 schema = Schema(id=ID(stored=True, unique=True),
                 stem_text=TEXT(stored=True, analyzer=StemmingAnalyzer()))

I want to know how to add custom analyzers to my schema.

Shruti h

1 Answer


You can write a custom lemmatization filter and integrate it into an existing Whoosh analyzer. Quoting from the Whoosh docs:

Whoosh does not include any lemmatization functions, but if you have separate lemmatizing code you could write a custom whoosh.analysis.Filter to integrate it into a Whoosh analyzer.

You can create an analyzer by combining a tokenizer with filters:

my_analyzer = RegexTokenizer() | LowercaseFilter() | StopFilter() | LemmatizationFilter()

or by adding a filter to an existing analyzer:

my_analyzer = StandardAnalyzer() | LemmatizationFilter()
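To see what the chained analyzer is doing, here is a plain-Python sketch of the same idea (no Whoosh required): each stage consumes a token stream and yields a possibly modified stream, which is what Whoosh's `|` operator wires together (Whoosh's real filters work on `Token` objects, not bare strings).

```python
# Illustration only: each pipeline stage is a generator over tokens.
def tokenize(text):
    for word in text.split():
        yield word

def lowercase_filter(stream):
    for token in stream:
        yield token.lower()

def stop_filter(stream, stopwords=frozenset({"a", "the", "of"})):
    for token in stream:
        if token not in stopwords:
            yield token

# Equivalent to chaining stages with `|` in Whoosh:
tokens = list(stop_filter(lowercase_filter(tokenize("The Art of Search"))))
# tokens == ["art", "search"]
```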

You can define a filter like:

from whoosh.analysis import Filter

class LemmatizationFilter(Filter):
    def __call__(self, stream):
        for token in stream:
            token.text = lemmatize(token.text)  # your separate lemmatizing code
            yield token
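As a concrete sketch of what the filter body might do, here is a plain-Python version using a tiny lookup table as a stand-in lemmatizer (in practice you would plug in your own lemmatizing code, e.g. NLTK's `WordNetLemmatizer`; the `LEMMAS` dict is purely hypothetical):

```python
# LEMMAS is a hypothetical stand-in for a real lemmatizer.
LEMMAS = {"mice": "mouse", "ran": "run"}

def lemmatization_filter(stream):
    # Rewrite each token to its lemma, passing unknown tokens through unchanged.
    for token in stream:
        yield LEMMAS.get(token, token)

result = list(lemmatization_filter(iter(["mice", "ran", "fast"])))
# result == ["mouse", "run", "fast"]
```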
Assem
  • This is the analyzer I am using now: `my_analyzer = RegexTokenizer() | StopFilter() | LowercaseFilter() | StemFilter() | Lemmatizer()`. In this analyzer, will the lemmatizer override the stem filter, or are both stemming and lemmatizing applied to my index? – Shruti h Nov 20 '17 at 20:21