
I am trying to implement a semantic search engine with a deep NLP pipeline using Whoosh. Currently I only have a stemming analyzer, but I need to add lemmatization and POS tagging to my analyzers.

 schema = Schema(id=ID(stored=True, unique=True),
                 stem_text=TEXT(stored=True, analyzer=StemmingAnalyzer()))

I want to know how to add custom analyzers to my schema.

Shruti h

1 Answer


You can write a custom lemmatization filter and integrate it into an existing Whoosh analyzer. Quoting from the Whoosh docs:

Whoosh does not include any lemmatization functions, but if you have separate lemmatizing code you could write a custom whoosh.analysis.Filter to integrate it into a Whoosh analyzer.

You can create an analyzer by combining a tokenizer with filters:

my_analyzer = RegexTokenizer() | LowercaseFilter() | StopFilter() | LemmatizationFilter()

or by adding a filter to an existing analyzer:

my_analyzer = StandardAnalyzer() | LemmatizationFilter()
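To see what the chained analyzer is doing, here is a plain-Python sketch of the same idea (no Whoosh required): each stage consumes a token stream and yields a possibly modified stream, which is what Whoosh's `|` operator wires together (Whoosh's real filters work on `Token` objects, not bare strings).

```python
# Illustration only: each pipeline stage is a generator over tokens.
def tokenize(text):
    for word in text.split():
        yield word

def lowercase_filter(stream):
    for token in stream:
        yield token.lower()

def stop_filter(stream, stopwords=frozenset({"a", "the", "of"})):
    for token in stream:
        if token not in stopwords:
            yield token

# Equivalent to chaining stages with `|` in Whoosh:
tokens = list(stop_filter(lowercase_filter(tokenize("The Art of Search"))))
# tokens == ["art", "search"]
```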

You can define a filter like:

from whoosh.analysis import Filter

class LemmatizationFilter(Filter):
    def __call__(self, stream):
        for token in stream:
            token.text = lemmatize(token.text)  # your separate lemmatizing code
            yield token
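As a concrete sketch of what the filter body might do, here is a plain-Python version using a tiny lookup table as a stand-in lemmatizer (in practice you would plug in your own lemmatizing code, e.g. NLTK's `WordNetLemmatizer`; the `LEMMAS` dict is purely hypothetical):

```python
# LEMMAS is a hypothetical stand-in for a real lemmatizer.
LEMMAS = {"mice": "mouse", "ran": "run"}

def lemmatization_filter(stream):
    # Rewrite each token to its lemma, passing unknown tokens through unchanged.
    for token in stream:
        yield LEMMAS.get(token, token)

result = list(lemmatization_filter(iter(["mice", "ran", "fast"])))
# result == ["mouse", "run", "fast"]
```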
Assem
  • This is the analyzer I am using now: `my_analyzer = RegexTokenizer() | StopFilter() | LowercaseFilter() | StemFilter() | Lemmatizer()`. In this analyzer, will the lemmatizer override the stem filter, or are both stemming and lemmatizing applied to my index? – Shruti h Nov 20 '17 at 20:21