
I have a question about Lucene stemmers. Does Lucene keep both the stemmed and the unstemmed words, or does it replace the unstemmed words with their stems?

For example, if a record contains "everyone loves cats", will it be indexed as "everyone loves love cats cat" or as "everyone love cat"?

Does it use the same strategy for both queries and records?

Mr.Boy
  • Not a direct answer, but in my experience keeping both is a good strategy if you want to improve recall. – Fred Foo Jun 20 '13 at 20:55

1 Answer


Generally, only the stemmed version is kept. That is, in your example, the end result will be "everyone love cat" rather than "everyone loves love cats cat" or some similar combination (the exact stems depend on the stemmer you use).
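To make the "replace, don't keep both" behaviour concrete, here is a minimal sketch in plain Python. It is not Lucene code: `toy_stem` and `analyze` are hypothetical names, and the stemmer is a deliberately naive stand-in that only strips a trailing "s".

```python
def toy_stem(token):
    # Hypothetical stand-in for a real stemmer: strips one trailing "s".
    # A real stemmer (e.g. Porter) applies many more rules than this.
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def analyze(text):
    """Lowercase, split on whitespace, and replace each token with its stem."""
    return [toy_stem(tok) for tok in text.lower().split()]


tokens = analyze("everyone loves cats")
print(tokens)  # ['everyone', 'love', 'cat']
```

Each original token is swapped for its stem, so "loves" and "cats" never reach the index in this sketch.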

You are expected to use the same stemmer both when indexing and when querying. There may be some stemming filters that, like SynonymFilter, let you keep the original token, but doing this and then running unstemmed queries will tend to cause PhraseQueries not to work correctly (see the note in the SynonymFilter docs on this very topic). I don't believe the most common stemming filters (e.g. PorterStemFilter) provide that functionality.

If you need to be able to search unstemmed data for some reason, I would recommend storing a second field that is entirely unstemmed for that purpose.
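As a rough illustration of the two-field idea, the following plain-Python sketch (again not Lucene; `TwoFieldIndex`, `toy_stem`, and the field names are made up for this example) indexes every document twice: once through a toy stemmer and once as-is. Queries against the stemmed field go through the same stemmer, while queries against the exact field do not.

```python
from collections import defaultdict


def toy_stem(token):
    # Hypothetical stand-in for a real stemmer: strips one trailing "s".
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


class TwoFieldIndex:
    """Toy inverted index with a stemmed field and an unstemmed field."""

    def __init__(self):
        self.stemmed = defaultdict(set)  # stem -> document ids
        self.exact = defaultdict(set)    # original token -> document ids

    def add(self, doc_id, text):
        for tok in text.lower().split():
            self.stemmed[toy_stem(tok)].add(doc_id)
            self.exact[tok].add(doc_id)

    def search_stemmed(self, term):
        # The query term is stemmed the same way as the indexed text.
        return self.stemmed.get(toy_stem(term.lower()), set())

    def search_exact(self, term):
        # No stemming on either side, so only the literal form matches.
        return self.exact.get(term.lower(), set())


index = TwoFieldIndex()
index.add(1, "everyone loves cats")
index.add(2, "my cat sleeps")

print(sorted(index.search_stemmed("cats")))  # [1, 2]
print(sorted(index.search_exact("cats")))    # [1]
```

The stemmed field gives the broader match, and the exact field is available when the literal word form matters.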

femtoRgon
  • Thank you very much. I got what you said. So, if we want to use a couple of filters, is this a good sequence to follow: LowerCaseFilter -> SynonymFilter -> StopWordsFilter -> StemmerFilter? – Mr.Boy Jun 20 '13 at 20:53
  • I would probably place a `SynonymFilter` at the end, after stemming, and be sure the synonyms were defined using stemmed words (or possibly run them through the same stemmer as you construct the `SynonymFilter`). Seems fine to me though. `StandardTokenizer` -> `StandardFilter` -> `LowerCaseFilter` -> `StopWordsFilter` -> linguistic-StemmerFilter-of-some-kind forms a pretty reasonable baseline pattern for a language-specific Analyzer. – femtoRgon Jun 20 '13 at 21:49
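
The ordering suggested in the last comment (lowercase, stop words, stemming, then synonyms defined on stemmed words) can be sketched in plain Python. This is only an illustration under loud assumptions: none of these names are Lucene APIs, the stop list and synonym entries are invented, and the toy simply appends synonyms after the token instead of placing them at the same position as a real synonym filter would.

```python
STOP_WORDS = {"the", "a", "and"}  # made-up stop list for the example


def toy_stem(token):
    # Hypothetical stand-in for a real stemmer: strips one trailing "s".
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def build_synonyms(raw):
    # Run both sides of every synonym rule through the same stemmer,
    # so the rules line up with the stemmed tokens they will see.
    return {toy_stem(k): [toy_stem(v) for v in vs] for k, vs in raw.items()}


SYNONYMS = build_synonyms({"kittens": ["cats"]})  # becomes {"kitten": ["cat"]}


def analyze(text):
    out = []
    for tok in text.lower().split():      # lowercase + whitespace tokenizing
        if tok in STOP_WORDS:             # stop-word removal
            continue
        stem = toy_stem(tok)              # stemming
        out.append(stem)
        out.extend(SYNONYMS.get(stem, []))  # synonyms last, keyed by stem
    return out


print(analyze("The kittens and the dogs"))  # ['kitten', 'cat', 'dog']
```

Had the synonym rule been left as "kittens" without stemming, it would never fire here, because the token arriving at the synonym step is already "kitten".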