
I currently use Lucene and Elasticsearch and have the following problem: I need to get the stemmed form or lemma of a diminutive word. For instance:

  • doggy -> dog
  • kitty -> cat

etc.

But I get the following results:

  • doggy -> doggi
  • kitty -> kitti

Is there any way to get the root or original word form of a diminutive? It does not have to be a ready-to-use library; any algorithm or approach will do.

Target language: Russian. For example:

  • собачка -> собака
  • кошечка -> кошка

Thanks in advance!

Ivan Kurchenko
  • What kind of chain have you used for English stemming? I would be surprised if you got this by using `PorterStemFilter`. – mindas Sep 09 '14 at 14:52
  • You cannot (and should not) get *cat* from *kitty* using stemming or lemmatization: "cat" is neither the lemma nor the stem of "kitty". – Chthonic Project Nov 05 '14 at 21:22

1 Answer


First, as a side note: what you're trying to do isn't typically called stemming or lemmatization.

Your first issue is mapping the observed token (e.g. собачка) to its normalised form (e.g. собака). Naively, this could be done by creating a SynonymFilter which uses a SynonymMap mapping diminutive forms to their canonical forms. However, you'll likely run into problems with any natural language because not all derivations are unambiguous. For example, in German, Mädel ('girl'/'lass') could be a diminutive form of Magd (an archaic word meaning 'young woman'/'maid') or of Made ('maggot').
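To make the naive mapping concrete, here is a minimal sketch in plain Python rather than Lucene code. The table `DIMINUTIVE_TO_CANONICAL` and the function `normalize_tokens` are hypothetical names of my own, and the entries are just the examples from the question; a real table would have to come from a dictionary resource for your language.

```python
# Hypothetical hand-built table: diminutive form -> canonical form.
# The entries are only the examples from the question.
DIMINUTIVE_TO_CANONICAL = {
    "собачка": "собака",
    "кошечка": "кошка",
    "doggy": "dog",
    "kitty": "cat",
}


def normalize_tokens(tokens):
    """Replace each known diminutive with its canonical form.

    Lookup is case-insensitive; tokens not in the table are returned unchanged.
    """
    return [DIMINUTIVE_TO_CANONICAL.get(token.lower(), token) for token in tokens]
```

In Elasticsearch, the same idea would probably be expressed as explicit mapping rules for a synonym token filter (for example `собачка => собака`) rather than as custom code. Note that a one-to-one table like this cannot represent an ambiguous form such as Mädel.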

One way of disambiguating such forms is to calculate the probability of each canonical form appearing in the given context (e.g. the history of the preceding n tokens) and then replace the diminutive form with the most probable canonical form, using a custom-made TokenFilter to do so. See e.g. the Wikipedia entry for word-sense disambiguation for different approaches.
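As a rough illustration of the probabilistic idea, the sketch below scores each candidate canonical form with a naive-Bayes-style estimate over the preceding tokens and keeps the best one. It is plain Python, not a Lucene TokenFilter, and everything in it is an assumption made for illustration: the names `CANDIDATES`, `train`, `score` and `disambiguate`, the add-one smoothing, the window of three tokens, and the premise that you have a tokenised, lower-cased corpus in which the canonical forms occur.

```python
import math
from collections import Counter, defaultdict

# Hypothetical table: ambiguous diminutive -> possible canonical forms.
CANDIDATES = {"mädel": ["magd", "made"]}


def train(sentences, window=3):
    """Count tokens, and the words seen within `window` tokens before each token."""
    form_counts = Counter()
    context_counts = defaultdict(Counter)
    for sentence in sentences:
        for i, token in enumerate(sentence):
            form_counts[token] += 1
            for ctx in sentence[max(0, i - window):i]:
                context_counts[token][ctx] += 1
    return form_counts, context_counts


def score(form, history, form_counts, context_counts):
    """log P(form) + sum of log P(ctx | form), with add-one smoothing."""
    vocab_size = len(form_counts)
    total = sum(form_counts.values())
    log_p = math.log((form_counts[form] + 1) / (total + vocab_size))
    ctx_total = sum(context_counts[form].values())
    for ctx in history:
        log_p += math.log((context_counts[form][ctx] + 1) / (ctx_total + vocab_size))
    return log_p


def disambiguate(tokens, form_counts, context_counts, window=3):
    """Replace each ambiguous diminutive with its most probable canonical form."""
    result = []
    for i, token in enumerate(tokens):
        candidates = CANDIDATES.get(token.lower())
        if not candidates:
            result.append(token)  # not a known diminutive: keep as-is
            continue
        history = [t.lower() for t in tokens[max(0, i - window):i]]
        best = max(
            candidates,
            key=lambda form: score(form, history, form_counts, context_counts),
        )
        result.append(best)
    return result
```

You would call `train` once on your corpus and then pass the two count tables to `disambiguate` for each token stream. With realistic data you would want a better context model than bag-of-preceding-words counts, but the control flow inside a custom filter would look much the same.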

errantlinguist