
I am implementing a keyword-based search project. While processing the input, the program must extract keywords as follows:

  1. ignore punctuation marks (e.g. . ! ? ,)
  2. ignore connecting words (e.g. and, or, so)
  3. last and most important, find the root of each word; for example, communiti or communities must become community.
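The three steps above could be sketched roughly as follows in Python, using only the standard library. The stop-word set and the `LEMMAS` lookup table are tiny hypothetical placeholders, not part of any real library; a real project would replace them with a full stop-word list and a dictionary-based root-word lookup.

```python
import re

# Hypothetical, deliberately tiny stop-word list (assumption for illustration).
STOPWORDS = {"and", "or", "so", "the", "a", "an", "of", "in", "to", "is", "are"}

# Hypothetical lookup table standing in for a real root-word dictionary.
LEMMAS = {"communities": "community", "communiti": "community"}

def extract_keywords(text):
    # 1. Tokenise: keep runs of letters/digits, which drops punctuation.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    # 2. Drop connecting words.
    tokens = [t for t in tokens if t not in STOPWORDS]
    # 3. Map each token to its root form when the table knows it.
    return [LEMMAS.get(t, t) for t in tokens]

print(extract_keywords("Communities and towns, so what?"))
```

Words missing from the table pass through unchanged, so the quality of step 3 depends entirely on how complete that dictionary is.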

I used http://snowball.tartarus.org/, but it did not work properly.

Rauf Aghayev
  • What do you mean by "it was not working properly"? Snowball is a language to write stemmers, so you'd have to write a stemmer yourself using it. What did you try, what do you actually need? – lenz May 22 '15 at 08:29
  • Some hints: the tasks you're mentioning are canonically called (1) tokenisation, (2) stopword removal and (3) lemmatisation (although stemming might be sufficient), maybe also spelling normalisation. This should help you find some tools. If you're working with English texts (which your example implies), there should be resources readily available in all major programming languages. – lenz May 22 '15 at 08:37
  • For example, _Snowball_ changed `communities` to `communiti` and `false` to `fals`, but these are not real words. What I need is to find the real word for a given word: in these cases `communities` must be changed to `community` and `false` must remain the same. – Rauf Aghayev May 22 '15 at 08:37
  • Okay: stemming produces some kind of search key. It doesn't have to be a proper word. But that's okay in a search index, because both text and query are treated the same way – they all map to the same key (which just as well could be only a number rather than a string). If you need proper words, you need to perform **lemmatisation**, not stemming. – lenz May 22 '15 at 08:40
  • Thanks for your explanation, it is clear now. In my case I should use **lemmatization** not stemming. – Rauf Aghayev May 22 '15 at 08:42
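
The point made in the comments, that a stem only has to be a consistent key and not a proper word, can be illustrated with a small sketch. `toy_stem` below is a crude hypothetical suffix stripper written for this example, not the actual Snowball algorithm; it merely reproduces the `communiti` and `fals` outputs mentioned above.

```python
from collections import defaultdict

def toy_stem(word):
    """Crude suffix stripping (hypothetical; not the real Snowball rules)."""
    word = word.lower()
    if word.endswith("ies"):
        return word[:-3] + "i"
    if word.endswith("y") and len(word) > 2:
        return word[:-1] + "i"
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    if word.endswith("e") and len(word) > 3:
        return word[:-1]
    return word

def build_index(docs):
    # Map each stem (the search key) to the ids of documents containing it.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.split():
            index[toy_stem(word)].add(doc_id)
    return index

def search(index, query):
    # The query goes through the same stemmer, so it lands on the same key.
    return index.get(toy_stem(query), set())

docs = {1: "local communities", 2: "false alarm"}
index = build_index(docs)
print(search(index, "community"))
print(search(index, "false"))
```

Because documents and queries are reduced by the same function, `community` finds the document that contains `communities` even though the shared key `communiti` is not a word. If the keys must be shown to users as real words, a dictionary-based lemmatiser is needed instead.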
