I have worked with Lucene for indexing documents and searching them, but only for English text. Now I have a project in Kurdish. Kurdish uses some Arabic Unicode characters plus several others; here is a table of the Unicode characters used in the Kurdish-Arabic script.

My question is: how do I create an Analyzer for this language, or can I use the Arabic Analyzer for this purpose?

solidfox

2 Answers

Lucene has a list of other analyzers, including Arabic. I'm afraid none of them specifically targets Kurdish, but maybe you can extend the Arabic analyzer to fit your needs?

Just bear in mind that all these analyzers come separately from the main Lucene distribution.

mindas
  • I have already customized a PersianAnalyzer, which is more relevant to Kurdish than ArabicAnalyzer, by providing a new stopword list and changing the normalization class. However, stemming is another issue. Any suggestions, please? – solidfox Dec 27 '12 at 20:31

To answer your question about how to create a custom Analyzer for a new language: the book "Lucene in Action" covers the creation of custom analyzers in considerable detail. You can reuse a lot of the code found in other analyzers and change only what you need. Lucene is open source and very extensible, so profiling these changes is pretty easy.
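
As a rough illustration of the two pieces that usually change per language (character normalization and the stopword list), here is a minimal sketch in plain Java with no Lucene dependency. Everything in it is an assumption made for illustration: the class name `KurdishNormalizerSketch` is hypothetical, the character mappings are example choices rather than a vetted Kurdish normalization table, and the two stopwords are placeholders. In a real analyzer the same logic would sit inside Lucene's filter chain; that wiring is left out here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Lucene: shows the kind of per-character
// normalization and stopword removal a custom analyzer would perform.
class KurdishNormalizerSketch {

    // Placeholder stopwords (assumption); replace with a real Kurdish list.
    // Entries must already be in normalized form.
    static final Set<String> STOPWORDS =
            new HashSet<>(Arrays.asList("\u0648", "\u0644\u06D5"));

    // Folds a few Arabic-keyboard variants to one code point each.
    // The mappings below are illustrative assumptions only.
    static String normalize(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            switch (c) {
                case '\u064A': // ARABIC LETTER YEH
                case '\u0649': // ARABIC LETTER ALEF MAKSURA
                    out.append('\u06CC'); // ARABIC LETTER FARSI YEH
                    break;
                case '\u0643': // ARABIC LETTER KAF
                    out.append('\u06A9'); // ARABIC LETTER KEHEH
                    break;
                case '\u0640': // ARABIC TATWEEL: drop it
                    break;
                default:
                    out.append(c);
            }
        }
        return out.toString();
    }

    // Normalizes, splits on whitespace, and drops stopwords.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : normalize(text).trim().split("\\s+")) {
            if (!token.isEmpty() && !STOPWORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Sample input chosen only to exercise the mappings.
        System.out.println(analyze("\u0643\u0640\u062A \u0648 \u0628\u064A"));
    }
}
```

Because the stopword check runs after normalization, the stopword list has to be stored in normalized form too. Stemming is not covered by this sketch.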

Bart Czernicki
  • I have already customized a PersianAnalyzer for this purpose by providing a new stopword list and changing the normalization class. However, stemming is another issue. Any suggestions, please? – solidfox Dec 27 '12 at 20:18