
I'm using Mahout to do topic discovery with LDA. To prepare my data I use seq2sparse, which tokenizes the documents and creates n-grams. However, it does not support word stemming by default. Does Mahout have any built-in word stemming? If not, should I implement my own? Any recommendations?

HHH

1 Answer


You can specify the analyzer with the seq2sparse command:

$MAHOUT_HOME/bin/mahout seq2sparse
             ...
             --analyzerName (-a) analyzerName  The class name of the analyzer 

The analyzer is an Apache Lucene analyzer, so you have to give its fully qualified class name, for example:

org.apache.lucene.analysis.fr.FrenchAnalyzer
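For English text, a rough sketch of a full invocation could look like the following. Lucene's EnglishAnalyzer is an Analyzer subclass whose filter chain includes Porter stemming, so passing it with -a should give you stemmed tokens without writing your own stemmer. The input and output paths are hypothetical placeholders, and the n-gram size and tf weighting are only illustrative choices, not recommendations. The snippet prints the command as a dry run so that you can check it before launching the job.

```shell
# Hypothetical HDFS paths; replace them with your own.
INPUT=/user/me/docs-seq
OUTPUT=/user/me/docs-vectors

# EnglishAnalyzer is an Analyzer (not a bare stemmer); its filter chain
# includes Porter stemming.
ANALYZER=org.apache.lucene.analysis.en.EnglishAnalyzer

# -ng = max n-gram size, -wt = weighting (tf or tfidf), -ow = overwrite output.
ARGS="seq2sparse -i $INPUT -o $OUTPUT -a $ANALYZER -ng 2 -wt tf -ow"

# Dry run: print the command for inspection.
echo "$MAHOUT_HOME/bin/mahout $ARGS"

# Once it looks right, run it for real:
# "$MAHOUT_HOME"/bin/mahout $ARGS
```

The value given to -a has to be an analyzer class; a stemmer or token filter class on its own will not work there.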

I suggest that you read the official documentation for more information about what you can do with the seq2sparse command. You'll also need to read some Lucene documentation.

PS: You should use the same Lucene version as the one Mahout uses.

eliasah
  • Thanks. I looked into Lucene and it looks like there are different stemming algorithms, e.g. ``EnglishMinimalStemmer`` and ``EnglishStemmer``. Do you know which one is better? My other question is: how can I find out which version of Lucene I have? – HHH May 06 '15 at 17:13
  • The choice of stemming algorithm depends on your needs and use case. I can't say which is better. You have to evaluate both and see which one fits your model better. – eliasah May 06 '15 at 17:39
  • You can find Lucene's version in the Mahout documentation or, for a closer look, you can check the pom.xml in the Mahout source code. – eliasah May 06 '15 at 17:41
  • I tried to pass a stemmer as the analyzer, but it gives me an error message. It looks like only the EnglishAnalyzer class (or similar ones like FrenchAnalyzer) can be used, not a stemmer? – HHH May 06 '15 at 19:17
  • What version of Mahout are you using? – eliasah May 06 '15 at 19:18
  • The latest one which comes with Hortonworks 2.2 (mahout-examples-0.9.0.2.2.0.0-2041-job) – HHH May 06 '15 at 19:21
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/77122/discussion-between-h-z-and-eliasah). – HHH May 06 '15 at 19:25