I'm using Mahout for topic discovery with LDA. To prepare my data I use seq2sparse,
which tokenizes the documents and creates n-grams. However, it does not apply word stemming by default. Does Mahout have any built-in word stemming? If not, should I implement my own? Any recommendations?

HHH
1 Answer
0
You can specify your analyzer with the seq2sparse
command:
$MAHOUT_HOME/bin/mahout seq2sparse
...
--analyzerName (-a) analyzerName The class name of the analyzer
The analyzer is an Apache Lucene analyzer, so you have to give its fully qualified class name, for example:
org.apache.lucene.analysis.fr.FrenchAnalyzer
I suggest that you read the official documentation for more information about what you can do with the seq2sparse
command. You'll also need to read some Lucene documentation.
PS: You should use the same Lucene version as the one Mahout uses.
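As a rough sketch, the full invocation could be assembled like this in Python. The helper name `build_seq2sparse_command`, the `docs-seq`/`docs-vectors` paths and the `/opt/mahout` location are placeholders made up for illustration; `-i`, `-o` and `-a` are the short forms of seq2sparse's input, output and analyzer options. `EnglishAnalyzer` is used here on the assumption that it suits your corpus: Lucene's English analyzer includes a Porter stemming step, which is why an analyzer class rather than a bare stemmer class is passed.

```python
import os
import shlex


def build_seq2sparse_command(input_dir, output_dir, analyzer_class, mahout_home=None):
    """Return the argv list for a seq2sparse run with a custom Lucene analyzer.

    All paths are caller-supplied; nothing is executed here.
    """
    home = mahout_home or os.environ.get("MAHOUT_HOME", "")
    # Fall back to a plain "mahout" on PATH when no home directory is known.
    mahout_bin = os.path.join(home, "bin", "mahout") if home else "mahout"
    return [
        mahout_bin, "seq2sparse",
        "-i", input_dir,        # SequenceFile input (placeholder path)
        "-o", output_dir,       # vector output (placeholder path)
        "-a", analyzer_class,   # fully qualified Lucene analyzer class
    ]


cmd = build_seq2sparse_command(
    "docs-seq",
    "docs-vectors",
    "org.apache.lucene.analysis.en.EnglishAnalyzer",
    mahout_home="/opt/mahout",  # hypothetical install location
)
# Print a shell-safe version of the command for copy and paste.
print(" ".join(shlex.quote(part) for part in cmd))
```

The command is only printed, not run, so you can check it before launching the job on the cluster.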

eliasah
-
Thanks. I looked into Lucene and it looks like there are different stemming algorithms, e.g. ``EnglishMinimalStemmer`` and ``EnglishStemmer``. Do you know which one is better? My other question is: how can I find out which Lucene version I have? – HHH May 06 '15 at 17:13
-
The choice of stemming algorithm depends on your needs and use case. I can't say which is better. You have to evaluate both and see which one fits your model better. – eliasah May 06 '15 at 17:39
-
You can find the Lucene version in the Mahout documentation or, for a closer look, you can check the pom.xml in the Mahout source code. – eliasah May 06 '15 at 17:41
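A minimal sketch of the pom.xml check, assuming the build file declares the version in a property element named `lucene.version` (an assumption; verify it against your copy of the file). The helper names and the inline sample POM are made up for illustration, and the version `1.2.3` in it is a placeholder, not Mahout's real value.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag name."""
    return tag.rsplit("}", 1)[-1]


def lucene_version_from_pom(pom_xml):
    """Return the text of the first <lucene.version> element, or None."""
    root = ET.fromstring(pom_xml.strip())
    for element in root.iter():
        # Skip non-element nodes, whose tag is not a string.
        if not isinstance(element.tag, str):
            continue
        if local_name(element.tag) == "lucene.version" and element.text:
            return element.text.strip()
    return None


# Hypothetical sample; replace with the contents of the real pom.xml.
SAMPLE_POM = """
<project xmlns="http://maven.apache.org/POM/4.0.0">
  <properties>
    <lucene.version>1.2.3</lucene.version>
  </properties>
</project>
"""

print(lucene_version_from_pom(SAMPLE_POM))
```

To use it on a real checkout, read the file with `open("pom.xml").read()` and pass the text to `lucene_version_from_pom`.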
-
I tried to pass a stemmer as the analyzer, but it gives me an error message. It looks like only the EnglishAnalyzer class (or similar ones like FrenchAnalyzer) can be used, not a stemmer? – HHH May 06 '15 at 19:17
-
What version of mahout are you using? – eliasah May 06 '15 at 19:18
-
The latest one which comes with Hortonworks 2.2 (mahout-examples-0.9.0.2.2.0.0-2041-job) – HHH May 06 '15 at 19:21
-
Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/77122/discussion-between-h-z-and-eliasah). – HHH May 06 '15 at 19:25