I am running Mahout's LDA on EC2 (using Whirr). What is the largest vocabulary you have been able to use in practice? Could you share the Hadoop/EC2 settings that worked for you?
Ideally, I would like to run LDA on a corpus of 3M documents (1B tokens) with a dictionary of 20M unique terms.
I have tried other MapReduce implementations of LDA (hadoop-lda, Mr. LDA) and did not manage to scale them very far (please prove me wrong!).
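For context, here is roughly the kind of setup I mean; everything below is an illustrative sketch with placeholder S3 paths, heap sizes, and flag values (not a known-good configuration), assuming Mahout's CVB0 driver (`mahout cvb`) and Hadoop 1.x property names:

```shell
# Illustrative only: placeholder paths and values, assuming the
# `mahout cvb` (CVB0 LDA) driver and Hadoop 1.x configuration keys.

# In mapred-site.xml (or the equivalent Whirr template), the per-task
# heap matters because each mapper holds a (num_topics x vocabulary)
# topic-term count matrix in memory:
#   <property>
#     <name>mapred.child.java.opts</name>
#     <value>-Xmx4096m</value>
#   </property>

# -i:    term-frequency vectors (e.g. seq2sparse output)
# -dict: the dictionary file produced alongside the vectors
# -k:    number of topics;  -x: max iterations;  -ow: overwrite output
mahout cvb \
  -i s3://my-bucket/vectors/tf-vectors \
  -dict s3://my-bucket/vectors/dictionary.file-0 \
  -o s3://my-bucket/lda-model \
  -k 200 \
  -x 25 \
  -ow
```

With a 20M-term dictionary, my rough concern is that the per-mapper topic-term matrix alone (num_topics × 20M doubles) may not fit in a child JVM heap, which is why I am asking what heap/instance sizes have worked for others.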