I am running Mahout's LDA on EC2 (using Whirr). What is the largest vocabulary you have been able to use in practice? Could you share the Hadoop/EC2 settings that worked for you?
Ideally, I would like to run LDA on a corpus of 3M documents (1B tokens) with a dictionary of 20M unique terms.
I have tried other MapReduce implementations of LDA (hadoop-lda, Mr. LDA) and did not manage to scale them very far (please prove me wrong!).
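For context, here is roughly the kind of setup I mean; everything below is an illustrative sketch with placeholder S3 paths, heap sizes, and flag values (not a known-good configuration), assuming Mahout's CVB0 driver (`mahout cvb`) and Hadoop 1.x property names:

```shell
# Illustrative only: placeholder paths and values, assuming the
# `mahout cvb` (CVB0 LDA) driver and Hadoop 1.x configuration keys.

# In mapred-site.xml (or the equivalent Whirr template), the per-task
# heap matters because each mapper holds a (num_topics x vocabulary)
# topic-term count matrix in memory:
#   <property>
#     <name>mapred.child.java.opts</name>
#     <value>-Xmx4096m</value>
#   </property>

# -i:    term-frequency vectors (e.g. seq2sparse output)
# -dict: the dictionary file produced alongside the vectors
# -k:    number of topics;  -x: max iterations;  -ow: overwrite output
mahout cvb \
  -i s3://my-bucket/vectors/tf-vectors \
  -dict s3://my-bucket/vectors/dictionary.file-0 \
  -o s3://my-bucket/lda-model \
  -k 200 \
  -x 25 \
  -ow
```

With a 20M-term dictionary, my rough concern is that the per-mapper topic-term matrix alone (num_topics × 20M doubles) may not fit in a child JVM heap, which is why I am asking what heap/instance sizes have worked for others.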