
I am waiting for my membership on the mailing list to be confirmed, so I thought I would ask here to maybe speed things up a little.

I am writing my master's thesis on topic modeling and use Mallet implementations of LDA and HLDA.

I am working with a corpus of over 4 million documents. While LDA (ParallelTopicModel) handles the dataset decently and I don't encounter any issues with it, HLDA is unable to get further than, say, 5-6 iterations before filling up all the available memory (I even ran the program with 90 GB of RAM). On smaller datasets (10-20k documents) it works fine.

That's how I train the model:

import cc.mallet.topics.HierarchicalLDA;
import cc.mallet.util.Randoms;

HierarchicalLDA hierarchicalLDAModel = new HierarchicalLDA();
hierarchicalLDAModel.initialize(trainInstances, testInstances, numLevels, new Randoms());
hierarchicalLDAModel.estimate(numIterations);

I'd gladly provide any other information you might need for troubleshooting, just comment and let me know.

Thank you very much in advance!

GileBrt
wojtuch

1 Answer


hLDA is a non-parametric model, which means that the number of parameters grows with the data size. There is currently no way to impose a maximum number of topics. The most effective way to limit the number of topics is to increase the topic-word smoothing parameter eta (NOT the CRP parameters). If this parameter is small, the model prefers to create a new topic rather than add a low-probability word to an existing topic.
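To see why eta matters, here is a toy illustration of the mechanism described above (not Mallet's actual sampler): under symmetric Dirichlet smoothing, the probability of a word w in topic k is (n_wk + eta) / (n_k + V * eta). With a tiny eta, a word unseen in a topic gets near-zero probability there, so the sampler is pushed toward creating new topics; a larger eta raises that floor. (In Mallet, eta would be set on the HierarchicalLDA object before training — check the class's fields/setters in your Mallet version.)

```java
public class EtaDemo {
    // Smoothed probability of a word under a topic:
    // (count of w in topic + eta) / (topic token total + vocabSize * eta)
    static double wordProb(int wordCount, int topicTotal, int vocabSize, double eta) {
        return (wordCount + eta) / (topicTotal + vocabSize * eta);
    }

    public static void main(String[] args) {
        int vocab = 50_000;       // hypothetical vocabulary size
        int topicTotal = 10_000;  // tokens already assigned to an existing topic

        // A word never seen in this topic:
        double pSmallEta = wordProb(0, topicTotal, vocab, 0.01);
        double pLargeEta = wordProb(0, topicTotal, vocab, 1.0);

        System.out.printf("unseen word, eta=0.01: %.2e%n", pSmallEta);
        System.out.printf("unseen word, eta=1.00: %.2e%n", pLargeEta);

        // Larger eta gives unseen words a higher probability under existing
        // topics, so the model creates fewer new topics (and fewer parameters).
        if (pLargeEta <= pSmallEta) throw new AssertionError("eta effect inverted");
    }
}
```

This is only a sketch of the smoothing arithmetic; in hLDA the topic-creation decision also involves the nested CRP prior, but the direction of eta's effect is as shown.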

David Mimno
  • Thank you for the answer, professor! To clarify that I understood you correctly -- the only thing that could possibly influence the hunger for memory is tweaking eta? I will run the experiments and report on the outcome – wojtuch Dec 24 '16 at 13:55
  • Unfortunately, trying out different values for eta doesn't help in my case -- every time, the program crashes after a couple of hours and several (3-6) iterations – wojtuch Jan 10 '17 at 16:43