
I am waiting for my membership on the mailing list to be confirmed, so I thought I would ask here to maybe speed things up a little.

I am writing my master's thesis on topic modeling and use Mallet implementations of LDA and HLDA.

I am working with a corpus of over 4 million documents. While LDA (ParallelTopicModel) handles the dataset decently and I don't encounter any issues with it, HLDA is unable to get further than, say, 5-6 iterations before filling up all the available memory (I even ran the program with 90 GB of RAM). On smaller datasets (10-20k documents) it works fine.

That's how I train the model:

import cc.mallet.topics.HierarchicalLDA;
import cc.mallet.util.Randoms;

HierarchicalLDA hierarchicalLDAModel = new HierarchicalLDA();
hierarchicalLDAModel.initialize(trainInstances, testInstances, numLevels, new Randoms());
hierarchicalLDAModel.estimate(numIterations);

I'd gladly provide any other information you might need for troubleshooting, just comment and let me know.

Thank you very much in advance!

GileBrt
wojtuch

1 Answer


hLDA is a non-parametric model, which means that the number of parameters grows with the data size. There is currently no way to impose a maximum number of topics. The most effective way to limit the number of topics is to increase the topic-word smoothing parameter eta (NOT the CRP parameters). If this parameter is small, the model prefers to create a new topic rather than add a low-probability word to an existing topic.
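To see why eta matters, here is a toy illustration of the mechanism described above (not Mallet's actual sampler): under symmetric Dirichlet smoothing, the probability of a word w in topic k is (n_wk + eta) / (n_k + V * eta). With a tiny eta, a word unseen in a topic gets near-zero probability there, so the sampler is pushed toward creating new topics; a larger eta raises that floor. (In Mallet, eta would be set on the HierarchicalLDA object before training — check the class's fields/setters in your Mallet version.)

```java
public class EtaDemo {
    // Smoothed probability of a word under a topic:
    // (count of w in topic + eta) / (topic token total + vocabSize * eta)
    static double wordProb(int wordCount, int topicTotal, int vocabSize, double eta) {
        return (wordCount + eta) / (topicTotal + vocabSize * eta);
    }

    public static void main(String[] args) {
        int vocab = 50_000;       // hypothetical vocabulary size
        int topicTotal = 10_000;  // tokens already assigned to an existing topic

        // A word never seen in this topic:
        double pSmallEta = wordProb(0, topicTotal, vocab, 0.01);
        double pLargeEta = wordProb(0, topicTotal, vocab, 1.0);

        System.out.printf("unseen word, eta=0.01: %.2e%n", pSmallEta);
        System.out.printf("unseen word, eta=1.00: %.2e%n", pLargeEta);

        // Larger eta gives unseen words a higher probability under existing
        // topics, so the model creates fewer new topics (and fewer parameters).
        if (pLargeEta <= pSmallEta) throw new AssertionError("eta effect inverted");
    }
}
```

This is only a sketch of the smoothing arithmetic; in hLDA the topic-creation decision also involves the nested CRP prior, but the direction of eta's effect is as shown.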

David Mimno
  • Thank you for the answer, professor! To clarify that I understood you correctly -- the only thing that could possibly influence the hunger for memory is tweaking eta? I will run the experiments and report on the outcome – wojtuch Dec 24 '16 at 13:55
  • Unfortunately, trying out different values for eta doesn't help in my case -- every time, the program crashes after a couple of hours and several (3-6) iterations – wojtuch Jan 10 '17 at 16:43