
I have a corpus of ~1,400 documents. I did all the text cleansing using the tm package; my last step was creating the document-term matrix (DTM). I am trying to train an LDA model based on 200 documents examined by a human and the topics (categories) that were assigned to them. Unfortunately, I can't share a reproducible example.

Can someone show how this is done, using one of the freely available data sets as an example?

Sir Oliver
  • As far as I am aware, LDA is an unsupervised machine learning algorithm, so the model does not need to be trained on labels in order to produce outputs. The algorithm looks for structures latent in the corpus to produce topic-word allocations. There are supervised versions of LDA, like the one here: https://www.cs.princeton.edu/~blei/papers/BleiMcAuliffe2007.pdf, but I do not think they are implemented in the topic-modelling package – DotPi Oct 12 '16 at 20:46
  • You are right. When I took the LDA approach, the optimal number of topics was 2–3, which is also shown by the elbow curve for k-means clustering. That is much less than what the human examination produced. What are my alternatives? – Sir Oliver Oct 12 '16 at 20:49
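To illustrate the unsupervised behaviour described in the comments, here is a minimal sketch using scikit-learn's `LatentDirichletAllocation` (an assumption on my part — the question uses R's tm package, but the mechanics are the same): LDA receives only a document-term matrix and infers topic distributions without ever seeing labels. The tiny corpus below is hypothetical stand-in data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical stand-in corpus; real use would have ~1,400 cleaned documents.
docs = [
    "stock market trading prices shares",
    "market prices rise as trading volume grows",
    "football match goal score team",
    "the team won the match with a late goal",
]

# Build the document-term matrix (the DTM step done with tm in R).
dtm = CountVectorizer().fit_transform(docs)

# Fit LDA with k=2 topics; note that no labels are supplied anywhere.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(dtm)

print(doc_topics.shape)  # one topic distribution per document: (4, 2)
```

Because no labels enter the model, there is no way to force its topics to match human-assigned categories — which is exactly why the optimal k found by the model can disagree with the annotation.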

1 Answer


If you have annotated training data, why not use supervised classification techniques like SVM or logistic regression, which are quite good for text classification tasks? Scikit-learn in Python has implementations of these classifiers, and you can use them directly for classification.
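A minimal sketch of the suggestion above, with hypothetical stand-in data in place of the ~200 human-annotated documents: TF-IDF features feed a logistic regression classifier inside a scikit-learn `Pipeline`, which can then label the remaining documents.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical stand-in for the human-annotated training documents.
train_docs = [
    "stock market trading prices shares",
    "market prices rise as trading volume grows",
    "football match goal score team",
    "the team won the match with a late goal",
]
train_labels = ["finance", "finance", "sports", "sports"]

# TF-IDF vectorisation plus logistic regression in one pipeline;
# swapping in an SVM would mean replacing LogisticRegression with LinearSVC.
clf = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("logreg", LogisticRegression()),
])
clf.fit(train_docs, train_labels)

# Predict categories for the remaining unlabelled documents.
print(clf.predict(["shares and prices on the market"]))
```

Unlike LDA, this approach uses the human-assigned categories directly, so the number and meaning of the classes is fixed by the annotation rather than inferred from the corpus.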

Wasi Ahmad