
I have a corpus of ~1,400 documents. I did all the text cleansing using the tm package; my last step was creating the document-term matrix (DTM). I am trying to train an LDA model based on 200 documents examined by a human and the topics (categories) that were assigned to them. Unfortunately, I can't share a reproducible example.

Can someone show how this is done, using one of the freely available data sets as an example?

Sir Oliver
  • As far as I am aware, LDA is an unsupervised machine learning algorithm, so the model does not need to be trained on labels in order to produce outputs. The algorithm looks for structures latent in the corpus to produce topic-word allocations. There are supervised versions of LDA, like the one here: https://www.cs.princeton.edu/~blei/papers/BleiMcAuliffe2007.pdf, but I do not think they are implemented in the topic-modelling package – DotPi Oct 12 '16 at 20:46
  • You are right. When I took the LDA approach, the optimal number of topics was 2–3, which is also shown by the elbow curve for k-means clustering. That is much less than what the human examination produced. What are my alternatives? – Sir Oliver Oct 12 '16 at 20:49
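To illustrate the unsupervised behaviour described in the comments, here is a minimal sketch using scikit-learn's `LatentDirichletAllocation` (an assumption on my part — the question uses R's tm package, but the mechanics are the same): LDA receives only a document-term matrix and infers topic distributions without ever seeing labels. The tiny corpus below is hypothetical stand-in data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical stand-in corpus; real use would have ~1,400 cleaned documents.
docs = [
    "stock market trading prices shares",
    "market prices rise as trading volume grows",
    "football match goal score team",
    "the team won the match with a late goal",
]

# Build the document-term matrix (the DTM step done with tm in R).
dtm = CountVectorizer().fit_transform(docs)

# Fit LDA with k=2 topics; note that no labels are supplied anywhere.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(dtm)

print(doc_topics.shape)  # one topic distribution per document: (4, 2)
```

Because no labels enter the model, there is no way to force its topics to match human-assigned categories — which is exactly why the optimal k found by the model can disagree with the annotation.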

1 Answer


If you have annotated training data, why not use supervised classification techniques like SVM or logistic regression, which are quite good for text classification tasks? Scikit-learn in Python has implementations of these classifiers, and you can use them directly for classification.
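A minimal sketch of the suggestion above, with hypothetical stand-in data in place of the ~200 human-annotated documents: TF-IDF features feed a logistic regression classifier inside a scikit-learn `Pipeline`, which can then label the remaining documents.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical stand-in for the human-annotated training documents.
train_docs = [
    "stock market trading prices shares",
    "market prices rise as trading volume grows",
    "football match goal score team",
    "the team won the match with a late goal",
]
train_labels = ["finance", "finance", "sports", "sports"]

# TF-IDF vectorisation plus logistic regression in one pipeline;
# swapping in an SVM would mean replacing LogisticRegression with LinearSVC.
clf = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("logreg", LogisticRegression()),
])
clf.fit(train_docs, train_labels)

# Predict categories for the remaining unlabelled documents.
print(clf.predict(["shares and prices on the market"]))
```

Unlike LDA, this approach uses the human-assigned categories directly, so the number and meaning of the classes is fixed by the annotation rather than inferred from the corpus.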

Wasi Ahmad