I am interested in applying LDA topic modelling using Spark MLlib. I have checked the code and the explanations here, but I couldn't find how to then use the model to find the topic distribution of a new, unseen document.

zero323
Rami

1 Answer

As of Spark 1.5, this functionality has not been implemented for DistributedLDAModel. You will need to convert your model to a LocalLDAModel using the toLocal method and then call topicDistributions(documents: RDD[(Long, Vector)]), where documents are the new (i.e. out-of-training) documents, something like this:

val newDocuments: RDD[(Long, Vector)] = ...
val topicDistributions = distLDA.toLocal.topicDistributions(newDocuments)

This is going to be less accurate than the EM algorithm that this paper suggests, but it will work. Alternatively, you could use the new online variational EM training algorithm, which already produces a LocalLDAModel. In addition to being faster, this algorithm is preferable because, unlike the older EM algorithm for fitting DistributedLDAModels, it optimizes the parameters (alphas) of the Dirichlet prior over the per-document topic mixing weights. According to Wallach et al., optimizing the alphas is important for obtaining good topics.
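To make the second route concrete, here is a minimal sketch of training with the online optimizer and then scoring unseen documents. It assumes the RDD-based MLlib API of Spark 1.5 or later; the toy vocabulary, the helper toCountVector, the object name, the document ids, and all parameter values are hypothetical and chosen only for illustration. The important detail is that new documents have to be turned into term-count vectors over the same vocabulary that was used for training.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.{LDA, LocalLDAModel, OnlineLDAOptimizer}
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD

object LdaUnseenDocsSketch {
  // Hypothetical toy vocabulary: a term's index is its position in this array.
  val vocab: Array[String] = Array("spark", "rdd", "topic", "model", "cat", "dog")
  val termIndex: Map[String, Int] = vocab.zipWithIndex.toMap

  // Hypothetical helper: term-count vector over the training vocabulary.
  // Tokens that are not in the vocabulary are dropped.
  def toCountVector(tokens: Seq[String]): Vector = {
    val counts = tokens.flatMap(termIndex.get)
      .groupBy(identity)
      .map { case (idx, hits) => (idx, hits.size.toDouble) }
      .toSeq
    Vectors.sparse(vocab.length, counts)
  }

  def run(sc: SparkContext): Array[(Long, Vector)] = {
    // Toy training corpus as (documentId, termCounts) pairs.
    val trainingDocs = Seq(
      Seq("spark", "rdd", "spark", "model"),
      Seq("topic", "model", "spark"),
      Seq("cat", "dog", "cat"),
      Seq("dog", "dog", "cat")
    ).map(toCountVector).zipWithIndex.map { case (v, i) => (i.toLong, v) }
    val training: RDD[(Long, Vector)] = sc.parallelize(trainingDocs)

    // Online variational optimizer; alpha optimization is switched on explicitly.
    // miniBatchFraction = 1.0 only because this toy corpus is tiny.
    val optimizer = new OnlineLDAOptimizer()
      .setMiniBatchFraction(1.0)
      .setOptimizeDocConcentration(true)

    // The online optimizer yields a LocalLDAModel, so no toLocal call is needed.
    val model = new LDA()
      .setK(2)
      .setMaxIterations(20)
      .setOptimizer(optimizer)
      .run(training)
      .asInstanceOf[LocalLDAModel]

    // Unseen documents, vectorized with the same vocabulary as the training set.
    val newDocuments: RDD[(Long, Vector)] = sc.parallelize(Seq(
      (100L, toCountVector(Seq("spark", "rdd", "rdd"))),
      (101L, toCountVector(Seq("cat", "dog", "unknownword")))
    ))

    // One (documentId, topic weights) pair per new document.
    model.topicDistributions(newDocuments).collect().sortBy(_._1)
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("lda-unseen-docs-sketch"))
    try {
      run(sc).foreach { case (id, dist) =>
        val weights = dist.toArray.map(p => f"$p%.3f").mkString("[", ", ", "]")
        println(s"doc $id: $weights")
      }
    } finally {
      sc.stop()
    }
  }
}
```

Each printed line holds one weight per topic for a document. If you already have a DistributedLDAModel, the same topicDistributions call should apply after converting it with toLocal, as in the snippet above.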

zero323
Jason Scott Lenderman
  • Thanks, the answer is very useful! If possible, could you help elaborate on how to turn the output of topicDistributions into more presentable results? – HappyCoding Feb 18 '16 at 01:57
  • I've implemented this and shown how to print the topicDistributions [here](https://gist.github.com/alex9311/774089d936eee505d7832c6df2eb597d) – alex9311 Jun 08 '16 at 09:05
  • Has anything changed for 1.6? – Evan Zamir Jul 08 '16 at 05:11
  • Is distLDA.toLocal.XXX working at all for python, or just scala? Is only topicDistributions working or are all the other functions working too? – Geoffrey Anderson May 09 '17 at 14:53
  • @alex9311 Despite your code which perhaps proves me wrong, the apache docs v 2.1.0 actually say topicDistribution is missing from the localLDAModel, which makes your code quite interesting! I cannot explain that. "Local LDA model. This model stores only the inferred topics." and "Distributed LDA model. This model stores the inferred topics, the full training dataset, and the topic distributions." – Geoffrey Anderson May 09 '17 at 15:11
  • I am getting an error that the toLocal and topicDistributions attributes are not available on the LDA model. I am implementing the model in Spark 2.1.1 using PySpark. Could you please have a look at my [question](https://stackoverflow.com/questions/55808041/spark-2-1-1-how-to-predict-topics-in-unseen-documents-on-already-trained-lda-mo/55820586#55820586)? – Usman Khan Apr 24 '19 at 04:31