
I am fitting an LDA model in Spark MLlib, using the OnlineLDAOptimizer. It takes only ~200 seconds to fit 10 topics on 9M documents (tweets).

import org.apache.spark.mllib.clustering.{LDA, OnlineLDAOptimizer}

val numTopics = 10
// mbf (the mini-batch fraction) and countVectors (the term-count RDD) are defined earlier
val lda = new LDA()
  .setOptimizer(new OnlineLDAOptimizer().setMiniBatchFraction(math.min(1.0, mbf)))
  .setK(numTopics)
  .setMaxIterations(2)
  .setDocConcentration(-1) // use default symmetric document-topic prior
  .setTopicConcentration(-1) // use default symmetric topic-word prior
val startTime = System.nanoTime()
val ldaModel = lda.run(countVectors)
val elapsed = (System.nanoTime() - startTime) / 1e9

/**
 * Print results
 */
// Print training time
println(s"Finished training LDA model.  Summary:")
println(s"Training time (sec)\t$elapsed")
println(s"==========")

numTopics: Int = 10
lda: org.apache.spark.mllib.clustering.LDA = org.apache.spark.mllib.clustering.LDA@72678a91
startTime: Long = 11889875112618
ldaModel: org.apache.spark.mllib.clustering.LDAModel = org.apache.spark.mllib.clustering.LocalLDAModel@351e2b4c
Finished training LDA model.  Summary:
Training time (sec) 202.640775542

However, when I request the log perplexity of this model (it looks like I need to cast it to LocalLDAModel first), it takes a very long time to evaluate. Why? (I'm trying to get the log perplexity so I can optimize k, the number of topics.)

ldaModel.asInstanceOf[LocalLDAModel].logPerplexity(countVectors)
res95: Double = 7.006006572908673
Took 1212 seconds.
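Since logPerplexity is so expensive on the full corpus, one workaround during model selection is to evaluate it on a random sample of documents. In Spark's LocalLDAModel, logPerplexity is the negated variational log-likelihood bound divided by the total token count, so a sample yields a per-token estimate at a fraction of the cost. A minimal sketch of that relationship (plain Scala; the helper name and example numbers are mine, not Spark's):

```scala
object PerplexitySketch {
  // Spark's LocalLDAModel.logPerplexity(docs) is equivalent to
  // -logLikelihood(docs) / totalTokenCount(docs): an upper bound on
  // per-token perplexity, in nats.
  def logPerplexity(logLikelihoodBound: Double, totalTokens: Long): Double =
    -logLikelihoodBound / totalTokens

  def main(args: Array[String]): Unit = {
    // e.g. a hypothetical bound of -7.0e8 nats over 1.0e8 tokens
    // gives a log-perplexity of 7.0, close to the value reported above
    println(logPerplexity(-7.0e8, 100000000L))
  }
}
```

Evaluating this on, say, a 1% sample of the 9M documents should track the full-corpus value closely enough to compare different k.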
zero323

1 Answer


LDAModels learned with the online optimizer are of type LocalLDAModel anyway, so no conversion is happening; the asInstanceOf is just a cast. I computed perplexity on both local and distributed models, and both take quite some time. Looking at the code, they make nested map calls over the whole dataset.

Calling:

docBound += count * LDAUtils.logSumExp(Elogthetad + localElogbeta(idx, ::).t)

once per nonzero bag-of-words entry for each of the ~9M documents can take quite some time. The code is at https://github.com/apache/spark/blob/v1.6.1/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAModel.scala, line 312.
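That line executes once for every nonzero (document, term) pair, so over 9M documents it adds up to billions of exp/log calls. For reference, here is a standalone sketch of the log-sum-exp it computes (plain Scala over arrays; Spark's LDAUtils version operates on Breeze vectors):

```scala
object LogSumExpSketch {
  // Numerically stable log(sum(exp(x_i))): subtract the max before
  // exponentiating so no term overflows.
  def logSumExp(xs: Array[Double]): Double = {
    val m = xs.max
    m + math.log(xs.map(x => math.exp(x - m)).sum)
  }

  def main(args: Array[String]): Unit = {
    // log(1 + 2 + 3) recovered from the logs of 1, 2, 3
    println(logSumExp(Array(math.log(1.0), math.log(2.0), math.log(3.0))))
  }
}
```

Each call is cheap on its own; it is the sheer number of calls across the corpus that dominates the 1212 seconds.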

Training LDA is fast in your case because you train for just 2 iterations, and each mini-batch update only touches a fraction (mbf) of the 9M documents; evaluating perplexity afterwards touches all of them.

Btw. the default for docConcentration is Vectors.dense(-1), not just an Int.

Btw. number 2: Thanks for this question. I had trouble running my algorithm on a cluster simply because it included this expensive perplexity calculation, and I didn't know it caused so much trouble.

Timomo