
I am fitting an LDA model in Spark MLlib, using the OnlineLDAOptimizer. It takes only ~200 seconds to fit 10 topics on 9M documents (tweets).

import org.apache.spark.mllib.clustering.{LDA, OnlineLDAOptimizer}

val numTopics = 10
// mbf (the mini-batch fraction) and countVectors (the term-count RDD) are defined earlier
val lda = new LDA()
  .setOptimizer(new OnlineLDAOptimizer().setMiniBatchFraction(math.min(1.0, mbf)))
  .setK(numTopics)
  .setMaxIterations(2)
  .setDocConcentration(-1) // use default symmetric document-topic prior
  .setTopicConcentration(-1) // use default symmetric topic-word prior
val startTime = System.nanoTime()
val ldaModel = lda.run(countVectors)
val elapsed = (System.nanoTime() - startTime) / 1e9

/**
 * Print results
 */
// Print training time
println(s"Finished training LDA model.  Summary:")
println(s"Training time (sec)\t$elapsed")
println(s"==========")

numTopics: Int = 10
lda: org.apache.spark.mllib.clustering.LDA = org.apache.spark.mllib.clustering.LDA@72678a91
startTime: Long = 11889875112618
ldaModel: org.apache.spark.mllib.clustering.LDAModel = org.apache.spark.mllib.clustering.LocalLDAModel@351e2b4c
Finished training LDA model.  Summary:
Training time (sec) 202.640775542

However, when I request the log perplexity of this model (it looks like I need to cast it to LocalLDAModel first), it takes a very long time to evaluate. Why? (I'm trying to get the log perplexity so I can optimize k, the number of topics.)

ldaModel.asInstanceOf[LocalLDAModel].logPerplexity(countVectors)
res95: Double = 7.006006572908673
Took 1212 seconds.
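Since logPerplexity is so expensive on the full corpus, one workaround during model selection is to evaluate it on a random sample of documents. In Spark's LocalLDAModel, logPerplexity is the negated variational log-likelihood bound divided by the total token count, so a sample yields a per-token estimate at a fraction of the cost. A minimal sketch of that relationship (plain Scala; the helper name and example numbers are mine, not Spark's):

```scala
object PerplexitySketch {
  // Spark's LocalLDAModel.logPerplexity(docs) is equivalent to
  // -logLikelihood(docs) / totalTokenCount(docs): an upper bound on
  // per-token perplexity, in nats.
  def logPerplexity(logLikelihoodBound: Double, totalTokens: Long): Double =
    -logLikelihoodBound / totalTokens

  def main(args: Array[String]): Unit = {
    // e.g. a hypothetical bound of -7.0e8 nats over 1.0e8 tokens
    // gives a log-perplexity of 7.0, close to the value reported above
    println(logPerplexity(-7.0e8, 100000000L))
  }
}
```

Evaluating this on, say, a 1% sample of the 9M documents should track the full-corpus value closely enough to compare different k.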
zero323

1 Answer


LDAModels learned with the online optimizer are of type LocalLDAModel anyway, so no conversion is happening; the asInstanceOf is just a cast. I computed perplexity on both local and distributed models, and both take quite some time. Looking at the code, they make nested map calls over the whole dataset.

Calling:

docBound += count * LDAUtils.logSumExp(Elogthetad + localElogbeta(idx, ::).t)

once per nonzero bag-of-words entry for each of the ~9M documents can take quite some time. The code is at https://github.com/apache/spark/blob/v1.6.1/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAModel.scala, line 312.
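That line executes once for every nonzero (document, term) pair, so over 9M documents it adds up to billions of exp/log calls. For reference, here is a standalone sketch of the log-sum-exp it computes (plain Scala over arrays; Spark's LDAUtils version operates on Breeze vectors):

```scala
object LogSumExpSketch {
  // Numerically stable log(sum(exp(x_i))): subtract the max before
  // exponentiating so no term overflows.
  def logSumExp(xs: Array[Double]): Double = {
    val m = xs.max
    m + math.log(xs.map(x => math.exp(x - m)).sum)
  }

  def main(args: Array[String]): Unit = {
    // log(1 + 2 + 3) recovered from the logs of 1, 2, 3
    println(logSumExp(Array(math.log(1.0), math.log(2.0), math.log(3.0))))
  }
}
```

Each call is cheap on its own; it is the sheer number of calls across the corpus that dominates the 1212 seconds.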

Training LDA is fast in your case because you train for just 2 iterations, and each mini-batch update only touches a fraction (mbf) of the 9M documents; evaluating perplexity afterwards touches all of them.

Btw. the default for docConcentration is Vectors.dense(-1), not just an Int.

Btw. number 2: Thanks for this question. I had trouble running my algorithm on a cluster simply because it included this expensive perplexity calculation, and I didn't know it caused so much trouble.

Timomo