
I have been running a series of topic modeling experiments in Spark, varying the number of topics. So, given an RDD docsWithFeatures, I'm doing something like this:

for (n_topics <- Range(65, 301, 5)) {
    val s = n_topics.toString
    val lda = new LDA().setK(n_topics).setMaxIterations(20) // could also .setAlpha(), .setBeta()
    val ldaModel = lda.run(docsWithFeatures)
    // now do some eval, save results to file, etc...
}

This has been working great, but I also want to compare results when I first normalize my data with TF-IDF. Now, to the best of my knowledge, LDA strictly expects a bag-of-words format where term frequencies are integer values. But in principle (and I've seen plenty of examples of this), the math works out fine if we first convert integer term frequencies to float TF-IDF values. My current approach (again given my docsWithFeatures RDD) is the following:

val index_reset = docsWithFeatures.map(_._2).cache() // drop the doc IDs, keep just the feature vectors
val idf = new IDF().fit(index_reset)                 // learn IDF weights over the corpus
val tfidf = idf.transform(index_reset).zipWithIndex.map(x => (x._2, x._1)) // re-attach indices as keys

I can then run the same code as in the first block, substituting tfidf for docsWithFeatures. This works without any crashes, but my main question here is whether this is OK to do. That is, I want to make sure Spark isn't doing anything funky under the hood, like converting the float values coming out of the TF-IDF to integers or something.
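For what it's worth, here is a toy illustration in plain Scala (no Spark; the tiny two-document corpus is made up, but the formula is the smoothed log((n + 1) / (df + 1)) that mllib's IDF uses) showing why the transformed values are generally fractional doubles rather than integers:

```scala
// Two toy documents as term -> count maps (illustrative only)
val docs = Seq(
  Map("apple" -> 2, "banana" -> 1),
  Map("apple" -> 1, "cherry" -> 3)
)
val n = docs.size.toDouble

// document frequency of each term
val df = docs.flatMap(_.keys).groupBy(identity).map { case (t, ds) => t -> ds.size }

// smoothed IDF, matching Spark mllib's IDF: log((n + 1) / (df + 1))
val idf = df.map { case (t, d) => t -> math.log((n + 1) / (d + 1)) }

// TF-IDF: integer counts become fractional doubles
val tfidf = docs.map(_.map { case (t, tf) => t -> tf * idf(t) })
```

Here "banana" appears in one of two documents, so its weight is 1 × log(3/2) ≈ 0.405, while "apple" appears in every document and gets weight 0 — neither is an integer count in the bag-of-words sense, which is exactly what the question is about.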

moustachio
  • I don't believe the Spark code is converting the weights to integers at any point. That said, I have a hard time seeing how using weights obtained from TF-IDF makes much sense given the probabilistic model underlying LDA. – Jason Scott Lenderman Dec 13 '15 at 07:17
  • Well, I guess it depends on how it's implemented. I've been doing some poking around, and there does appear to be some precedent for normalizing term frequencies with TF-IDF before applying LDA (hence my question). – moustachio Dec 13 '15 at 16:51
  • Do you have any references or examples? Certainly there are interpretations of non-integer weights to which LDA can be sensibly generalized (and many implementations will handle this by default), but I'm not seeing it for weights obtained via TF-IDF. – Jason Scott Lenderman Dec 13 '15 at 21:08
  • I haven't found many formal references, but there seem to be people trying it with tools like Gensim (e.g. [here](https://stackoverflow.com/questions/27147690/should-i-use-tfidf-corpus-or-just-corpus-to-inference-documents-using-lda) and [here](https://groups.google.com/forum/#!topic/gensim/OESG1jcaXaQ)) – moustachio Dec 14 '15 at 17:02
  • But if you're saying that we can sensibly generalize LDA to non-integer weights, what is special or problematic about TF-IDF? Essentially I'm thinking of using it as a reweighting scheme to penalize particularly common words in a better way than stopwording (because I am interested in cases where common words are *especially* common in a given document, something stopwording won't get you). – moustachio Dec 14 '15 at 17:04
  • LDA will not have a sensible probabilistic interpretation for any possible way of obtaining non-integer token counts, even if the algorithm still terminates and gives a result. – Jason Scott Lenderman Dec 14 '15 at 20:09
  • By the way, are you using EMLDAOptimizer or OnlineLDAOptimizer? I suggest using the latter, since it optimizes the parameters of the prior for the per-document topic mixing weights (while EMLDAOptimizer does not.) But even then the Dirichlet prior might be too restrictive, resulting in less than stellar looking topics. – Jason Scott Lenderman Dec 14 '15 at 20:30
  • I'm actually not sure....which is the default? I'm not manually specifying the optimizer beyond what you see above – moustachio Dec 14 '15 at 20:31
  • The default, I believe, is `EMLDAOptimizer`. You might find that the topics you get from using `OnlineLDAOptimizer` are more reasonable looking. See this paper: http://papers.nips.cc/paper/3854-rethinking-lda-why-priors-matter . – Jason Scott Lenderman Dec 15 '15 at 04:14
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/98003/discussion-between-moustachio-and-jason-lenderman). – moustachio Dec 15 '15 at 16:37

0 Answers