
I am trying to find out the best way to fit different topic models (like Latent Dirichlet Allocation, Non-negative Matrix Factorization, etc.) in scikit-learn (Python).

Looking at the example in the sklearn documentation, I was wondering why the LDA model is fit on a TF array, while the NMF model is fit on a TF-IDF array. Is there a precise reason for this choice?

Here is the example: http://scikit-learn.org/stable/auto_examples/applications/topics_extraction_with_nmf_lda.html#sphx-glr-auto-examples-applications-topics-extraction-with-nmf-lda-py

Also, any tips on how to find the best parameters (number of iterations, number of topics, ...) for fitting my models are welcome.

Thank you in advance.

Luca P.
  • Just a comment on parameter optimization: you should check resources on meta-optimization techniques; for example, applying Genetic Algorithms or PSO (Particle Swarm Optimization) to your setup can find good parameter values. Meta-optimization is a quick and efficient way to traverse the search space of possible parameter combinations. – rpd Oct 21 '16 at 08:00

1 Answer


To make the answer clear, one must first examine the definitions of the two models.

LDA is a probabilistic generative model that generates documents by sampling a topic for each word position and then sampling a word from that topic. The generated document is represented as a bag of words.

NMF is, in its general definition, the search for two matrices W and H such that W*H ≈ V, where V is an observed matrix. The only requirement on those matrices is that all their elements must be non-negative.
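As a quick illustration, here is a minimal sketch using scikit-learn's NMF on a small made-up non-negative matrix (the shapes and values are arbitrary):

```python
import numpy as np
from sklearn.decomposition import NMF

# V: a small, made-up non-negative "documents x terms" matrix
V = np.array([[1.0, 0.5, 0.0],
              [0.0, 1.2, 0.8],
              [2.0, 0.0, 0.3]])

# Factorize V ~ W @ H with 2 components; all entries of W and H are >= 0
model = NMF(n_components=2, init="nndsvda", random_state=0)
W = model.fit_transform(V)  # shape (3, 2)
H = model.components_       # shape (2, 3)

print(np.round(W @ H, 2))   # approximately reconstructs V
```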

From the above definitions it is clear that LDA can use only bag-of-words frequency counts, since a vector of reals makes no sense: did we generate a word 1.2 times? On the other hand, any non-negative representation can be used for NMF, and in the example tf-idf is used.
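That is exactly what the linked example does: raw term counts go into LDA, tf-idf weights go into NMF. A condensed sketch of that setup (the corpus here is a placeholder, and note that in older scikit-learn versions the LDA argument is n_topics rather than n_components):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation, NMF

corpus = ["the cat sat on the mat",
          "dogs and cats are pets",
          "the dog barked at the cat"]  # placeholder documents

# LDA: integer bag-of-words counts
tf = CountVectorizer().fit_transform(corpus)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(tf)

# NMF: any non-negative matrix works, here tf-idf weights
tfidf = TfidfVectorizer().fit_transform(corpus)
nmf = NMF(n_components=2, random_state=0).fit(tfidf)
```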

As far as choosing the number of iterations: for NMF in scikit-learn I don't know the exact stopping criterion, although I believe it is the relative improvement of the loss function falling below a threshold, so you'll have to experiment. For LDA I suggest manually checking the improvement of the log-likelihood on a held-out validation set, and stopping when it falls below a threshold.
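One way to implement that check is sketched below; it assumes tf_train and tf_valid are count matrices you have already split, and uses LatentDirichletAllocation's partial_fit and perplexity methods (lower perplexity corresponds to higher held-out log-likelihood):

```python
from sklearn.decomposition import LatentDirichletAllocation

lda = LatentDirichletAllocation(n_components=10, random_state=0)

prev = float("inf")
for epoch in range(50):
    lda.partial_fit(tf_train)        # one online pass over the training counts
    perp = lda.perplexity(tf_valid)  # held-out perplexity (lower is better)
    if prev - perp < 1e-2 * prev:    # stop once the relative improvement is tiny
        break
    prev = perp
```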

The rest of the parameters depend heavily on the data, so I suggest, as @rpd recommended, that you do a parameter search.
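A plain grid search is a reasonable first pass before reaching for metaheuristics. Here is a sketch using GridSearchCV (LDA's score() returns an approximate log-likelihood, which GridSearchCV maximizes; tf is the count matrix from the earlier sketch):

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_components": [5, 10, 20],        # number of topics
    "learning_decay": [0.5, 0.7, 0.9],  # decay rate for the online updates
}
search = GridSearchCV(LatentDirichletAllocation(random_state=0), param_grid, cv=3)
search.fit(tf)  # tf: a document-term count matrix
print(search.best_params_)
```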

So to sum up: LDA can only model frequency counts, while NMF can factorize any non-negative matrix.

katharas
  • Thank you very much @katharas! As far as tuning the other parameters, I was using perplexity as a measure of the goodness of fit of my model, trying to estimate the parameters (alpha, tau_0 and batch size) that give me the lowest perplexity. Do you think this is also a good way of evaluating the parameters for LDA? – Luca P. Oct 25 '16 at 10:18
  • Yes, perplexity is fine (maybe better, actually). Essentially, perplexity is the exponentiated negative average per-word log-likelihood. It is used so that one can compare values across different documents and corpora. – katharas Oct 25 '16 at 16:49
  • Thanks for the detailed explanation ;) – Luca P. Oct 26 '16 at 09:25