
I'm following the 'English Wikipedia' gensim tutorial at https://radimrehurek.com/gensim/wiki.html#latent-dirichlet-allocation

where it explains that tf-idf is used during training (at least for LSA; it's not so clear for LDA).

I expected to apply a tf-idf transformation to new documents as well, but instead, at the end of the tutorial, it suggests simply passing in a bag-of-words vector:

doc_lda = lda[doc_bow]

Does LDA require bag-of-words vectors only?

Luke W
    Related to https://stackoverflow.com/questions/25915441/term-weighting-for-original-lda-in-gensim, but I'm not sure what 'original' LDA means. – Luke W Jun 27 '17 at 13:08

2 Answers


TL;DR: Yes, LDA only needs a bag-of-words vector.

Indeed, in the Wikipedia example of the gensim tutorial, Radim Rehurek uses the TF-IDF corpus generated in the preprocessing step:

mm = gensim.corpora.MmCorpus('wiki_en_tfidf.mm')

I believe the reason for this is simply that this matrix is sparse and easy to handle (and already exists anyway as a by-product of the preprocessing step).
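For reference, the training step in that tutorial looks roughly like this (a sketch; the file names are the ones produced by the tutorial's preprocessing step):

import gensim

# id->word mapping and sparse (TF-IDF weighted) corpus from the preprocessing step
id2word = gensim.corpora.Dictionary.load_from_text('wiki_en_wordids.txt')
mm = gensim.corpora.MmCorpus('wiki_en_tfidf.mm')

# train LDA directly on that corpus
lda = gensim.models.ldamodel.LdaModel(corpus=mm, id2word=id2word, num_topics=100, update_every=1, chunksize=10000, passes=1)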

LDA does not necessarily need to be trained on a TF-IDF corpus. The model works just fine if you use the corpus shown in the gensim tutorial Corpora and Vector Spaces:

from gensim import corpora, models
texts = [['human', 'interface', 'computer'],
         ['survey', 'user', 'computer', 'system', 'response', 'time'],
         ['eps', 'user', 'interface', 'system'],
         ['system', 'human', 'system', 'eps'],
         ['user', 'response', 'time'],
         ['trees'],
         ['graph', 'trees'],
         ['graph', 'minors', 'trees'],
         ['graph', 'minors', 'survey']]

dictionary = corpora.Dictionary(texts)

corpus = [dictionary.doc2bow(text) for text in texts]

lda = models.ldamodel.LdaModel(corpus=corpus, id2word=dictionary, num_topics=10, update_every=1, chunksize=10000, passes=1)

Notice that corpus here is a plain bag-of-words representation (a list of (token_id, count) pairs per document). As you correctly pointed out, that is the centerpiece of the LDA model; TF-IDF plays no role in it at all.
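To come back to the question: inference on a new, unseen document also takes a plain bag-of-words vector, built with the same dictionary that was used for training (a minimal sketch with a made-up document):

# convert the new document to bag-of-words with the training dictionary
doc_bow = dictionary.doc2bow("human computer interaction".lower().split())
doc_lda = lda[doc_bow]  # list of (topic_id, probability) pairs
print(doc_lda)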

In fact, Blei (who developed LDA) points out in the introduction of his 2003 paper "Latent Dirichlet Allocation" that LDA addresses the shortcomings of the TF-IDF model and leaves that approach behind. LSA is completely algebraic and generally (but not necessarily) uses a TF-IDF matrix, while LDA is a probabilistic model that tries to estimate probability distributions for topics in documents and for words in topics. TF-IDF weighting is not necessary for this.
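To make the contrast concrete, here is a minimal sketch of the two pipelines in gensim (reusing corpus and dictionary from above; the parameters are arbitrary):

# LSA/LSI: algebraic, typically fed a TF-IDF weighted corpus
tfidf = models.TfidfModel(corpus)
lsi = models.LsiModel(tfidf[corpus], id2word=dictionary, num_topics=2)

# LDA: probabilistic, fed the raw bag-of-words counts directly
lda_bow = models.LdaModel(corpus, id2word=dictionary, num_topics=2)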

Jérôme Bau

Not to disagree with Jérôme's answer, but tf-idf is used in latent Dirichlet allocation to some extent. As can be read in the paper Topic Models by Blei and Lafferty (e.g. p. 6, "Visualizing topics", and p. 12), the tf-idf score can be very useful for LDA. It can be used to visualize topics or to choose the vocabulary: "It is often computationally expensive to use the entire vocabulary. Choosing the top V words by TFIDF is an effective way to prune the vocabulary."
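A minimal sketch of that pruning idea in gensim (the tiny corpus, the value of V, and the ranking-by-maximum-weight heuristic are made up for illustration; Blei and Lafferty don't prescribe a specific implementation):

from gensim import corpora, models

texts = [['human', 'interface', 'computer'],
         ['survey', 'user', 'computer', 'system', 'response', 'time'],
         ['graph', 'minors', 'survey']]

dictionary = corpora.Dictionary(texts)
bow_corpus = [dictionary.doc2bow(text) for text in texts]

# rank each term by the highest TF-IDF weight it reaches in any document
tfidf = models.TfidfModel(bow_corpus)
max_weight = {}
for doc in tfidf[bow_corpus]:
    for term_id, weight in doc:
        max_weight[term_id] = max(weight, max_weight.get(term_id, 0.0))

# keep only the top-V terms, then rebuild a plain bag-of-words corpus
V = 5
top_ids = sorted(max_weight, key=max_weight.get, reverse=True)[:V]
dictionary.filter_tokens(good_ids=top_ids)
pruned_corpus = [dictionary.doc2bow(text) for text in texts]

lda = models.LdaModel(pruned_corpus, id2word=dictionary, num_topics=2)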

That said, LDA does not need tf-idf to infer topics, but it can be useful and can improve your results.

bbrinx
    I agree! I tried LDA with both BOW and TF-IDF on 100K documents, and the results improved a little and the topics made more sense for my data when using TF-IDF. Will dig more and share my findings! – satish silveri Oct 01 '19 at 09:01