
I am working on a project where I need to apply topic modelling to a set of documents, and I need to create the following matrix:

DT, a D × T matrix, where D is the number of documents and T is the number of topics. DT(ij) contains the number of times a word in document Di has been assigned to topic Tj.

So far I have followed this tutorial: https://rstudio-pubs-static.s3.amazonaws.com/79360_850b2a69980c4488b1db95987a24867a.html

I am new to gensim. So far I have:

1. created a document list,
2. preprocessed and tokenized the documents,
3. used corpora.Dictionary() to create the id -> term dictionary (id2word),
4. converted the tokenized documents into a document-term matrix, and
5. generated an LDA model.

So now I get the topics.

How can I now get the matrix I mentioned above? I will be using this matrix to calculate the similarity between two documents on topic t as:

sim(a, b) = 1 - |DT(a, t) - DT(b, t)|
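
Given such a matrix, what I want to compute would look roughly like this (a sketch with placeholder names, just to illustrate the formula):

# Similarity of documents a and b on topic t, given the DT matrix
# (rows = documents, columns = topics), per the formula above.
def topic_similarity(DT, a, b, t):
    return 1 - abs(DT[a][t] - DT[b][t])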

swati saoji

3 Answers


There is an implementation in the pyLDAvis source code that returns lists which may be helpful for building the matrix you are interested in.

Snippet from the _extract_data function in pyLDAvis's gensim.py:

def _extract_data(topic_model, corpus, dictionary, doc_topic_dists=None):
    ...
    return {'topic_term_dists': topic_term_dists, 'doc_topic_dists': doc_topic_dists,
            'doc_lengths': doc_lengths, 'vocab': vocab, 'term_frequency': term_freqs}

The number of topics for your model is fixed. You're probably interested in the document-topic distribution; in that case, the D×T matrix you want can be built from doc_topic_dists (one row per document, one column per topic) scaled by doc_lengths.
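
For example, a minimal sketch of that scaling with numpy, assuming data is the dict returned by _extract_data above (the variable name is mine):

import numpy as np

# `data` is assumed to be the dict returned by _extract_data above.
doc_topic_dists = np.asarray(data['doc_topic_dists'])  # shape (D, T); rows sum to 1
doc_lengths = np.asarray(data['doc_lengths'])          # shape (D,); tokens per document

# Expected number of words in document i assigned to topic j:
# scale each document's topic distribution by its length.
DT = doc_topic_dists * doc_lengths[:, np.newaxis]      # shape (D, T)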

Kenneth Orton
    This method has now moved to the `gensim_models.py` script and is no longer published [at their docs](https://pyldavis.readthedocs.io/en/latest/_modules/index.html) – Bilbottom Apr 24 '21 at 17:38

Showing your code would be helpful, but going off the example in the tutorial you linked, the model is created by:

ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=20)

You could then add something like this to your script:

model_name = "name_of_my_model"
ldamodel.save(model_name)

When you run the script, this will save the model in the directory the script is run from.

You can then get the topic probability distribution for a document with:

print(ldamodel[doc_bow])
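
Note that ldamodel[doc_bow] returns a list of (topic_id, probability) pairs, omitting topics whose probability falls below the model's minimum_probability threshold.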

If you want to compute similarity against this model, you need to transform the query document into the same LDA space, too, and then take the cosine similarity between the two:

from gensim import corpora, models, similarities

dictionary = corpora.Dictionary.load('dictionary.dict')
corpus = corpora.MmCorpus("corpus.mm")
lda = models.LdaModel.load("name_of_my_model.lda")

# Build a similarity index over the LDA representation of the corpus
index = similarities.MatrixSimilarity(lda[corpus])
index.save("simIndex.index")

docname = "docs/the_doc.txt"
with open(docname, 'r') as f:
    doc = f.read()
vec_bow = dictionary.doc2bow(doc.lower().split())
vec_lda = lda[vec_bow]

# Query the index and sort documents by descending similarity
sims = index[vec_lda]
sims = sorted(enumerate(sims), key=lambda item: -item[1])
print(sims)
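
sims is then a list of (document index, cosine similarity) pairs, sorted with the most similar documents first.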
tadamhicks
  • Thanks for answering the question. I get the process you described for finding the similarity. But I am actually trying to implement a similarity measure suggested in an article. I actually need this matrix: DT, a D × T matrix, where D is the number of documents and T is the number of topics. DT(ij) contains the number of times a word in document Di has been assigned to topic Tj. These are the same documents that are used to generate the LDA model. – swati saoji Mar 24 '16 at 05:05

Supposing you have a saved LDA model and that df is a list of lists of tokens:

import gensim

model = gensim.models.LdaModel.load('C:/model_location/model')
id2word = gensim.corpora.Dictionary.load('C:/model_location/model.id2word')
corpus = [id2word.doc2bow(text) for text in df]

# One row per document, one column per topic; minimum_probability=0
# forces every topic to appear, so all rows have the same length T.
document_topic_matrix = [
    [prob for _, prob in model.get_document_topics(doc, minimum_probability=0)]
    for doc in corpus
]
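
Each row of document_topic_matrix is that document's probability distribution over all T topics. If you want the count-style DT matrix from the question (how many words of document Di were assigned to topic Tj), one rough approximation is to scale each row by the document's token count, e.g.:

# Approximate DT(i, j) as (tokens in document i) * P(topic j | document i).
doc_lengths = [sum(count for _, count in bow) for bow in corpus]
DT = [
    [prob * n_tokens for prob in row]
    for row, n_tokens in zip(document_topic_matrix, doc_lengths)
]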