
I use gensim's LDA topic modelling to find topics for each document and to check the similarity between documents by comparing their topic vectors. Each document is assigned a different number of matching topics, so comparing the vectors by cosine similarity fails, because vectors of equal length are required.

This is the related code:

from gensim import models

lda_model_bow = models.LdaModel(corpus=bow_corpus, id2word=dictionary, num_topics=3, passes=1, random_state=47)

#---------------Calculating and Viewing the topics----------------------------
vec_bows = [dictionary.doc2bow(filtered_text.split()) for filtered_text in filtered_texts]

vec_lda_topics = [lda_model_bow[vec_bow] for vec_bow in vec_bows]

for doc_id, vec_lda_topic in enumerate(vec_lda_topics):
    print('document', doc_id, 'topics:', vec_lda_topic)

The output vectors are:

document  0 topics:  [(1, 0.25697246), (2, 0.08026043), (3, 0.65391296)]
document  1 topics:  [(2, 0.93666667)]
document  2 topics:  [(2, 0.07910537), (3, 0.20132676)]
.....

As you can see, each vector has a different length, so cosine similarity cannot be computed between them.

I would like the output to be:

document  0 topics:  [(1, 0.25697246), (2, 0.08026043), (3, 0.65391296)]
document  1 topics:  [(1, 0.0), (2, 0.93666667), (3, 0.0)]
document  2 topics:  [(1, 0.0), (2, 0.07910537), (3, 0.20132676)]
.....

Any ideas how to do it? Thanks.

Matan

2 Answers


I have used gensim for topic modelling before and have not faced this issue. If you pass num_topics=3, it should return the 3 topics with the highest probability for each document, and you should then be able to generate the cosine similarity matrix by doing something like this:

from gensim import models, similarities

lda_model_bow = models.LdaModel(corpus=bow_corpus, id2word=dictionary, num_topics=3, passes=1, random_state=47)
vec_lda_topics = lda_model_bow[bow_corpus]
sim_matrix = similarities.MatrixSimilarity(vec_lda_topics)

But if, for some reason, you are getting an unequal number of topics, you can assume a zero probability for the missing topics and pad each vector to the full length before calculating similarity.
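That zero-padding idea can be sketched in plain Python (gensim's `matutils.sparse2full` does the same job); the helper name `pad_topics` is my own, and it assumes gensim's usual 0-based topic ids:

```python
def pad_topics(sparse_topics, num_topics):
    """Convert a sparse [(topic_id, prob), ...] list into a dense
    fixed-length vector, filling missing topics with 0.0."""
    dense = [0.0] * num_topics
    for topic_id, prob in sparse_topics:
        dense[topic_id] = prob
    return dense

# Sparse LDA output for one document: only 1 of 3 topics survived filtering
doc_topics = [(1, 0.93666667)]
print(pad_topics(doc_topics, 3))  # [0.0, 0.93666667, 0.0]
```

Once every document is padded this way, all vectors have length `num_topics` and can be compared directly.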

P.S.: If you could provide a sample of your input documents, it would be easier to reproduce your output and look into it.

panktijk
  • According to the documentation of gensim.ldamodel: `num_topics (int, optional) – The number of requested latent topics to be extracted from the training corpus.` If so, this is the total number of topics to which I want to divide the text rather than the top 3. – Matan Nov 21 '18 at 17:35
  • 1
    Did you also check this parameter: `minimum_probability (float, optional)` – Topics with a probability lower than this threshold will be filtered out? Its default value is `0.01`. It's possible that some of your topics are getting filtered out because of low probability. – panktijk Nov 21 '18 at 18:21
  • I just found this solution and wanted to update here. That's exactly the solution I was looking for. Thank you – Matan Nov 21 '18 at 18:36

So, as panktijk says in the comment, and also per this topic, the solution is to change `minimum_probability` from its default value of `0.01` to `0.0`, so that no topics are filtered out and every document gets a full-length topic vector.
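With `minimum_probability=0.0`, every document vector has exactly `num_topics` entries, so cosine similarity becomes straightforward. A minimal pure-Python sketch, using the dense vectors from the desired output above (documents 0 and 1):

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

doc0 = [0.25697246, 0.08026043, 0.65391296]
doc1 = [0.0, 0.93666667, 0.0]
print(cosine_similarity(doc0, doc1))
```

In practice you would let gensim's `similarities.MatrixSimilarity` build the whole pairwise matrix for you, as shown in the first answer.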

Matan