I'm using an Anaconda environment with Python 3.7 and gensim 3.8.0. My data is a dataframe that I split into a training set and a test set; both have this structure:

Format of the X_train and X_test dataframes:

        id                                            alltext  
1710  3264537  [exmodelo, karen, mcdougal, asegura, mantuvo, ...   
8211  3272079  [grupo, socialista, pionero, supone, apoyar, n...   
1885  3263933  [parte, entrenador, zaragoza, javier, aguirre,...   
2481  3263744  [fans, hielo, fuego, saga, literaria, dio, pie...   
2975  3265302  [actividad, busca, repetir, tres, ediciones, a... 

The text is already preprocessed.

This is the code I use to create my model:

import gensim
from gensim import corpora, similarities
from pprint import pprint

id2word = corpora.Dictionary(X_train["alltext"])
texts = X_train["alltext"]
corpus = [id2word.doc2bow(text) for text in texts]

lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                       id2word=id2word,
                                       num_topics=20,
                                       random_state=100, 
                                       update_every=1, 
                                       chunksize=400, 
                                       passes=10, 
                                       alpha='auto',
                                       per_word_topics=True)

Until here, everything works fine. I can effectively use

pprint(lda_model.print_topics())
doc_lda = lda_model[corpus]

to get my topics.

The problem comes when I try to compute the similarity between a new document and the corpus. Here is the code I'm using:

newddoc = X_test["alltext"][2730]  # pick one document from the test set
new_doc_freq_vector = id2word.doc2bow(newddoc)  # convert its word list to a bag-of-words vector
model_vec = lda_model[new_doc_freq_vector]  # run the trained model on it
index = similarities.MatrixSimilarity(lda_model[corpus])  # error
sims = index[model_vec]  # error

In the last two lines, I get this error:

-------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-110-352248c464f8> in <module>
      4 
      5 #index = Similarity('model/indexes/similarity_index_01', lda_model[corpus], num_features=len(id2word)) #the first argument, the place where the
----> 6 index = similarities.MatrixSimilarity(lda_model[corpus]) # works if we use just corpus instead of lda_model[corpus]
      7 index = similarities.MatrixSimilarity(model_vec)
      8 #sims = index[model_vec] #works if we use index[new_doc_freq_vector] instead of model_vec

~\AppData\Local\Continuum\anaconda3\envs\lda_henneo_01\lib\site-packages\gensim\similarities\docsim.py in __init__(self, corpus, num_best, dtype, num_features, chunksize, corpus_len)
    776                 "scanning corpus to determine the number of features (consider setting `num_features` explicitly)"
    777             )
--> 778             num_features = 1 + utils.get_max_id(corpus)
    779 
    780         self.num_features = num_features

~\AppData\Local\Continuum\anaconda3\envs\lda_henneo_01\lib\site-packages\gensim\utils.py in get_max_id(corpus)
    734     for document in corpus:
    735         if document:
--> 736             maxid = max(maxid, max(fieldid for fieldid, _ in document))
    737     return maxid
    738 

~\AppData\Local\Continuum\anaconda3\envs\lda_henneo_01\lib\site-packages\gensim\utils.py in <genexpr>(.0)
    734     for document in corpus:
    735         if document:
--> 736             maxid = max(maxid, max(fieldid for fieldid, _ in document))
    737     return maxid
    738 

ValueError: too many values to unpack (expected 2)

I have no idea how to solve this and have been trying to debug it for 3 hours now. I believe I followed the same code many other people use for getting similarity.

Things I have tried to solve this:

1) Using

Similarity('model/indexes/similarity_index_01', lda_model[corpus], num_features=len(id2word)).

But it did not work; I got the same error.

2) If I replace lda_model[corpus] with corpus, and index[model_vec] with index[new_doc_freq_vector], similarities.MatrixSimilarity() works. But I don't think that gives the proper result, because the index then contains no information from the model. The fact that it works suggests the problem is related to data types (?). If I print lda_model[corpus] I get

<gensim.interfaces.TransformedCorpus object at 0x00000221ECA8E148>

I have no idea what this means, though.

brandata

2 Answers

From here: https://radimrehurek.com/gensim/similarities/docsim.html#gensim.similarities.docsim.MatrixSimilarity

MatrixSimilarity is typically called with two arguments, the corpus and num_features:

# num_features (int) – Size of the dictionary (number of features).
MatrixSimilarity(common_corpus, num_features=len(common_dictionary))
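A hedged aside on what num_features means here: MatrixSimilarity stores each document as a dense row of num_features values and compares rows by cosine similarity. If the index is built over lda_model[corpus] instead of the raw bag-of-words corpus, each vector has one entry per topic, so the width would be the number of topics rather than the dictionary size. The sketch below imitates that computation in plain Python (no gensim calls); NUM_TOPICS, to_dense, cosine and all the numbers are made up for illustration.

```python
import math

NUM_TOPICS = 4  # assumption for illustration; the question uses num_topics=20


def to_dense(sparse_vec, num_features):
    """Expand a list of (feature_id, weight) pairs into a fixed-width list."""
    dense = [0.0] * num_features
    for feature_id, weight in sparse_vec:
        dense[feature_id] = weight
    return dense


def cosine(a, b):
    """Cosine similarity of two equal-length dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical topic distributions, shaped like plain LDA output:
# one list of (topic_id, probability) pairs per document.
corpus_topics = [
    [(0, 0.9), (2, 0.1)],
    [(1, 0.5), (3, 0.5)],
    [(0, 0.2), (2, 0.8)],
]
query_topics = [(0, 0.8), (2, 0.2)]

query_dense = to_dense(query_topics, NUM_TOPICS)
sims = [cosine(to_dense(doc, NUM_TOPICS), query_dense) for doc in corpus_topics]
best = max(range(len(sims)), key=sims.__getitem__)  # index of the closest document
print(best, [round(s, 3) for s in sims])
```

Note that to_dense unpacks every entry as a (feature_id, weight) pair, so it only works when each document really is a flat list of pairs.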

Hope this helps. Good luck.

Harshal Parekh
    Yes, after reading the documentation, it looks like you can't use MatrixSimilarity directly on LDA output, because the output does not come as a matrix with similarity coefficients (as the LSI output does, which is why LSI is always recommended for that function in the documentation). To compute similarity with LDA, I first need to transform the output into something that resembles LSI's output, for example, and then use MatrixSimilarity(). The theoretical justification is detailed here: http://ceur-ws.org/Vol-1815/paper4.pdf – brandata Oct 27 '19 at 08:00

If you change the LDA model to lda_model = gensim.models.ldamodel.LdaModel(corpus=vectorized_corpus, id2word=id2word, num_topics=20, random_state=100, update_every=1, chunksize=400, passes=10, alpha='auto')

the similarity works. The reason is that the per_word_topics=True argument in the original post changes the output of model_vec = lda_model[new_doc_freq_vector]: instead of a plain list with the probability of the document belonging to each topic, it also returns per-word topic information (for each word in the document, the topics it most likely belongs to and their weights). The similarity function expects the plain list, so this difference in format makes it raise an error when the argument is True. If you remove it, everything works fine. More details here: https://github.com/RaRe-Technologies/gensim/issues/2644
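To make the format difference concrete, here is a minimal gensim-free sketch. It assumes the per_word_topics=True output is a 3-tuple (topic distribution, per-word topic lists, per-word phi values), as described in the gensim documentation for get_document_topics. The helper max_field_id is hypothetical and only mimics the unpacking pattern visible in the traceback; the numbers are made up.

```python
# Shape without per_word_topics: a flat list of (topic_id, probability) pairs.
doc_topics = [(0, 0.6), (3, 0.3), (7, 0.1)]

# Assumed shape with per_word_topics=True: a tuple of three lists.
with_word_topics = (
    [(0, 0.6), (3, 0.3), (7, 0.1)],             # document-topic distribution
    [(12, [0]), (40, [3, 0]), (51, [7])],       # word id -> likely topics
    [(12, [(0, 1.0)]), (40, [(3, 0.6), (0, 0.4)]), (51, [(7, 1.0)])],  # word id -> phi values
)


def max_field_id(document):
    """Same unpacking pattern as the generator shown in the traceback."""
    return max(field_id for field_id, _ in document)


print(max_field_id(doc_topics))  # each item is a pair, so unpacking works

error_message = None
try:
    # Each item is now a whole list with three entries, not a pair.
    max_field_id(with_word_topics)
except ValueError as err:
    error_message = str(err)
print(error_message)  # a "too many values to unpack" ValueError

# Taking only the first element recovers the flat topic distribution.
print(max_field_id(with_word_topics[0]))
```

If the per-word information is still needed elsewhere, one option under the same assumption is to keep per_word_topics=True and pass only the first element of each result to the similarity index. That would mean iterating over the transformed corpus yourself, which is untested here.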

brandata