I'm using an Anaconda environment with Python 3.7 and gensim 3.8.0. My data is in a dataframe that I split into a training set and a test set; both have this structure:
X_test and X_train dataframe format:
id alltext
1710 3264537 [exmodelo, karen, mcdougal, asegura, mantuvo, ...
8211 3272079 [grupo, socialista, pionero, supone, apoyar, n...
1885 3263933 [parte, entrenador, zaragoza, javier, aguirre,...
2481 3263744 [fans, hielo, fuego, saga, literaria, dio, pie...
2975 3265302 [actividad, busca, repetir, tres, ediciones, a...
The alltext column is already preprocessed into lists of tokens.
This is the code I use to create my model:
import gensim
from gensim import corpora, similarities
from pprint import pprint

# Build the dictionary and the bag-of-words corpus from the training tokens
id2word = corpora.Dictionary(X_train["alltext"])
texts = X_train["alltext"]
corpus = [id2word.doc2bow(text) for text in texts]

lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                            id2word=id2word,
                                            num_topics=20,
                                            random_state=100,
                                            update_every=1,
                                            chunksize=400,
                                            passes=10,
                                            alpha='auto',
                                            per_word_topics=True)
Up to this point, everything works fine. I can use
pprint(lda_model.print_topics())
doc_lda = lda_model[corpus]
to get my topics.
The problem comes when I try to compute the similarity between a new document and the corpus. Here is the code I'm using:
newddoc = X_test["alltext"][2730]  # get a particular instance from the test set
new_doc_freq_vector = id2word.doc2bow(newddoc)  # vectorize its list of words
model_vec = lda_model[new_doc_freq_vector]  # run the trained model on it
index = similarities.MatrixSimilarity(lda_model[corpus])  # error
sims = index[model_vec]  # error
In the last two lines, I get this error:
-------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-110-352248c464f8> in <module>
4
5 #index = Similarity('model/indexes/similarity_index_01', lda_model[corpus], num_features=len(id2word)) #the first argument, the place where the
----> 6 index = similarities.MatrixSimilarity(lda_model[corpus]) # works if we use just corpus instead of lda_model[corpus]
7 index = similarities.MatrixSimilarity(model_vec)
8 #sims = index[model_vec] # works if we use index[new_doc_freq_vector] instead of model_vec
~\AppData\Local\Continuum\anaconda3\envs\lda_henneo_01\lib\site-packages\gensim\similarities\docsim.py in __init__(self, corpus, num_best, dtype, num_features, chunksize, corpus_len)
776 "scanning corpus to determine the number of features (consider setting `num_features` explicitly)"
777 )
--> 778 num_features = 1 + utils.get_max_id(corpus)
779
780 self.num_features = num_features
~\AppData\Local\Continuum\anaconda3\envs\lda_henneo_01\lib\site-packages\gensim\utils.py in get_max_id(corpus)
734 for document in corpus:
735 if document:
--> 736 maxid = max(maxid, max(fieldid for fieldid, _ in document))
737 return maxid
738
~\AppData\Local\Continuum\anaconda3\envs\lda_henneo_01\lib\site-packages\gensim\utils.py in <genexpr>(.0)
734 for document in corpus:
735 if document:
--> 736 maxid = max(maxid, max(fieldid for fieldid, _ in document))
737 return maxid
738
ValueError: too many values to unpack (expected 2)
I have no idea how to solve this, and I have been trying to debug it for 3 hours now. I believe I followed the same code many other people use for getting similarity.
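For reference, the unpacking failure in the traceback can be reproduced without gensim. The sketch below copies the loop from `utils.get_max_id` as it appears in the traceback and feeds it two hand-made documents: a plain list of pairs, and a 3-element tuple. The 3-tuple shape is an assumption about what `lda_model[...]` returns per document when the model is created with `per_word_topics=True`; all sample values are made up.

```python
def get_max_id(corpus):
    """Stand-in for the loop shown in the traceback (gensim/utils.py)."""
    maxid = -1
    for document in corpus:
        if document:
            maxid = max(maxid, max(fieldid for fieldid, _ in document))
    return maxid


# A plain list of (id, weight) pairs unpacks cleanly.
pairs_doc = [(0, 0.2), (3, 0.5), (7, 0.3)]

# Assumed shape with per_word_topics=True: (doc_topics, word_topics, phi_values).
# The values are invented for illustration only.
triple_doc = (
    [(0, 0.2), (3, 0.5), (7, 0.3)],                  # topic weights for the document
    [(10, [3, 0]), (11, [7])],                       # likely topics per word id
    [(10, [(3, 0.9), (0, 0.1)]), (11, [(7, 1.0)])],  # per-word topic weights
)

print(get_max_id([pairs_doc]))

try:
    get_max_id([triple_doc])
except ValueError as err:
    # Each element of the tuple is a whole list, so it cannot unpack into two names.
    error_message = str(err)
    print("ValueError:", error_message)

# Keeping only the first element restores the expected list of pairs.
print(get_max_id([triple_doc[0]]))
```

If that assumption is right, the error comes from the shape of each transformed document rather than from `MatrixSimilarity` itself.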
Things I have tried to solve this:
1) Using
Similarity('model/indexes/similarity_index_01', lda_model[corpus], num_features=len(id2word))
This did not work either; I got the same error.
2) If I replace lda_model[corpus] with corpus and index[model_vec] with index[new_doc_freq_vector], similarities.MatrixSimilarity() works. But I don't believe it gives the proper result, because the model information is no longer involved. The fact that it works suggests the problem has something to do with data types (?). If I print lda_model[corpus] I get
<gensim.interfaces.TransformedCorpus object at 0x00000221ECA8E148>
I have no idea what this means, though.
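For context, `gensim.interfaces.TransformedCorpus` is the wrapper gensim returns when a model is applied to a whole corpus; the transformation is applied lazily, one document at a time, during iteration, which is why printing it shows an object rather than vectors. The following is a plain-Python sketch of that behaviour. `LazyTransformed` and `fake_model` are hypothetical stand-ins, not gensim classes, and the tuple returned by `fake_model` assumes the `per_word_topics=True` output shape.

```python
class LazyTransformed:
    """Hypothetical stand-in for a lazily transformed corpus (not a gensim class)."""

    def __init__(self, transform, corpus):
        self.transform = transform
        self.corpus = corpus

    def __iter__(self):
        # The transformation runs only when a document is requested.
        for doc in self.corpus:
            yield self.transform(doc)


def fake_model(bow):
    """Made-up transformation returning (doc_topics, word_topics, phi_values)."""
    doc_topics = [(0, 0.6), (1, 0.3), (2, 0.1)]
    word_topics = [(word_id, [0]) for word_id, _ in bow]
    phi_values = [(word_id, [(0, float(count))]) for word_id, count in bow]
    return doc_topics, word_topics, phi_values


toy_corpus = [[(4, 1), (9, 2)], [(1, 1)]]
lazy = LazyTransformed(fake_model, toy_corpus)

# Materialise the stream to inspect what each "document" looks like.
items = list(lazy)
print(type(items[0]).__name__, len(items[0]))

# Keep only the per-document topic distribution: a list of (topic_id, weight).
doc_topics_only = [item[0] for item in lazy]
print(doc_topics_only[0])
```

If the assumption holds, building the index from only the first element of each item (or from a model created without `per_word_topics=True`) would give `MatrixSimilarity` the list-of-pairs input it expects; this is untested against the real model.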