
I tried generating topics with gensim on 300,000 records. When I try to visualize the topics, I get a validation error. I can print the topics after training the model, but it fails when using pyLDAvis:

# Run and train the LDA model on the document-term matrix.
ldamodel1 = Lda(doc_term_matrix1, num_topics=10, id2word=dictionary1, passes=50, workers=4)

print(ldamodel1.print_topics(num_topics=10, num_words=10))
# pyLDAvis
d = gensim.corpora.Dictionary.load('dictionary1.dict')
c = gensim.corpora.MmCorpus('corpus.mm')
lda = gensim.models.LdaModel.load('topic.model')

# error on executing this line
data = pyLDAvis.gensim.prepare(lda, c, d)

This is the error I get when running the pyLDAvis prepare call above:

---------------------------------------------------------------------------
ValidationError                           Traceback (most recent call last)
<ipython-input-53-33fd88b65056> in <module>()
----> 1 data = pyLDAvis.gensim.prepare(lda, c, d)
      2 data

C:\ProgramData\Anaconda3\lib\site-packages\pyLDAvis\gensim.py in prepare(topic_model, corpus, dictionary, doc_topic_dist, **kwargs)
    110     """
    111     opts = fp.merge(_extract_data(topic_model, corpus, dictionary, doc_topic_dist), kwargs)
--> 112     return vis_prepare(**opts)

C:\ProgramData\Anaconda3\lib\site-packages\pyLDAvis\_prepare.py in prepare(topic_term_dists, doc_topic_dists, doc_lengths, vocab, term_frequency, R, lambda_step, mds, n_jobs, plot_opts, sort_topics)
    372    doc_lengths      = _series_with_name(doc_lengths, 'doc_length')
    373    vocab            = _series_with_name(vocab, 'vocab')
--> 374    _input_validate(topic_term_dists, doc_topic_dists, doc_lengths, vocab, term_frequency)
    375    R = min(R, len(vocab))
    376 

C:\ProgramData\Anaconda3\lib\site-packages\pyLDAvis\_prepare.py in _input_validate(*args)
     63    res = _input_check(*args)
     64    if res:
---> 65       raise ValidationError('\n' + '\n'.join([' * ' + s for s in res]))
     66 
     67 

ValidationError: 
 * Not all rows (distributions) in topic_term_dists sum to 1.
Hackerds
  • Ran into the same issue when switching from the training docs to another set of docs. Are you sure it's the same dictionary? You might be loading an older version. – JJFord3 Mar 01 '18 at 20:48
  • Check that your corpus contains no NaNs, Nones, '-'s, etc. This is usually because LDA, NMF, etc. doesn't know how to deal with documents that are too short or otherwise invalid. – Derek Allums Mar 14 '18 at 16:09

4 Answers


This happens because pyLDAvis expects that every term in the model's dictionary shows up in the corpus at least once. It typically occurs when you do some preprocessing after building your corpus/texts but before building your model.

A word in the model's internal dictionary that is missing from the dictionary you pass to prepare() gets dropped from the topic-term distributions, so each row then sums to slightly less than one and validation fails.
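A quick way to confirm the mismatch before calling prepare() (a sketch that mirrors how pyLDAvis reorders the topic-term matrix by the dictionary's token ids; lda and d as in the question):

import numpy as np

# the model's vocabulary size and the dictionary passed to prepare()
# should agree
print(lda.num_terms, len(d))

# pyLDAvis keeps only the columns for ids present in the dictionary;
# if the dictionary is missing terms, these row sums drop below 1
ids = np.array(list(d.token2id.values()))
print(lda.get_topics()[:, ids].sum(axis=1))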

You can fix this either by adding the missing words to your corpus dictionary (or by adding the words to the corpus and building a dictionary from that), or by editing site-packages\pyLDAvis\gensim.py: add the following line just before the assertion assert topic_term_dists.shape[0] == doc_topic_dists.shape[1] (around line 67):

topic_term_dists = topic_term_dists / topic_term_dists.sum(axis=1)[:, None]

Assuming your code ran up to that point, this renormalizes the topic distributions without the missing dictionary items. Note, though, that it would be better to include all the terms in the corpus in the first place.
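If you take the first route (fixing the data rather than patching the library), a minimal sketch of the idea: build the dictionary and the corpus from the same tokenized documents and pass those same objects to prepare(), so that every id the model knows also occurs in the corpus (texts is a hypothetical name for your list of token lists):

from gensim import corpora

# build dictionary and corpus from the same tokenized documents
dictionary1 = corpora.Dictionary(texts)   # texts: hypothetical token lists
doc_term_matrix1 = [dictionary1.doc2bow(doc) for doc in texts]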

AzureX

I ran into the same validation error.

My issue was that even though pyLDAvis runs a normalization step (see _row_norm in sklearn.py) to ensure that the doc_topic_dists and topic_term_dists probabilities sum to 1, it cannot do so for a document none of whose words appear in the document-term matrix (i.e. a row of all zeros): such a row can only sum to 0.

Run a sum over your document vectors; if any of them sums to 0, you probably want to drop that row/document:

import numpy as np

np.sum(lda.transform(docu_term_matrix), axis=1)
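If any of those sums is 0, a sketch of the filtering step, assuming docu_term_matrix is a scipy sparse document-term matrix (e.g. from CountVectorizer):

import numpy as np

# keep only documents that contain at least one counted term
row_sums = np.asarray(docu_term_matrix.sum(axis=1)).ravel()
docu_term_matrix = docu_term_matrix[row_sums > 0]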
GarryC

This happened in my HDPModel after I filtered my dictionary: I was left with a lot of zero-length documents, which produced this error. I eliminated them before saving my MmCorpus to disk, which solved the problem when running HDP later (corpus is the generator for my text documents):

corpora.MmCorpus.serialize(args.save_folder + '/gensim.mm', (x for x in corpus if len(x) > 0))

user108569

I had this problem and solved it by filtering words with very low frequencies out of the dictionary before generating the corpus:

dictionary.filter_extremes(no_below=2, no_above=1.0)

I suspect that a sum over many extremely low probabilities does not come out to exactly 1 because of floating-point rounding.
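Note that filter_extremes modifies the dictionary in place, so the corpus has to be rebuilt from the filtered dictionary before training. A minimal sketch (texts is a hypothetical name for your tokenized documents; keep_n=None lifts gensim's default cap of 100,000 kept terms):

dictionary.filter_extremes(no_below=2, no_above=1.0, keep_n=None)
# rebuild the bag-of-words corpus from the filtered dictionary
corpus = [dictionary.doc2bow(doc) for doc in texts]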

Mapad