
Periodically, when I run topic analyses on data and try to visualize them using pyLDAvis, I get a validation error: "Not all rows (distributions) in doc_topic_dists sum to 1." Here's the basic code:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF
import pyLDAvis.sklearn

tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=1, max_features=None, stop_words='english')
tfidf = tfidf_vectorizer.fit_transform(lines2)
tfidf_feature_names = tfidf_vectorizer.get_feature_names()
nmf = NMF(n_components=3, random_state=None, alpha=.1, l1_ratio=.5, init='nndsvd').fit(tfidf)
panel = pyLDAvis.sklearn.prepare(nmf, tfidf, tfidf_vectorizer, mds='tsne')

The culprit is the last statement (the 'panel = ' line): evidently the matrix produced by nmf.transform(tfidf) contains some rows that are all zeroes, so when pyLDAvis tries to normalize each row into a distribution, those rows produce NaNs. No combination of model parameters seems to fix this; in fact, changing them often makes the problem worse by producing more all-zero rows.
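To make the failure mode concrete, here's a minimal sketch (with a hypothetical hand-made matrix standing in for nmf.transform(tfidf)) showing how an all-zero row turns into NaNs under the row normalization pyLDAvis performs:

```python
import numpy as np

# Simulated doc-topic weight matrix like the one nmf.transform(tfidf)
# returns; row 2 is all zeros, as happens for documents the model
# assigns no topic weight to.
W = np.array([[0.2, 0.5, 0.3],
              [0.9, 0.1, 0.0],
              [0.0, 0.0, 0.0]])

row_sums = W.sum(axis=1, keepdims=True)
with np.errstate(invalid='ignore'):
    doc_topic = W / row_sums   # pyLDAvis-style row normalization

print(doc_topic[2])            # → [nan nan nan], hence the validation error
```

Counting how many rows satisfy `W.sum(axis=1) == 0` is an easy way to see how many documents are affected before calling prepare.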

FWIW, the data involved is the text of Tweets from BBC Health, so the average response length is fairly short -- a little under 4,000 records, each an average of 4.8 words long. Nonetheless, I've verified that the zero-weighted responses all include words that are in the model vocabulary, so I'm unsure why the problem occurs or how to fix it.

If there is no way to fix it, would it be sensible simply to substitute in the column means in these cases?
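In case it helps to see what I mean, here's a sketch of that substitution (patch_zero_rows is a hypothetical helper, not part of pyLDAvis): replace each all-zero row with the column means of the valid rows, then renormalize so every row sums to 1 as doc_topic_dists requires.

```python
import numpy as np

def patch_zero_rows(W):
    """Replace all-zero rows of a doc-topic weight matrix with the
    column means of the remaining rows, then make rows sum to 1."""
    W = np.asarray(W, dtype=float).copy()
    zero = W.sum(axis=1) == 0
    if zero.any():
        W[zero] = W[~zero].mean(axis=0)   # substitute column means
    return W / W.sum(axis=1, keepdims=True)

W = np.array([[0.2, 0.5, 0.3],
              [0.0, 0.0, 0.0]])
doc_topic = patch_zero_rows(W)
print(doc_topic[1])  # → [0.2 0.5 0.3], the mean distribution
```

The patched matrix could then presumably be fed to the lower-level pyLDAvis.prepare instead of pyLDAvis.sklearn.prepare, though I haven't confirmed that the visualization remains meaningful for those documents.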

sw85