
I used Latent Dirichlet Allocation (the sklearn implementation) to analyse about 500 scientific article abstracts (in German) and got topics containing the most important words for each topic. My problem is interpreting the values associated with those words. I assumed I would get probabilities for all words per topic that add up to 1, but that is not the case.

How can I interpret these values? For example, I would like to be able to tell why topic #20 has words with much higher values than the other topics. Does their absolute magnitude have anything to do with Bayesian probability? Is the topic more common in the corpus? I am not yet able to connect these values with the math behind LDA.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# stop_ger is a German stop word list and stemmer_sklearn.stem_ger() returns a
# custom German stemmer/tokenizer; both are defined elsewhere in my code.
tf_vectorizer = CountVectorizer(max_df=0.95, min_df=1, stop_words=stop_ger,
                                analyzer='word',
                                tokenizer=stemmer_sklearn.stem_ger())

tf = tf_vectorizer.fit_transform(texts)

n_topics = 10
lda = LatentDirichletAllocation(n_topics=n_topics, max_iter=5, 
                                learning_method='online',                 
                                learning_offset=50., random_state=0)

lda.fit(tf)

def print_top_words(model, feature_names, n_top_words):
    # model.components_ has shape (n_topics, n_words);
    # argsort()[:-n_top_words - 1:-1] picks the indices of the
    # n_top_words largest values per topic.
    for topic_id, topic in enumerate(model.components_):
        print('\nTopic Nr.%d:' % int(topic_id + 1))
        print(''.join([feature_names[i] + ' ' + str(round(topic[i], 2))
              + ' | ' for i in topic.argsort()[:-n_top_words - 1:-1]]))

n_top_words = 4
tf_feature_names = tf_vectorizer.get_feature_names()
print_top_words(lda, tf_feature_names, n_top_words)

Topic Nr.1: demenzforsch 1.31 | fotus 1.21 | umwelteinfluss 1.16 | forschungsergebnis 1.04 |
Topic Nr.2: fur 1.47 | zwisch 0.94 | uber 0.81 | kontext 0.8 |
...
Topic Nr.20: werd 405.12 | fur 399.62 | sozial 212.31 | beitrag 177.95 | 
– LSz
  • Did you reach any conclusion about this? I'm facing the same problem. Have you tried the score method? In my code it returns a NaN... – João Almeida Mar 07 '16 at 13:00
  • Just found [this issue](https://github.com/scikit-learn/scikit-learn/issues/6353) on the scikit-learn GitHub; this implementation seems to still have too many bugs to be useful. It is probably better to use [gensim](https://radimrehurek.com/gensim/) instead. – João Almeida Mar 07 '16 at 13:39
  • Thanks for sharing the GitHub link! – LSz Mar 09 '16 at 15:44
  • I discovered the lambda formula connected to sklearn's `components_` in "Latent Dirichlet Allocation" by Blei/Ng/Jordan (p. 1007). There is no normalisation there either, so I think the sklearn implementation is correct. In my case it was quite interesting to get very high values for one topic, and this also fitted well with the common meaning of that topic's tokens. I think the differences in magnitude go together with the Dirichlet distribution, so higher values mean that a topic occurs more often in the corpus (a rough way to check this is sketched below the comments). If I'm right, we actually lose information by normalising. – LSz Mar 09 '16 at 15:45
  • You may lose "local" information, meaning a certain bias toward your dataset. That is not a bad thing for generalization, though. – patrickmesana Sep 08 '17 at 01:18
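A rough sketch of how the interpretation in the last comments could be checked (it assumes the `lda` and `tf` objects from the question's code; the variable names are only illustrative): compare each topic's total pseudo-count mass in `components_` with the topic's overall weight in the document-topic distribution returned by `transform`. If large raw values really mean a topic occurs more often in the corpus, the two rankings should roughly agree.

import numpy as np

# Assumes `lda` and `tf` from the question's code.
pseudo_count_mass = lda.components_.sum(axis=1)   # total raw mass of each topic row
doc_topic = lda.transform(tf)                     # per-document topic proportions
corpus_topic_weight = doc_topic.sum(axis=0)       # topic weight summed over all documents

for k in np.argsort(-pseudo_count_mass):
    print('Topic Nr.%d: pseudo-count mass %.1f | corpus weight %.2f'
          % (k + 1, pseudo_count_mass[k], corpus_topic_weight[k]))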

1 Answer


From the documentation:

> `components_`: Variational parameters for topic word distribution. Since the complete conditional for topic word distribution is a Dirichlet, `components_[i, j]` can be viewed as pseudocount that represents the number of times word j was assigned to topic i. It can also be viewed as distribution over the words for each topic after normalization: `model.components_ / model.components_.sum(axis=1)[:, np.newaxis]`.

So the values can be seen as a distribution over words once you normalize each row of `components_`, which lets you evaluate the importance of each term within a topic. As far as I understand, you cannot use the raw pseudo-counts to compare the importance of two topics in the corpus, since they also contain the smoothing factor applied to the topic-word distribution.
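A minimal sketch of that normalization (assuming the fitted `lda` object from the question):

import numpy as np

# Normalize each topic row of the pseudo-counts into a probability distribution over words.
topic_word = lda.components_ / lda.components_.sum(axis=1)[:, np.newaxis]

print(topic_word.sum(axis=1))   # each row now sums to 1
# topic_word[i, j] is the probability of word j under topic i, so the top words
# of different topics can be compared on the same scale.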

– Simon Thordal