
Context

I'm trying to extract topics from a set of texts using Latent Dirichlet Allocation from scikit-learn's decomposition module. This works really well, except for the quality of the topic words that are found/selected.

In an article by Li et al. (2017), the authors describe using prior topic words as input for the LDA. They manually choose 4 topics and the main words associated with/belonging to these topics. For these words, they set the default value to a high number for the associated topic and 0 for the other topics. All other words (not manually selected for a topic) are given equal values for all topics (1). This matrix of values is used as input for the LDA.
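The seeding scheme described above can be sketched as a small NumPy matrix (the word indices, topic count, and boost value here are illustrative, not taken from the paper):

```python
import numpy as np

n_words, n_topics = 6, 4
boost = 1000.0  # high value for a word's designated topic (illustrative)

# Seed words: word index -> the topic it should anchor (hypothetical choices)
seed_words = {0: 0, 2: 1, 5: 3}

# All non-seeded words get equal values (1) across all topics
prior = np.ones((n_words, n_topics))
for word, topic in seed_words.items():
    prior[word, :] = 0.0        # 0 for the other topics
    prior[word, topic] = boost  # high value for the associated topic
```

Each row is a word, each column a topic; only the three seeded rows deviate from the uniform value of 1.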

My question

How can I create a similar analysis with the LatentDirichletAllocation module from Scikit-Learn using a customized default values matrix (prior topics words) as input?

(I know there's a topic_word_prior parameter, but it only takes one float instead of a matrix with different 'default values'.)
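For reference, the stock estimator only takes a scalar prior, applied uniformly to every topic/word pair; a minimal sketch on a toy count matrix:

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Toy document-word count matrix: 3 documents, 3 words
X = np.array([[2, 0, 1],
              [0, 3, 1],
              [1, 1, 4]])

# topic_word_prior is a single float (eta), not a per-topic/per-word matrix
lda = LatentDirichletAllocation(n_components=2, topic_word_prior=0.1,
                                random_state=0)
doc_topics = lda.fit_transform(X)
```

There is no documented way to pass a matrix here, which is what motivates the subclassing approach in the answers below.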

Philip
  • Have you tried manually editing the coefficients of the components_ matrix of your model? It seems to me like it's what you are trying to achieve. – Anis Jul 18 '17 at 14:49
  • Thanks for the quick reply, that is what I'm trying to figure out. (I'm not sure which (internal) property I have to/can adjust, and what range of values I can put in there.) – Philip Jul 18 '17 at 14:59
  • It seems to me like it's the components_ matrix of your model, since it is directly the one that is used from training. You could use `model.components_[i, j] = aij` to set the value aij for topic i and feature j. – Anis Jul 18 '17 at 15:14
  • I'm assuming this should happen before fitting the model? And does the range of values matter? (e.g. Can I use the 0, 1 and large positive float?) – Philip Jul 18 '17 at 15:20

2 Answers


After taking a look at the source and the docs, it seems to me like the easiest thing to do is subclass LatentDirichletAllocation and only override the _init_latent_vars method. It is the method called in fit to create the components_ attribute, which is the matrix used for the decomposition. By re-implementing this method, you can set it just the way you want, and in particular, boost the prior weights for the related topics/features. You would re-implement there the logic of the paper for the initialization.
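One way to sketch this approach (an illustrative variant, not the asker's final code: it reuses the parent's initializer instead of copying its body, recomputes the Dirichlet expectation with scipy.special.psi rather than scikit-learn's private helper, and attaches the seed list after construction to sidestep the __init__ signature convention):

```python
import numpy as np
from scipy.special import psi  # digamma function
from sklearn.decomposition import LatentDirichletAllocation


class SeededLDA(LatentDirichletAllocation):
    # List of (word_index, per-topic multipliers); set after construction
    ptws = None

    def _init_latent_vars(self, n_features, *args, **kwargs):
        # Let the parent draw the usual random initial `components_`
        super()._init_latent_vars(n_features, *args, **kwargs)
        if self.ptws:
            for word_index, topic_boosts in self.ptws:
                self.components_[:, word_index] *= topic_boosts
            # Recompute exp(E[log(beta)]) for the edited matrix:
            # E[log beta_kw] = psi(lambda_kw) - psi(sum_w lambda_kw)
            self.exp_dirichlet_component_ = np.exp(
                psi(self.components_)
                - psi(self.components_.sum(axis=1))[:, np.newaxis])


# Toy usage: boost word 0 toward topic 0
X = np.array([[3, 0, 1, 0], [0, 2, 0, 1], [1, 0, 2, 0]])
model = SeededLDA(n_components=2, random_state=0)
model.ptws = [(0, [50.0, 1.0])]
model.fit(X)
```

Storing ptws outside __init__ keeps the scikit-learn estimator contract intact, at the cost that clone() will not carry the seed list over.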

Anis
  • Thanks! I'm working along this line and will post the code if I find the solution :) – Philip Jul 18 '17 at 15:51
  • Yes, it is really not obvious to initialize this matrix, and I can't really go further; it is now up to you. One last thing: take a look at the original implementation of _init_latent_vars, and you'll see that there is another matrix called `exp_dirichlet_component_` that you'll need to take care of after you're done with `components_`. – Anis Jul 18 '17 at 16:00
  • Yes, I'm transforming the `components_` matrix with the ptw-matrix before `exp_dirichlet_component_` is calculated, so that should take care of that. Now testing my implementation and I'll keep you posted – Philip Jul 18 '17 at 16:15
  • I am getting the following error when I try this solution: RuntimeError: scikit-learn estimators should always specify their parameters in the signature of their __init__ (no varargs). with constructor (self, ptws=None, *args, **kwargs) doesn't follow this convention. – Venkatachalam Feb 20 '18 at 13:15
  • @AILearning I updated the code in my solution block. It should now pass the test you mentioned. The drawback of doing it like this is that the code becomes less readable, and if default values of the superclass change, it won't take those over automatically. But anyway, it should now comply with the suggested standard. – Philip Oct 28 '18 at 00:40

Using Anis' help, I created a subclass of the original module and edited the method that sets the initial values matrix. For every prior topic word you give as input, it transforms the components_ matrix by multiplying that word's column by the supplied topic values.

This is the code:

import numpy as np

# List with prior topic words as tuples
# (word index, [topic values])
prior_topic_words = []

# Example (word at index 3000 belongs to topic with index 0)
prior_topic_words.append(
    (3000, [(np.finfo(np.float64).max/4), 0., 0., 0., 0.])
)

# Custom subclass for PTW-guided LDA
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.utils import check_random_state

# The location of this private helper depends on the scikit-learn version
try:
    from sklearn.decomposition._online_lda_fast import _dirichlet_expectation_2d
except ImportError:
    from sklearn.decomposition._online_lda import _dirichlet_expectation_2d


class PTWGuidedLatentDirichletAllocation(LatentDirichletAllocation):

    def __init__(self, n_components=10, doc_topic_prior=None,
                 topic_word_prior=None, learning_method='batch',
                 learning_decay=0.7, learning_offset=10.0, max_iter=10,
                 batch_size=128, evaluate_every=-1, total_samples=1000000.0,
                 perp_tol=0.1, mean_change_tol=0.001, max_doc_update_iter=100,
                 n_jobs=None, verbose=0, random_state=None, ptws=None):
        # Note: the deprecated `n_topics` parameter (removed in
        # scikit-learn 0.21) is no longer forwarded to the superclass.
        super(PTWGuidedLatentDirichletAllocation, self).__init__(
            n_components=n_components, doc_topic_prior=doc_topic_prior,
            topic_word_prior=topic_word_prior, learning_method=learning_method,
            learning_decay=learning_decay, learning_offset=learning_offset,
            max_iter=max_iter, batch_size=batch_size,
            evaluate_every=evaluate_every, total_samples=total_samples,
            perp_tol=perp_tol, mean_change_tol=mean_change_tol,
            max_doc_update_iter=max_doc_update_iter, n_jobs=n_jobs,
            verbose=verbose, random_state=random_state)
        self.ptws = ptws

    def _init_latent_vars(self, n_features):
        """Initialize latent variables."""

        self.random_state_ = check_random_state(self.random_state)
        self.n_batch_iter_ = 1
        self.n_iter_ = 0

        if self.doc_topic_prior is None:
            self.doc_topic_prior_ = 1. / self.n_components
        else:
            self.doc_topic_prior_ = self.doc_topic_prior

        if self.topic_word_prior is None:
            self.topic_word_prior_ = 1. / self.n_components
        else:
            self.topic_word_prior_ = self.topic_word_prior

        init_gamma = 100.
        init_var = 1. / init_gamma
        # In the literature, this is called `lambda`
        self.components_ = self.random_state_.gamma(
            init_gamma, init_var, (self.n_components, n_features))

        # Scale the columns of the initial matrix for the prior topic words
        if self.ptws is not None:
            for ptw in self.ptws:
                word_index = ptw[0]
                word_topic_values = ptw[1]
                self.components_[:, word_index] *= word_topic_values

        # In the literature, this is `exp(E[log(beta)])`
        self.exp_dirichlet_component_ = np.exp(
            _dirichlet_expectation_2d(self.components_))
Instantiation is the same as for the original LatentDirichletAllocation class, but now you can provide prior topic words via the ptws parameter.
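To build the ptws list from actual words rather than hard-coded indices (the example above uses index 3000 directly), the vocabulary of a fitted CountVectorizer can be used; a sketch with made-up documents and seed words:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical corpus and seed words, for illustration only
docs = ["the price of oil rose", "the team scored a late goal",
        "oil price falls", "goal keeper saves the team"]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

n_topics = 2
seed = {"price": 0, "goal": 1}  # word -> topic it should anchor

prior_topic_words = []
for word, topic in seed.items():
    boosts = [0.0] * n_topics
    boosts[topic] = 1e6  # large value for the designated topic, 0 elsewhere
    prior_topic_words.append((vectorizer.vocabulary_[word], boosts))
```

The resulting list can then be passed as ptws=prior_topic_words, and the same X fed to fit, so that the word indices and the columns of components_ line up.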

Philip