
I'm building an LDA topic model on a medium-sized corpus using gensim in Python. We already know roughly some of the topics we're expecting. In particular, we know that a particular topic definitely exists within the corpus, and we want the model to find that topic for us so that we can extract the elements of the corpus that fall under it.

Is there a way of manually setting the initial conditions of one of your topics in gensim to give the model a shove in the 'right' direction?

The idea would be to take a handful of known examples of the target topic and set the probability of each word to its frequency within the known examples. Or something in the neighborhood of that idea.
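For concreteness, here's a rough sketch of the frequency computation I mean (the documents are made up, and I'm not assuming gensim actually exposes a hook to consume this):

```python
# Hypothetical seeding step: given a handful of known example documents,
# compute each word's relative frequency to use as the starting word
# distribution for one topic.
from collections import Counter

seed_docs = [
    ["solar", "panel", "energy", "grid"],
    ["wind", "turbine", "energy", "grid"],
]  # pre-tokenised known examples of the target topic (made-up data)

counts = Counter(token for doc in seed_docs for token in doc)
total = sum(counts.values())
seed_topic_probs = {word: n / total for word, n in counts.items()}
print(seed_topic_probs)  # e.g. {'energy': 0.25, 'grid': 0.25, ...}
```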

Thanks in advance for your help!

1 Answer


As LDA is traditionally an unsupervised method, it's more common to let it tell you what topics it finds by its rules, then see which (if any) of those match your preconceptions.

Gensim has no way to pre-seed an LDA model/session with biases towards finding/defining certain topics.

You might use your conceptions of a topic that "should" exist, or certain documents that "should" be together, to tune your choice of other parameters to ensure final results best meet that goal, or to postprocess the LDA results with labeling/combinations to match your desired groupings.

But also, if one topic is of preeminent importance, or has your best set of labeled training examples, you may want to consider training a binary classifier to predict whether or not documents are in that topic. Or, as your set of preferred topics with labeled examples grows, a multi-label classifier to assign documents to topics.
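A minimal sketch of that binary-classifier route might look like this, assuming scikit-learn (not part of the question's gensim setup) and with made-up documents and labels:

```python
# Binary classifier for one known topic-of-interest: 1 = target topic,
# 0 = everything else. Toy data; any vectorizer/classifier pair works.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

docs = ["solar panels feed the grid", "the match ended in a draw",
        "wind turbines generate energy", "the striker scored twice"]
labels = [1, 0, 1, 0]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(docs, labels)

# Probability that a new document belongs to the target topic:
print(clf.predict_proba(["offshore wind farms supply power"])[0][1])
```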

Classifiers are the more appropriate tool when you want a system to deduce known categories, though of course hybrid approaches can also be useful. For example, LDA runs may help suggest new categories, and the outputs of an LDA run could be added as features to assist downstream supervised classifiers. Or documents decorated with extra tokens from supervised classification could be analyzed by downstream LDA.
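As a rough illustration of the feature-passing direction (toy data; `num_topics` and the other parameters are placeholders, not recommendations):

```python
# Use per-document LDA topic proportions as extra features for a
# downstream supervised classifier.
from gensim.corpora import Dictionary
from gensim.models import LdaModel
import numpy as np

texts = [["solar", "panel", "grid"], ["wind", "energy", "grid"],
         ["football", "match", "goal"], ["striker", "goal", "score"]]
dictionary = Dictionary(texts)
bow = [dictionary.doc2bow(t) for t in texts]

lda = LdaModel(bow, id2word=dictionary, num_topics=2, random_state=0)

def topic_features(doc_bow, num_topics=2):
    """Dense vector of topic proportions for one document."""
    vec = np.zeros(num_topics)
    for topic_id, prob in lda.get_document_topics(doc_bow,
                                                  minimum_probability=0.0):
        vec[topic_id] = prob
    return vec

X = np.vstack([topic_features(d) for d in bow])
print(X)  # feed these columns into any supervised classifier
```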

(In fact, simply decorating documents that are in a desired known category with an extra synthetic token representing that category might be an interesting way to bias an LDA toward reflecting those categories, but you'd want a rigorous evaluation process for deciding whether such a hack was overall improving your true end goals or not.)
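A minimal sketch of that decoration hack, with a made-up token name and toy data:

```python
# Append a synthetic token to documents already known to be in the
# target category before running LDA, so the topic model has a shared
# signal to latch onto. Evaluate carefully before trusting the result.
known_target_ids = {0, 1}  # indices of documents with the known label

texts = [["solar", "panel", "grid"], ["wind", "energy", "grid"],
         ["football", "match", "goal"]]

decorated = [
    doc + ["_TOPIC_ENERGY_"] if i in known_target_ids else doc
    for i, doc in enumerate(texts)
]
# `decorated` then goes through the usual Dictionary/doc2bow/LdaModel steps.
```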

gojomo
  • Yeah, I'm aware that LDA is normally unsupervised. What I'm looking to do is run it in a kind of... 'semi-supervised' way. We can't train a binary classifier as we don't have pre-sorted data. That's why we're using the LDA as a method of sorting (but we know the sort of dimensions we want to sort by). The decoration is a good idea but runs into the same issue! We'd need pre-sorted data to know what to add the decorations to. Thanks, though! :) – Gareth Pearce Nov 02 '22 at 12:49
  • Well, if you do have a "handful of known examples of the target topics", then that's all you need to train an initial classifier, binary or multiclass. Then, look at some of the examples that classifier reports as marginal (or the unlabeled same-topic siblings of known examples from an unsupervised technique) & confirm/reject those initially-weak judgements about potential label-sameness, expanding the number of labeled examples in the process (especially around the 'hard cases'); see the sketch after these comments. – gojomo Nov 02 '22 at 16:28
  • *Every* evaluation process that can tell you if your automated process is getting better will need *some* labeled data, and as soon as you reach a plateau on what you can do with your initial tiny set of examples, you'll need to hand-review more via such steps. So you might as well formalize some sort of review & additional-labeling as soon as possible. Further, whatever pre-initialization weights you think you might be able to inject are really just an untested conjecture about what might help until formally tested against a real challenge. – gojomo Nov 02 '22 at 16:28
  • If you've only got, say, 5 example documents, then that's *also* all the "pre-tuning" input you've got. Every technique, theoretical or real, will try to bootstrap a head start from just those few same morsels. So you shouldn't assume them "too little" for anything until after you've tried applying them. (If you believe all 5 to belong to the same category, do they all land in a single topic via LDA, at some choice of parameters? Does hand-review of unlabeled siblings in the same topic further confirm your desired category? If so, you now have more pre-sorted data for use in experiments. Etc.) – gojomo Nov 02 '22 at 16:34
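For concreteness, the review-and-relabel loop sketched in the comments might look like this (toy data; assumes scikit-learn, and the 0.5-margin heuristic is an arbitrary placeholder):

```python
# Train on the few labeled examples, surface the *least confident*
# predictions for hand review, and fold confirmed labels back in.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
import numpy as np

labeled_docs = ["solar panels feed the grid", "the match ended in a draw"]
labels = [1, 0]
unlabeled_docs = ["wind turbines supply power",
                  "the keeper saved a penalty",
                  "energy prices and football transfers"]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(labeled_docs, labels)

probs = clf.predict_proba(unlabeled_docs)[:, 1]
margins = np.abs(probs - 0.5)    # small margin = marginal case
for i in np.argsort(margins):    # hardest cases first
    print(f"review: {unlabeled_docs[i]!r} (p={probs[i]:.2f})")
# Confirm/reject these by hand, append them to labeled_docs, and refit.
```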