Introduction
I'd like to know what other topic modellers consider to be an optimal topic-modelling workflow all the way from pre-processing to maintenance. While this question consists of a number of sub-questions (which I will specify below), I believe this thread would be useful for myself and others who are interested to learn about best practices of end-to-end process.
Proposed Solution Specifications
I'd like the proposed solution to preferably rely on R for text processing (but Python is fine also) and topic-modelling itself to be done in MALLET (although if you believe other solutions work better, please let us know). I tend to use the topicmodels
package in R
, however I would like to switch to MALLET
as it offers many benefits over topicmodels
. It can handle a lot of data, it does not rely on specific text pre-processing tools and it appears to be widely used for this purpose. However some of the issues outline below are also relevant for topicmodels
too. I'd like to know how others approach topic modelling and which of the below steps could be improved. Any useful piece of advice is welcome.
Outline
Here is how it's going to work: I'm going to go through the workflow which in my opinion works reasonably well, and I'm going to outline problems at each step.
Proposed Workflow
1. Clean text
This involves removing punctuation marks, digits, stop words, stemming words and other text-processing tasks. Many of these can be done either as part of term-document matrix decomposition through functions such as for example TermDocumentMatrix
from R
's package tm
.
Problem: This however may need to be performed on the text strings directly, using functions such as gsub
in order for MALLET to consume these strings. Performing in on the strings directly is not as efficient as it involves repetition (e.g. the same word would have to be stemmed several times)
2. Construct features
In this step we construct a term-document matrix (TDM), followed by the filtering of terms based on frequency, and TF-IDF values. It is preferable to limit your bag of features to about 1000 or so. Next go through the terms and identify what requires to be (1) dropped (some stop words will make it through), (2) renamed or (3) merged with existing entries. While I'm familiar with the concept of stem-completion, I find that it rarely works well.
Problem: (1) Unfortunately MALLET
does not work with TDM constructs and to make use of your TDM, you would need to find the difference between the original TDM -- with no features removed -- and the TDM that you are happy with. This difference would become stop words for MALLET. (2) On that note I'd also like to point out that feature selection does require a substantial amount of manual work and if anyone has ideas on how to minimise it, please share your thoughts.
Side note: If you decide to stick with R
alone, then I can recommend the quanteda
package which has a function dfm
that accepts a thesaurus
as one of the parameters. This thesaurus allows to to capture patterns (usually regex) as opposed to words themselves, so for example you could have a pattern \\bsign\\w*.?ups?
that would match sign-up
, signed up
and so on.
3. Find optimal parameters
This is a hard one. I tend to break data into test-train sets and run cross-validation fitting a model of k
topics and testing the fit using held-out data. Log likelihood is recorded and compared for different resolutions of topics.
Problem: Log likelihood does help to understand how good is the fit, but (1) it often tends to suggest that I need more topics than it is practically sensible and (2) given how long it generally takes to fit a model, it is virtually impossible to find or test a grid of optimal values such as iterations, alpha, burn-in and so on.
Side note: When selecting the optimal number of topics, I generally select a range of topics incrementing by 5 or so as incrementing a range by 1 generally takes too long to compute.
4. Maintenance
It is easy to classify new data into a set existing topics. However if you are running it over time, you would naturally expect that some of your topics may cease to be relevant, while new topics may appear. Furthermore, it might be of interest to study the lifecycle of topics. This is difficult to account for as you are dealing with a problem that requires an unsupervised solution and yet for it to be tracked over time, you need to approach it in a supervised way.
Problem: To overcome the above issue, you would need to (1) fit new data into an old set of topics, (2) construct a new topic model based on new data (3) monitor log likelihood values over time and devise a threshold when to switch from old to new; and (4) merge old and new solutions somehow so that the evolution of topics would be revealed to a lay observer.
Recap of Problems
- String cleaning for
MALLET
to consume the data is inefficient. - Feature selection requires manual work.
- Optimal number of topics selection based on LL does not account for what is practically sensible
- Computational complexity does not give the opportunity to find an optimal grid of parameters (other than the number of topics)
- Maintenance of topics over time poses challenging issues as you have to retain history but also reflect what is currently relevant.
If you've read that far, I'd like to thank you, this is a rather long post. If you are interested in the suggest, feel free to either add more questions in the comments that you think are relevant or offer your thoughts on how to overcome some of these problems.
Cheers