
I have a dataset of news articles collected on the criterion that they use the term "euroscepticism" or "eurosceptic". I have been running topic models using the lda package (with dfm matrices built in quanteda) to identify the main topics of these articles; however, the words I am interested in do not appear in any of the topics. I would therefore like to seed these words into the model, and I am not sure exactly how to do that.

I see that the topicmodels package allows for an argument called seedwords, which "can be specified as a matrix or an object class of simple_triplet_matrix", but there are no further instructions. It seems that a simple_triplet_matrix only takes integers, not strings. Does anyone know how I would then seed the words 'euroscepticism' and 'eurosceptic' into the model?
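For example, I can translate the term strings into integer indices against my dfm's vocabulary with match(), but I do not know how to get from there to whatever matrix structure seedwords expects:

seed_terms <- c("eurosceptic", "euroscepticism")
## integer column indices of the seed terms in the dfm vocabulary
match(seed_terms, colnames(UKdfm2))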

Here is a shortened version of the code:

library("quanteda")
library("lda")

##Load UK texts/create corpus
UKcorp <- corpus(textfile(file="~Michael/DM6/*"))

##Create document feature matrix 
UKdfm2 <- dfm(UKcorp, ngrams = 1, verbose = TRUE, toLower = TRUE,
         removeNumbers = TRUE, removePunct = TRUE, removeSeparators = TRUE,
         removeTwitter = FALSE, stem = TRUE,
         ignoredFeatures = stopwords(kind = "english"), keptFeatures = NULL,
         language = "english", thesaurus = NULL, dictionary = NULL,
         valuetype = "fixed")

##Convert to lda model 
UKlda2 <- convert(UKdfm2, to = "lda")

##run model
UKmod2 <- lda.collapsed.gibbs.sampler(UKlda2$documents, K = 15, UKlda2$vocab,
          num.iterations = 1500, alpha = 0.1, eta = 0.01, initial = NULL,
          burnin = NULL, compute.log.likelihood = TRUE, trace = 0L,
          freeze.topics = FALSE)
J_F
  • Are you sure that the words you are after are in the `dtm` prior to running the `lda`? If the words are rather sparse, they may be dropped. Also, you are using `stem = TRUE`. This may stem the word "euroscepticism" down to just 'euro'. Might be something to check out. – Bryan Goggin Jun 09 '16 at 13:13

1 Answer


"Seeding" words in the topicmodels package is a different procedure, as it allows you when estimating through the collapsed Gibbs sampler to attach prior weights for words. (See for instance Jagarlamudi, J., Daumé, H., III, & Udupa, R. (2012). Incorporating lexical priors into topic models (pp. 204–213). Association for Computational Linguistics.) But this is part of an estimation strategy for topics, not a way of ensuring that key words of interest remain in your fitted topics. Unless you have set a threshold for removing them based on sparsity, before calling lad::lda.collapsed.gibbs.sampler(), then *every* term in yourUKlda2$vocab` vector will be assigned probabilities across topics.

Probably what is happening here is that your words are of such low frequency that they do not appear near the top of any of your topics. It's also possible that stemming has changed them, e.g.:

quanteda::char_wordstem("euroscepticism")
## [1] "eurosceptic"

I suggest you first make sure that your words exist in the dfm, through:

colSums(UKdfm2)["eurosceptic"]

Then you can look at the distribution of topic proportions for this word, and others, in the fitted topic model object.
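For example, with the UKmod2 object from your question: the $topics element returned by lda.collapsed.gibbs.sampler() is a K x V matrix of word-to-topic assignment counts, with the vocabulary as column names, so:

## share of each topic's tokens assigned to the (stemmed) term
UKmod2$topics[, "eurosceptic"] / UKmod2$topic_sums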

Ken Benoit
  • Ah, I see. I thought that seeding was a way to 'build' the model around certain keywords by prioritizing them with a higher assigned weight. "Eurosceptic" does appear in the dfm, but it must not be strongly associated with the topics generated. I can find it if I show the top 30 words in a topic, for example. It's a bit odd that, as the word guiding the search criteria, it does not show up high in the topics. That says something in and of itself. Thanks for the reply! – Michael Bossetta Jun 10 '16 at 15:01