I have a dataset of news articles collected on the criterion that they use the term "euroscepticism" or "eurosceptic". I have been running topic models using the lda package (with dfm matrices built in quanteda) in order to identify the main topics of these articles; however, the words I am interested in do not appear in any of the topics. I therefore want to seed these words into the model, and I am not sure exactly how to do that.
I see that the package topicmodels allows for an argument called seedwords, which "can be specified as a matrix or an object class of simple_triplet_matrix", but there are no other instructions. It seems that a simple_triplet_matrix only takes integers, and not strings. Does anyone know how I would then seed the words 'euroscepticism' and 'eurosceptic' into the model?
Here is a shortened version of the code:
library("quanteda")
library("lda")
##Load UK texts/create corpus
UKcorp <- corpus(textfile(file="~Michael/DM6/*"))
##Create document feature matrix
UKdfm2 <- dfm(UKcorp, ngrams = 1, verbose = TRUE, toLower = TRUE,
              removeNumbers = TRUE, removePunct = TRUE, removeSeparators = TRUE,
              removeTwitter = FALSE, stem = TRUE,
              ignoredFeatures = stopwords(kind = "english"), keptFeatures = NULL,
              language = "english", thesaurus = NULL, dictionary = NULL,
              valuetype = "fixed")
##Convert to lda model
UKlda2 <- convert(UKdfm2, to = "lda")
##run model
UKmod2 <- lda.collapsed.gibbs.sampler(UKlda2$documents, K = 15, UKlda2$vocab,
                                      num.iterations = 1500, alpha = .1, eta = .01,
                                      initial = NULL, burnin = NULL,
                                      compute.log.likelihood = TRUE, trace = 0L,
                                      freeze.topics = FALSE)
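
On the seeding question, my current guess is that the integers in a simple_triplet_matrix are row and column positions rather than the words themselves, so the seed terms would first have to be mapped to their column indices in the vocabulary. Below is a minimal sketch of that idea using the slam package. The toy vocabulary, the topic count, and the weight of 100 are made up for illustration, and I have not confirmed that this is how seedwords is meant to be used.

```r
library(slam)  # provides simple_triplet_matrix()

# Hypothetical toy vocabulary; in practice this would be the column names
# of the document-term matrix handed to topicmodels.
vocab <- c("brexit", "eurosceptic", "euroscepticism", "referendum", "ukip")
k <- 3  # number of topics (toy value)

# Map the seed terms (strings) to their integer column positions.
seed_terms <- c("eurosceptic", "euroscepticism")
j <- match(seed_terms, vocab)   # column index of each seed term
i <- rep(1L, length(j))         # row index = topic to seed (topic 1 here)
seed_weight <- 100              # assumed weight; would need tuning

# k x length(vocab) sparse matrix, non-zero only at the seeded cells.
seeds <- simple_triplet_matrix(i = i, j = j,
                               v = rep(seed_weight, length(j)),
                               nrow = k, ncol = length(vocab))

# Assumed usage (not run here; dtm is a placeholder for the real matrix):
# topicmodels::LDA(dtm, k = k, method = "Gibbs", seedwords = seeds)
```

Because the dfm above is built with stem = TRUE, the seed terms would presumably have to be given in their stemmed form, otherwise match() returns NA for them.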