I have a group of text files from several journals (let's call them journal A and journal B) that I am trying to run LDA on. I divide them each into their own corpus, then attach the names of the files to each corpus, store the journal of origin under the origin
label, and finally, combine the two corpuses into myCorpus
:
library(tm); library(topicmodels);
txtfolder <- "~/Path/to/txtfiles/"
source <- DirSource(txtfolder)
A.names <- list.files(path=txtfolder, pattern="A")
B.names <- list.files(path=txtfolder, pattern="B")
A.names <- lapply(X=A.names, FUN=function(i){gsub(".txt", '', x=i)})
B.names <- lapply(X=B.names, FUN=function(i){gsub(".txt", '', x=i)})
A.corpus <- Corpus(A.source, readerControl=list(reader=readPlain))
for (i in 1:length(A.corpus)){
meta(A.corpus[[i]], tag = "origin") <- "A"
}
B.corpus <- Corpus(B.source, readerControl=list(reader=readPlain))
for (i in 1:length(B.corpus)){
meta(B.corpus[[i]], tag = "origin") <- "B"
}
myCorpus <- c(A.corpus, B.corpus) # combining the two corpuses
From here I run LDA on myCorpus
:
myCorpus <- tm_map(myCorpus, PlainTextDocument)
dtm <- DocumentTermMatrix(myCorpus, control = list(minWordLength=3))
n.topics <- 5
lda.model <- LDA(dtm, n.topics)
terms(lda.model,10)
From here I would like to create a plot measuring the proportion of each journal that is due to a particular topic over time (I can identify the time that each issue of the journals was published by parsing the txt files, and store them in a vector similarly to how I did with the origin
tag). I'm not sure how best to store this information so that I can use date published as the horizontal axis. More importantly, how can I create the graph that I mentioned?