2

I have a group of text files from several journals (let's call them journal A and journal B) that I am trying to run LDA on. I divide them each into their own corpus, then attach the names of the files to each corpus, store the journal of origin under the origin label, and finally, combine the two corpuses into myCorpus:

library(tm); library(topicmodels);

txtfolder <- "~/Path/to/txtfiles/"
source <- DirSource(txtfolder)
A.names <- list.files(path=txtfolder, pattern="A")
B.names <- list.files(path=txtfolder, pattern="B")
A.names <- lapply(X=A.names, FUN=function(i){gsub(".txt", '', x=i)})
B.names <- lapply(X=B.names, FUN=function(i){gsub(".txt", '', x=i)})
A.corpus <- Corpus(A.source, readerControl=list(reader=readPlain))
for (i in 1:length(A.corpus)){
  meta(A.corpus[[i]], tag = "origin") <- "A"
}
B.corpus <- Corpus(B.source, readerControl=list(reader=readPlain))
for (i in 1:length(B.corpus)){
  meta(B.corpus[[i]], tag = "origin") <- "B"
}
myCorpus <- c(A.corpus, B.corpus) # combining the two corpuses

From here I run LDA on myCorpus:

myCorpus <- tm_map(myCorpus, PlainTextDocument)
dtm <- DocumentTermMatrix(myCorpus, control = list(minWordLength=3))
n.topics <- 5
lda.model <- LDA(dtm, n.topics)
terms(lda.model,10)

From here I would like to create a plot measuring the proportion of each journal that is due to a particular topic over time (I can identify the time that each issue of the journals was published by parsing the txt files, and store them in a vector similarly to how I did with the origin tag). I'm not sure how best to store this information so that I can use date published as the horizontal axis. More importantly, how can I create the graph that I mentioned?

mlinegar
  • 1,389
  • 1
  • 11
  • 19

1 Answers1

1

I assume you want to use ggplot since you added this tag. You will first need to gather your data in a data frame.

Assuming lda is the output of LDA() and corpus is the corpus, you'll find the topics with topics(lda) and the various informations on the documents using meta(corpus). You might need to adjust this to your corpus :

df <- data.frame(id=names(topics(lda)),
                 topic=topics(lda),
                 date=as.POSIXct(unlist(lapply(meta(corpus,type="local",tag="datetimestamp"),as.character))),
                 origin=unlist(meta(corpus,type="local",tag="origin"))    )

Then, you need to compute the stats you want to plot : the frequency of each topic by date and by origin, and

library(dplyr)
library(tidyr)
M <- df %>% gather(key,value,topic) %>%
  group_by(date,origin,value) %>%
  summarize(n=n()) %>%
  mutate(f=n/sum(n))

Finally, in order to plot it :

library(ggplot2)

ggplot(data=M,aes(x=date,fill=factor(value),y=f)) + 
  geom_bar(stat="identity",position="stack") +
  facet_grid(~origin)

Here is what it gives with simulated data

set.seed(100)
df <- data.frame(date=sample(seq.Date(as.Date("2015-07-27 10:12:25"),as.Date("2015-07-31 10:12:25"),by=1),100,replace=TRUE),
       id=1:100,
       topic=sample(1:5,100,replace=TRUE),
       origin=sample(c("A","B"),100,replace=TRUE))

enter image description here

scoa
  • 19,359
  • 5
  • 65
  • 80
  • This is great @scoa ! There's just one thing I haven't been able to figure out - when I use `topics(lda)`, I get only a single topic for each issue of A and B - how can I get that topic makeup? Also, how could I measure the proportion of each journal due to a single topic (in a line graph probably)? I'm mostly just unsure how to isolate a single topic. – mlinegar Jul 28 '15 at 00:26
  • 1
    @mlinegar you can access those with `posterior(lda)$topics`. `cbind(df,posterior(lda)$topics)` should work. Your second question is too much for the comments. Could you edit your question or ask a new one with the output you want? – scoa Jul 28 '15 at 07:35
  • you're definitely right that I need to revisit the second question I posted, but I've gotten stuck working with the dataframes on the way there. With `df` defined as you did, I then use `td <- posterior(lda)$topics` and `dfr <- cbind(df,td)` which gives me `Warning message: In data.row.names(row.names, rowsi, i) : some row.names duplicated: 2,3, ...`. If I ignore the error and graph anyway, I get the same problem I did earlier. Again, thank you so much for your help! – mlinegar Jul 28 '15 at 21:39