I appreciate Ben's answer to this question: LDA with topicmodels, how can I see which topics different documents belong to?
My question is: How do I preserve the document titles in the last step? For example:
Manually create three .txt documents as separate text files and store them in the directory ~/Desktop/nature_corpus
First document title: nature.txt
First document content: noun the natural world, Mother Nature, Mother Earth, the environment; wildlife, flora and fauna, the countryside; the universe, the cosmos.
Second document title: conservation.txt
Second document content: noun the conservation of tropical forests: preservation, protection, safeguarding, safekeeping; care, guardianship, husbandry, supervision; upkeep, maintenance, repair, restoration; ecology, environmentalism.
Third document title: bird.txt
Third document content: noun feeding the birds: fowl; chick, fledgling, nestling; informal feathered friend, birdie; budgie; (birds) technical avifauna.
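For anyone who wants to reproduce this without creating the files by hand, the setup above could be scripted roughly as follows. This is only a sketch: `corpus_dir` is a hypothetical name, and it points at a temporary folder rather than the Desktop path used in the question.

```r
# Hypothetical location: a temporary folder stands in for ~/Desktop/nature_corpus
corpus_dir <- file.path(tempdir(), "nature_corpus")
dir.create(corpus_dir, showWarnings = FALSE, recursive = TRUE)

# File names and contents copied from the description above
docs <- c(
  "nature.txt" = "noun the natural world, Mother Nature, Mother Earth, the environment; wildlife, flora and fauna, the countryside; the universe, the cosmos.",
  "conservation.txt" = "noun the conservation of tropical forests: preservation, protection, safeguarding, safekeeping; care, guardianship, husbandry, supervision; upkeep, maintenance, repair, restoration; ecology, environmentalism.",
  "bird.txt" = "noun feeding the birds: fowl; chick, fledgling, nestling; informal feathered friend, birdie; budgie; (birds) technical avifauna."
)

# Write one text file per document, named after the vector element
for (file_name in names(docs)) {
  writeLines(docs[[file_name]], file.path(corpus_dir, file_name))
}

# List the files that were written
print(list.files(corpus_dir))
```

To use it with the code below, `corpus_dir` would be passed to `DirSource()` in place of the Desktop path.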
#install.packages("tm")
#install.packages("topicmodels")
library(tm)
# Create DTM
# The file path below is a Mac file path.
corpus_nature_1 <- Corpus(DirSource("/Users/[home folder name]/Desktop/nature_corpus"),readerControl=list(reader=readPlain,language="en US"))
corpus_nature_2 <- tm_map(corpus_nature_1,removeNumbers)
corpus_nature_3 <- tm_map(corpus_nature_2,content_transformer(tolower))
mystopwords <- c(stopwords(),"noun", "verb")
corpus_nature_4 <- tm_map(corpus_nature_3,removeWords, mystopwords)
corpus_nature_5 <- tm_map(corpus_nature_4,removePunctuation)
corpus_nature_6 <- tm_map(corpus_nature_5,stripWhitespace)
dtm_nature_1 <- DocumentTermMatrix(corpus_nature_6)
inspect(dtm_nature_1)
<<DocumentTermMatrix (documents: 3, terms: 42)>>
Non-/sparse entries: 42/84
Sparsity : 67%
Maximal term length: 16
Weighting : term frequency (tf)
Sample :
                  Terms
Docs               avifauna birdie birds budgie chick feathered feeding fledgling fowl mother
  bird.txt                1      1     2      1     1         1       1         1    1      0
  conservation.txt        0      0     0      0     0         0       0         0    0      0
  nature.txt              0      0     0      0     0         0       0         0    0      2
The topic model run with topicmodels:
# Run topic model 2 topics
library(topicmodels)
topicmodels_LDA_nature_2 <- LDA(dtm_nature_1,2,method="Gibbs",control=list(seed=1),model=NULL)
terms(topicmodels_LDA_nature_2,3)
     Topic 1  Topic 2
[1,] "birds"  "avifauna"
[2,] "mother" "birdie"
[3,] "chick"  "budgie"
How can I retain the document titles (visible in the output of inspect(dtm_nature_1)) in the table below?
# Create CSV Matrix 2 topics
matrix_nature_2 <- as.data.frame(topicmodels_LDA_nature_2@gamma)
names(matrix_nature_2) <- c(1:2)
write.csv(matrix_nature_2,"matrix_nature_2.csv")
# Rows in this table are documents, columns are topics.
  1           2
1 0.46875     0.53125
2 0.52238806  0.47761194
3 0.555555556 0.444444444
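To make the desired result concrete, here is a minimal base-R sketch of the table I am after, with the document titles as row names. Everything in it is a stand-in: `gamma_values` copies the numbers printed above in place of `topicmodels_LDA_nature_2@gamma`, and `doc_names` is typed in by hand. In the real session the names would presumably come from something like `rownames(dtm_nature_1)` or the fitted model's `documents` slot, but that is an assumption I have not verified, as is the assumption that the rows of the gamma matrix follow the row order of the DTM.

```r
# Stand-in for topicmodels_LDA_nature_2@gamma, using the values printed above
gamma_values <- matrix(
  c(0.46875,     0.53125,
    0.52238806,  0.47761194,
    0.555555556, 0.444444444),
  ncol = 2, byrow = TRUE
)

# Stand-in for the document titles, typed by hand.
# Assumption: gamma rows are in the same order as the DTM rows shown by inspect().
doc_names <- c("bird.txt", "conservation.txt", "nature.txt")

matrix_nature_2 <- as.data.frame(gamma_values)
names(matrix_nature_2) <- paste0("topic_", 1:2)  # hypothetical column labels
rownames(matrix_nature_2) <- doc_names           # attach the titles as row names

# write.csv keeps row names by default, so the titles end up in the first column
out_file <- file.path(tempdir(), "matrix_nature_2.csv")
write.csv(matrix_nature_2, out_file)

print(matrix_nature_2)
```

The temporary output path is only there so the sketch does not write into the working directory.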
Thanks.