For a research project, I have read PDF documents into R and created a corpus and a TermDocumentMatrix. I want to check the frequency of specific words in each document in my corpus. The code below gives me the kind of matrix I want, with word frequencies by document, but it only covers high-frequency terms, not terms that I choose.

ft <- findFreqTerms(opinions.tdm, lowfreq = 100, highfreq = Inf)
as.matrix(opinions.tdm[ft, ])

I found the code below in another answer. It lets me look up the frequency of specific terms, but it sums the counts across all documents. How do I adapt it so that I get the counts for specific terms within each document rather than across the whole corpus?

library(tm)
data("crude")
crude <- as.VCorpus(crude)
crude <- tm_map(crude, stripWhitespace)
crude <- tm_map(crude, removePunctuation)
crude <- tm_map(crude, content_transformer(tolower))
crude <- tm_map(crude, removeWords, stopwords("english"))


tdm <- TermDocumentMatrix(crude)

# turn tdm into a dense matrix and create a frequency vector
freq <- rowSums(as.matrix(tdm))
freq["crude"]
# crude 
#    21 
freq["oil"]
# oil 
#  85 

1 Answer

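Skip the rowSums step and index the matrix directly. Each row is a term and each column is a document, so selecting a row by name returns that term's count in every document.

term_matrix <- as.matrix(tdm)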
term_matrix["crude",]
# 127 144 191 194 211 236 237 242 246 248 273 349 352 353 368 489 
#   2   0   2   3   0   2   0   0   0   0   5   2   0   2   0   0 
# 502 543 704 708 
#   0   2   0   1 
term_matrix["oil",]
# 127 144 191 194 211 236 237 242 246 248 273 349 352 353 368 489 
#   5  12   2   1   1   7   3   3   5   9   5   4   5   4   3   4 
# 502 543 704 708 
#   5   3   3   1 
MrFlick
  • Thank you @MrFlick, that is very helpful! Is there any way I can search for the separate terms at the same time? The previous example used the command `freq[c("crude", "oil")]`, which returned `crude` 21 and `oil` 85. That is what I want merged with what you have provided above. – Sarah R Hall Jul 08 '20 at 22:10
  • You can use `term_matrix[c("crude", "oil"),]` (note the trailing comma, which selects rows and keeps every column). That will return the counts for each word in each document. To combine the two words into a single count per document, you could do `colSums(term_matrix[c("crude", "oil"),])` – MrFlick Jul 09 '20 at 02:58
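
The multi-term lookup from the comments can be put together roughly as follows. This is a minimal sketch, not part of the original answer. It reuses the `crude` preprocessing from the question and assumes the `tm` package is installed. The names `wanted`, `present`, and `counts` are illustrative, and `"notaterm"` is a made-up word included only to show what happens when a requested term is not in the matrix. It also subsets the sparse matrix before calling `as.matrix`, which is one way to avoid densifying the whole matrix on a large corpus.

```r
library(tm)

# Same preprocessing as in the question, using the "crude" sample corpus from tm.
data("crude")
crude <- as.VCorpus(crude)
crude <- tm_map(crude, stripWhitespace)
crude <- tm_map(crude, removePunctuation)
crude <- tm_map(crude, content_transformer(tolower))
crude <- tm_map(crude, removeWords, stopwords("english"))
tdm <- TermDocumentMatrix(crude)

# Terms of interest. "notaterm" is hypothetical and only demonstrates the guard below.
wanted <- c("crude", "oil", "notaterm")

# Keep only the terms that actually occur in the matrix, so that indexing
# by name does not fail on a term that is absent.
present <- intersect(wanted, Terms(tdm))

# Subset the sparse matrix first, then convert only that small slice to dense.
counts <- as.matrix(tdm[present, ])

counts            # one row per term, one column per document
rowSums(counts)   # total per term across documents (what freq[...] returned)
colSums(counts)   # combined count of the selected terms in each document
```

If a term you expect is missing from `present`, it may have been removed or altered by the preprocessing steps (for example by `removeWords` or `removePunctuation`), so it is worth checking `Terms(tdm)` for the form it takes after cleaning.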