Given three TermDocumentMatrix objects, tdm1, tdm2 and tdm3, I'd like to calculate the word frequency for each of them as a data frame and then rbind all the data frames. These three are only a sample; in reality I have hundreds, so I need to turn this into a function.

It's easy to calculate word freq for one TDM:

apply(x, 1, sum)

or

rowSums(as.matrix(x))

I want to make a list of TDMs:

tdm_list <- Filter(function(x) is(x, "TermDocumentMatrix"), mget(ls()))

and calculate word freq for each and put it in a data frame:

data.frame(lapply(tdm_list, sum)) # this is wrong. it simply sums frequency of all words instead of frequency by each word.

and then rbind it all:

do.call(rbind, df_list)

I can't figure out how to use lapply on a TDM to calculate word frequency.

Adding sample data to play around with:

require(tm)
text1 <- c("apple" , "love", "crazy", "peaches", "cool", "coke", "batman", "joker")
text2 <- c("omg", "#rstats" , "crazy", "cool", "bananas", "functions", "apple")
text3 <- c("Playing", "rstats", "football", "data", "coke", "caffeine", "peaches", "cool")

tdm1 <- TermDocumentMatrix(Corpus(VectorSource(text1)))
tdm2 <- TermDocumentMatrix(Corpus(VectorSource(text2)))
tdm3 <- TermDocumentMatrix(Corpus(VectorSource(text3)))
vagabond
  • Can you show the actual sample list of tdm? – Metrics Mar 18 '15 at 19:43
  • Why not use findFreqTerms from the tm package? – lawyeR Mar 18 '15 at 19:45
  • I have many TDMs - let's say 100 of them. I want to calculate the frequency of each word within each of them and put it in a data frame. Then I want to rbind all the data frames. The resulting data frame will hold the word frequencies for each of the TDMs. – vagabond Mar 18 '15 at 19:49
  • Because you are not using code to offer an example for people to work with. It doesn't need to be 100's of documents, just 3 or 4. My guess, and it is only a guess since I am not a down-voter, is that people are annoyed that you expect them to construct examples for you. Do an SO search on `[r] great reproducible example` if you need advice on how to do this. – IRTFM Mar 18 '15 at 19:56
  • Ok, I can correct that. I'll add some code to construct TDMs using some text. – vagabond Mar 18 '15 at 19:58
  • Seems like `lapply(tdm_list, rowSums)` would work – Rich Scriven Mar 18 '15 at 20:15
  • No, @RichardScriven, I tried this and I get `Error in FUN(X[[1L]], ...) : 'x' must be an array of at least two dimensions` – vagabond Mar 18 '15 at 20:18
  • I was trying to avoid making a list of the TDMs and to loop through them instead, but that doesn't work either. `for (i in c(tdm1, tdm2, tdm3)) { apply(i, 1, sum) }` returns `Error in apply(i, 1, sum) : dim(X) must have a positive length` – vagabond Mar 18 '15 at 21:11

1 Answer

Ok I think I have it and this might actually help someone looking to do the same thing. It was simple in the end.

combineddf <- do.call(rbind, lapply(tdm_list, function (x) {
 data.frame(apply(x, 1, sum))
}))

The code above takes a list of TermDocumentMatrix objects, computes the word counts for each one as a data frame, and rbinds the results into a single data frame.
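One thing to watch for with this approach: because `tdm_list` is a named list, `rbind` stores both the TDM name and the term in the row names (for example `tdm1.apple`). The variant below is only a sketch of one way to keep that information in explicit columns instead. The helper name `term_freq_df` and the column names `source`, `term` and `freq` are made up for illustration, and it assumes the `slam` package (which `tm` depends on) is available so that `row_sums()` can total each term's row without converting the sparse matrix to a dense one.

```r
library(tm)    # TermDocumentMatrix(), Corpus(), VectorSource(), Terms()
library(slam)  # row_sums() for sparse matrices; assumed installed alongside tm

text1 <- c("apple", "love", "crazy", "peaches", "cool", "coke", "batman", "joker")
text2 <- c("omg", "#rstats", "crazy", "cool", "bananas", "functions", "apple")

# Build a named list directly rather than collecting objects from the workspace
tdm_list <- list(
  tdm1 = TermDocumentMatrix(Corpus(VectorSource(text1))),
  tdm2 = TermDocumentMatrix(Corpus(VectorSource(text2)))
)

# Hypothetical helper: one long-format data frame per TDM
term_freq_df <- function(tdm, source) {
  freq <- slam::row_sums(tdm)  # per-term totals across all documents
  data.frame(
    source = source,           # which TDM the row came from
    term   = Terms(tdm),       # terms, in the same order as the matrix rows
    freq   = as.vector(freq),
    stringsAsFactors = FALSE
  )
}

# Map() pairs each TDM with its list name; rbind stacks the pieces
combined <- do.call(rbind, Map(term_freq_df, tdm_list, names(tdm_list)))
rownames(combined) <- NULL     # drop the "tdm1.1"-style row names
head(combined)
```

With the source kept as a column, the result can be filtered or aggregated by TDM (for instance with `aggregate(freq ~ term, combined, sum)`) without parsing row names.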

vagabond