I've recently been trying to find word frequencies within a single column of a data.frame in R using the tm package. The data.frame has many columns, both numeric and character, but I'm only interested in one column that is pure text. Cleaning up the text works fine, but as soon as I try to pull the word frequencies with findFreqTerms(), I get the following error:

Error: inherits(x, c("DocumentTermMatrix", "TermDocumentMatrix")) is not TRUE

I took this to mean that I need to convert my data into either a DocumentTermMatrix or a TermDocumentMatrix. However, since I'm only working with a single column, I can't create either one. The error is below:

> Test <- DocumentTermMatrix(Types)
Error in UseMethod("TermDocumentMatrix", x) : 
  no applicable method for 'TermDocumentMatrix' applied to an object of class "c('PlainTextDocument', 'TextDocument')"

Is there any way to get a frequency count from the single column? I've pasted my full code below with explanations for each step I took. I appreciate any help you can give me.

> library(tm)
> # extract the single column I wish to analyse from the data frame
> Types <- Expenses$Types
> # convert everything to lower case
> Types <- tolower(Types)
> # remove punctuation
> Types <- removePunctuation(Types)
> # remove numbers
> Types <- removeNumbers(Types)
> # attempt to find word frequency
> findFreqTerms(Types)
Error: inherits(x, c("DocumentTermMatrix", "TermDocumentMatrix")) is not TRUE
lawyeR
Aenderung

2 Answers

You can find the frequency of terms directly from your text variable if you use the qdap package:

library(tm)   # provides removePunctuation() and removeNumbers()
library(qdap) # provides freq_terms()
a <- c("hello man", "how's it going", "just fine", "really fine", "man o man!")
a <- tolower(a)
a <- removePunctuation(a)
a <- removeNumbers(a)
freq_terms(a) # there are several additional arguments
  WORD   FREQ
1 man       3
2 fine      2
3 going     1
4 hello     1
5 hows      1
6 it        1
7 just      1
8 o         1
9 really    1
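
If installing qdap is not an option, roughly the same counts can be sketched in base R. This is only an illustration on the same toy vector, not what freq_terms() does internally; the regular expressions and the names words and freq are my own choices, and splitting on whitespace is an assumption about what counts as a word.

```r
# Base-R sketch: word frequencies from a character vector, no packages needed
a <- c("hello man", "how's it going", "just fine", "really fine", "man o man!")
a <- tolower(a)
a <- gsub("[[:punct:]]+", "", a)  # strip punctuation
a <- gsub("[[:digit:]]+", "", a)  # strip digits

# Split each string on runs of whitespace and flatten into one vector of words
words <- unlist(strsplit(a, "\\s+"))
words <- words[nzchar(words)]     # drop any empty strings left by the cleaning

# Tabulate the words and sort from most to least frequent
freq <- sort(table(words), decreasing = TRUE)
print(freq)
```

For a data.frame column, replace the toy vector with the column itself, e.g. a <- Expenses$Types.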
lawyeR

You need to build a corpus and a term-document matrix first, since findFreqTerms() only accepts a DocumentTermMatrix or TermDocumentMatrix:

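library(tm)
a <- c("hello man", "how's it going", "just fine")
# clean the raw character vector
a <- tolower(a)
a <- removePunctuation(a)
a <- removeNumbers(a)
# wrap the vector in a corpus: each element becomes one document
myCorpus <- Corpus(VectorSource(a))
# build the term-document matrix that findFreqTerms() expects
myTDM <- TermDocumentMatrix(myCorpus)
findFreqTerms(myTDM)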
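
Note that findFreqTerms() returns only the terms, not their counts. To get the counts for a single data.frame column, one option is to sum the rows of the term-document matrix. The sketch below assumes a made-up Expenses data frame standing in for the one in the question (its contents are purely hypothetical), and it passes wordLengths = c(1, Inf) because, as far as I know, tm drops terms shorter than three characters by default.

```r
library(tm)

# Hypothetical stand-in for the asker's data frame; only the text column matters
Expenses <- data.frame(
  Amount = c(10, 25, 40),
  Types  = c("Travel: taxi", "Office supplies", "Travel, hotel 2 nights"),
  stringsAsFactors = FALSE
)

# Clean the single text column
Types <- removeNumbers(removePunctuation(tolower(Expenses$Types)))

# Each row of the column becomes one document in the corpus
myCorpus <- Corpus(VectorSource(Types))

# wordLengths = c(1, Inf) keeps short words instead of dropping them
myTDM <- TermDocumentMatrix(myCorpus, control = list(wordLengths = c(1, Inf)))

# Sum each term's count across all documents and sort, most frequent first
termFreq <- sort(rowSums(as.matrix(myTDM)), decreasing = TRUE)
print(termFreq)

# Terms that occur at least twice in the whole column
print(findFreqTerms(myTDM, lowfreq = 2))
```

as.matrix() builds a dense matrix, so for a very large column this may use a lot of memory; it is fine for small to medium data.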
cory