
I built a corpus in R using the tm package. I want to change the frequency bounds and keep only the words that occur at least 4 times across the entire corpus. After that, I need to build a document-term matrix from these terms.

'Data' is a 45k by 2 matrix. The first column, 'Text', contains about 10 words per row on average. The second column, 'Code', contains a 5-digit code for each row.

Almost 15k words in 'Text' occur only once or twice. I want to remove them and then build the document-term matrix.

Here is the code I tried:

MyCorpus <- Corpus(VectorSource(Data$Text))
MyCorpus <- tm_map(MyCorpus , removeWords, stopwords('english'))
MyCorpus  <- tm_map(MyCorpus , stripWhitespace)
MyCorpus  <- termFreq(MyCorpus  , control = list(local = c(4, Inf)))

But I got this error on line 4:

Error: inherits(doc, "TextDocument") is not TRUE

What should I do?

user36729
  • It is expected that you provide sample data with your posts. We don't have access to `Data` and can be of little help. Reading this will get you started with asking questions in ways that will get quality responses: http://stackoverflow.com/help/how-to-ask – Tyler Rinker May 24 '15 at 01:24
  • @TylerRinker I explained Data's structure. I hope it's helpful. – user36729 May 24 '15 at 01:41
  • Using @TylerRinker 's package, qdap, have you tried something like freq_terms(Data$Text, top = 20, at.least = 4, stopwords = Top200Words) # works with a text vector – lawyeR May 24 '15 at 02:19
  • I downvoted as you did not appear to read the instructions above. A description of data is minimally helpful; worse, it puts the burden of making the problem reproducible on the people providing assistance (as Mr. Flick did below). Not knowing is excusable, but now that you know, you do not have an excuse not to take the time to make a minimal reproducible example: http://stackoverflow.com/help/mcve – Tyler Rinker May 24 '15 at 04:20

1 Answer


termFreq is meant to be used on a single document, not a corpus. If you want to filter on frequency when building your DocumentTermMatrix, pass the bounds in the control argument of the DocumentTermMatrix function:

DTM  <- DocumentTermMatrix(MyCorpus  , control = list(bounds=list(global = c(4, Inf))))
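To illustrate the distinction, here is a minimal sketch of termFreq applied to one document of a corpus rather than to the corpus itself. The sample text and the variable names (`corpus`, `doc`, `tf`) are made up for illustration, and the sketch assumes a tm version in which the bounds are passed as `bounds = list(local = ...)` inside `control`.

```r
library(tm)

# Hypothetical two-document corpus, for illustration only
corpus <- VCorpus(VectorSource(c("aaa bbb aaa ddd", "bbb aaa aaa bbb ccc")))

# termFreq() works on a single document, so index into the corpus with [[ ]]
doc <- corpus[[1]]

# Keep only the terms that occur at least twice within this one document
tf <- termFreq(doc, control = list(bounds = list(local = c(2, Inf))))
print(tf)
```

Passing the whole corpus instead of `corpus[[1]]` is the likely cause of the inherits(doc, "TextDocument") error in the question.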

Here's an example...

library(tm)

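# Small sample data; keep Text as character rather than factor
Data <- data.frame(Text = c("aaa bbb aaa ddd", "bbb aaa aaa bbb ccc", "bbb aaa aaa bbb ddd", "aaa bbb ddd"),
                   stringsAsFactors = FALSE)

# Build and clean the corpus
MyCorpus <- Corpus(VectorSource(Data$Text))
MyCorpus <- tm_map(MyCorpus, removeWords, stopwords('english'))
MyCorpus <- tm_map(MyCorpus, stripWhitespace)
# Keep only terms that appear in at least 2 documents
DTM <- DocumentTermMatrix(MyCorpus, control = list(bounds = list(global = c(2, Inf))))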

inspect(DTM)

# <<DocumentTermMatrix (documents: 4, terms: 3)>>
# Non-/sparse entries: 11/1
# Sparsity           : 8%
# Maximal term length: 3
# Weighting          : term frequency (tf)

#     Terms
# Docs aaa bbb ddd
#    1   2   1   1
#    2   2   2   0
#    3   2   2   1
#    4   1   1   1

Here we used the global bounds to make sure we only keep words that appear in at least two documents. Note that global bounds count documents, not total occurrences. You can also set a local bound to require words to appear a certain number of times within each document.
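Since the question asks for words that occur at least 4 times in the whole corpus, and global bounds count documents rather than occurrences, one possible approach is to build the full matrix first and then drop columns by their corpus-wide totals. This is only a sketch under assumptions: the sample texts and the names `total_counts`, `min_total`, and `dtm_filtered` are hypothetical, and it relies on `col_sums` from the slam package, which tm depends on.

```r
library(tm)
library(slam)  # provides col_sums() for sparse matrices

# Hypothetical sample texts, for illustration only
texts <- c("aaa bbb aaa ddd", "bbb aaa aaa bbb ccc",
           "bbb aaa aaa bbb ddd", "aaa bbb ddd")
corpus <- VCorpus(VectorSource(texts))

# Build the unfiltered document-term matrix
dtm <- DocumentTermMatrix(corpus)

# Total number of occurrences of each term across all documents
total_counts <- col_sums(dtm)

# Keep only the terms whose corpus-wide count reaches the threshold
min_total <- 4
dtm_filtered <- dtm[, total_counts >= min_total]

print(Terms(dtm_filtered))
```

With these sample texts, "ddd" appears in three documents but only three times in total, so it passes a global bound of 2 yet is dropped by the total-count filter. Using `col_sums` avoids converting the sparse matrix to a dense one, which may matter for a 45k-row corpus.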

MrFlick