So, I am analyzing a huge corpus of about 40,000 documents using R's tm package. I created a document-term matrix that reports 100% sparsity, which I took to mean that there are no common words in this corpus.
library(qdap)
library(SnowballC)
library(dplyr)
library(tm)
# build the corpus from a directory of text files, then clean it up
docs <- Corpus(DirSource(cname))
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, content_transformer(removeNumbers))
docs <- tm_map(docs, content_transformer(removePunctuation))
docs <- tm_map(docs, removeWords, stopwords("english"))
#docs <- tm_map(docs, stemDocument)
dtm <- DocumentTermMatrix(docs)
<<DocumentTermMatrix (documents: 39373, terms: 108065)>>
Non-/sparse entries: 2981619/4251861626
Sparsity           : 100%
Maximal term length: 93
Weighting          : term frequency (tf)
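As a sanity check on that number (just back-of-the-envelope arithmetic from the counts printed above), the matrix is actually about 99.93% sparse; tm simply rounds it to 100% when printing:

# sparsity = zero cells / total cells, taken from the printout above
sparse_cells <- 4251861626
total_cells  <- sparse_cells + 2981619   # 39373 documents x 108065 terms
sparse_cells / total_cells               # ~0.9993, printed as "100%"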
I then removed all of the infrequent terms and got this:
dtms <- removeSparseTerms(dtm, 0.1)
dim(dtms)
[1] 39373 0
Is R acting this way because my corpus is too big?
Update
So I have been doing quite a bit of searching on this issue. It seems to be a parallel computing problem, but I'm not really sure. I did stumble upon these handouts that talk about distributed text mining in R: Link
More Updates
So I guess my question was a duplicate; I found the answer in various places. One was on the Kaggle website for data science competitions. The other two answers are here on Stack Overflow: link and another link. I hope this helps. There are also great examples of the tm package on the Hands-On Data Science site and great documentation of text processing in R on Gaston's page as well.
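The short version, as I now understand it, is that the sparse argument of removeSparseTerms is the maximum sparsity a term may have and still be kept, so my value of 0.1 only kept terms that appear in more than ~90% of the 39,373 documents, and nothing survived. A much looser cut is just a sketch here; the exact threshold still needs tuning for your corpus:

# removeSparseTerms keeps a term only if it appears in more than
# (1 - sparse) of the documents, so sparse = 0.1 demands ~90% coverage.
dtms <- removeSparseTerms(dtm, 0.99)   # keep terms in > 1% of documents
dim(dtms)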