
So, I am analyzing a huge corpus of about 40,000 documents using R's tm package. I created a document-term matrix, and it reports 100% sparsity, which I read as meaning there are no common words in this corpus.

library(qdap)
library(SnowballC)
library(dplyr)
library(tm)

docs <- Corpus(DirSource(cname))

docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, removeWords, stopwords("english"))
# docs <- tm_map(docs, stemDocument)
dtm <- DocumentTermMatrix(docs)

<<DocumentTermMatrix (documents: 39373, terms: 108065)>>
Non-/sparse entries: 2981619/4251861626
Sparsity           : 100%
Maximal term length: 93
Weighting          : term frequency (tf)
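For what it's worth, the "100%" in that printout is rounded to a whole percent. Using the non-sparse and sparse entry counts from the summary above, the exact sparsity can be recomputed:

```r
# Sparsity is the fraction of empty cells in the document-term matrix.
# tm rounds it to a whole percent when printing, so "100%" can simply
# mean ">= 99.5%". These counts are taken from the printed summary.
nonsparse <- 2981619       # filled cells
sparse    <- 4251861626    # empty cells
sparsity  <- sparse / (nonsparse + sparse)
round(100 * sparsity, 2)   # 99.93 -- printed as "100%"
```

So the matrix is not literally empty; roughly 0.07% of its cells are filled, which is typical for a large corpus.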

I removed all of the infrequent words and I got this:

dtms <- removeSparseTerms(dtm, 0.1)
dim(dtms)
[1] 39373     0
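For context on that result: `removeSparseTerms(dtm, s)` drops every term that is absent from more than a fraction `s` of documents, so `s = 0.1` keeps only terms present in at least 90% of documents. A minimal toy example (my own illustration, not the question's data) shows the effect of the threshold:

```r
library(tm)

# Toy corpus: "apple" appears in all 3 documents, "banana" in only one.
docs <- Corpus(VectorSource(c("apple banana", "apple", "apple")))
dtm  <- DocumentTermMatrix(docs)

# s = 0.1 keeps only near-universal terms; s = 0.99 is far more permissive.
Terms(removeSparseTerms(dtm, 0.1))   # "apple" only
Terms(removeSparseTerms(dtm, 0.99))  # "apple" and "banana"
```

With 39,373 documents, a term would have to occur in about 35,000 of them to survive `s = 0.1`, which explains the 0-column result.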

Is R acting this way because my corpus is too big?

Update

So I have been doing quite a bit of searching on this issue. It seems to be a parallel-computing problem, though I'm not really sure. But I have stumbled upon these handouts that talk about distributed text mining in R: Link

More Updates

So I guess my question was a duplicate; I found the answer in various places. One was on the Kaggle website for data-science competitions. The other two answers are here on Stack Overflow: link and another link. I hope this helps. There are also great examples of the tm package on the Hands-On Data Science site, and great documentation of text processing in R on Gaston's page as well.

Zaynaib Giwa
    What exactly is unexpected here? What were you trying to do with `removeSparseTerms`? If there are no overlapping terms in documents, each term would have a sparsity of 1/39373 which is much smaller than 0.1 so you're removing them all. – MrFlick Dec 10 '14 at 05:39
  • @MrFlick Forgive me, because I am completely new to the world of text mining. I did not expect the corpus to be 100% sparse. I was expecting terms with frequencies like age|1.97657922 agglomeration|2.57863921 aggregates|2.57863921. – Zaynaib Giwa Dec 10 '14 at 05:47
  • How certain are you that there are overlapping terms in your documents? Are you sure you are creating your corpus correctly? You haven't exactly created a reproducible problem here so we have no idea what's in your data. – MrFlick Dec 10 '14 at 05:49
  • @MrFlick You may be right. I think I might be using the wrong package. I am checking out the [slam package](http://cran.r-project.org/web/packages/slam/slam.pdf), which deals with these sorts of problems. – Zaynaib Giwa Dec 10 '14 at 06:20
  • Can you tell me the total number of words in the corpus? There are 108k unique terms, the total number of words should give a fair estimate of the sparsity. – jackStinger Dec 10 '14 at 07:04
  • @jackStinger I am not able to tell you the total number of words in the corpus; I get an error because of the sparsity. I've been doing some more digging into the problem and it seems like this is a parallel-computing problem. Unfortunately my teacher failed to teach us anything about Hadoop or Hive. – Zaynaib Giwa Dec 10 '14 at 07:18
  • Where and how is your data stored? And, how large is your data? – jackStinger Dec 10 '14 at 10:10
  • @jackStinger My data is 151 MB and it's on my laptop. – Zaynaib Giwa Dec 10 '14 at 15:24
  • You shouldn't ideally need parallel computing for 151 MB of data; I've worked with far larger datasets in-memory in R. Is it possible to share the data? I can take a look at it. – jackStinger Dec 11 '14 at 07:11
  • Can you run `rowSums(as.matrix(dtm))` on your dtm and plot the histogram? – jackStinger Dec 11 '14 at 07:15
  • @jackStinger here is a link to my data [Link](https://www.dropbox.com/s/l9ea74zkkn8pct3/clean1%20%281%29.zip?dl=0) – Zaynaib Giwa Dec 11 '14 at 12:55
  • @jackStinger I fear the only way to do this is with python. – Zaynaib Giwa Dec 12 '14 at 15:06
  • @jackStinger So I almost cracked it. I will put up an updated post right now. – Zaynaib Giwa Dec 13 '14 at 04:28
  • `removeSparseTerms(..., 0.1)` is applying a (pretty large) 10% threshold(!), not a 0.1% threshold. Reduce the threshold until you get a non-empty output, then tell us what threshold value that was. – smci Jul 11 '16 at 21:09
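One way to follow that last suggestion is to sweep the threshold and watch where terms start surviving. This is a sketch on a small made-up stand-in corpus (the 100 documents and 5-word vocabulary are my own invention, standing in for the 39k-document corpus in the question):

```r
library(tm)
set.seed(1)

# Hypothetical stand-in corpus: 100 tiny documents, each drawing 3 words
# from a 5-word vocabulary, so each term appears in roughly 60% of docs.
vocab <- c("alpha", "beta", "gamma", "delta", "epsilon")
texts <- replicate(100, paste(sample(vocab, 3), collapse = " "))
dtm   <- DocumentTermMatrix(Corpus(VectorSource(texts)))

# Sweep the threshold upward until the vocabulary stops collapsing to zero.
for (s in c(0.1, 0.5, 0.9, 0.99)) {
  cat(sprintf("sparse = %.2f -> %d terms kept\n",
              s, ncol(removeSparseTerms(dtm, s))))
}
```

On real data, the smallest `s` that leaves a non-empty matrix tells you how concentrated the shared vocabulary actually is.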

0 Answers