
I'm currently using the tm package to extract terms to cluster on for duplicate detection in a decently sized database of 25k items (30 MB). This runs fine on my desktop, but when I try to run it on my server it seems to take an ungodly amount of time. On closer inspection I found that I had blown through 4 GB of swap running the line apply(posts.TmDoc, 1, sum) to calculate the term frequencies. Furthermore, even as.matrix generates an object of about 3 GB on my desktop; see https://i.stack.imgur.com/yCqVf.jpg

Is this really necessary just to generate a frequency count for 18k terms across 25k items? Is there any other way to generate the frequency count without coercing the TermDocumentMatrix to a matrix or a vector?

I cannot remove terms based on sparseness, as that's how the actual algorithm is implemented: it looks for terms that are common to at least 2 but not more than 50 posts, groups on them, and calculates a similarity value for each group.

Here is the code in context for reference:

min_word_length = 5
max_word_length = Inf
max_term_occurance = 50
min_term_occurance = 2


# Get All The Posts
Posts = db.getAllPosts()
posts.corpus = Corpus(VectorSource(Posts[,"provider_title"]))

# remove things we don't want
posts.corpus = tm_map(posts.corpus,content_transformer(tolower))
posts.corpus = tm_map(posts.corpus, removePunctuation)
posts.corpus = tm_map(posts.corpus, removeNumbers)
posts.corpus = tm_map(posts.corpus, removeWords, stopwords('english'))

# keep only words of at least min_word_length (5) characters
posts.TmDoc = TermDocumentMatrix(posts.corpus, control=list(wordLengths=c(min_word_length, max_word_length)))

# get the terms that occur at least twice, but fewer than 50 times
clustterms = names(which(apply(posts.TmDoc, 1, sum) >= min_term_occurance &
                         apply(posts.TmDoc, 1, sum) < max_term_occurance))
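For context, the grouping step described above might look roughly like the following. This is purely an illustrative sketch, not the production code: it assumes slam's internal i/j/v triplet representation of the TermDocumentMatrix (i = term index, j = document index) and omits the per-group similarity calculation.

# illustrative only: for each clustering term, collect the ids of the
# posts containing it, reading straight off the sparse triplet slots
term_idx = match(clustterms, Terms(posts.TmDoc))
groups = lapply(term_idx, function(idx) posts.TmDoc$j[posts.TmDoc$i == idx])
names(groups) = clustterms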
Matt Bucci
  • Do the math: `18e3 * 25e3 * 8 / 1024^3` gives 3.3GB. So yes, this is the memory consumption of a matrix. Use sparse matrices instead. – Andrie Dec 08 '14 at 10:32
  • Your question is similar to http://stackoverflow.com/questions/14426925/frequency-per-term-r-tm-documenttermmatrix – Andrie Dec 08 '14 at 10:34
  • @Andrie, the sparse approach seems to still need conversion to a regular matrix before actually generating the sparse matrix, unfortunately. Once converted it's about 800KB, but until that conversion it's all in memory. I'm going to try the line-by-line approach from your second link, using inspect to extract one row at a time and storing the result of rowSums into a named list (a sketch of this sparse idea follows these comments). – Matt Bucci Dec 08 '14 at 10:54
  • @Andrie, both previous answers actually ended up not being the most efficient/elegant solution. I've submitted an answer. – Matt Bucci Dec 08 '14 at 11:08
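As a minimal sketch of the sparse row-sums idea from the comments, assuming the slam package (which tm builds on; a TermDocumentMatrix is stored as a slam simple_triplet_matrix), the counts can be computed without ever densifying:

library(slam)

# per-term totals computed directly on the sparse representation;
# no 3GB dense matrix is ever materialised
term_counts = row_sums(posts.TmDoc)

clustterms = names(which(term_counts >= min_term_occurance &
                         term_counts < max_term_occurance))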

1 Answer


Because I never actually need the frequency counts themselves, I can use the findFreqTerms command:

setdiff(findFreqTerms(posts.TmDoc, 2), findFreqTerms(posts.TmDoc, 50))

is the same as

names(which(apply(posts.TmDoc, 1, sum) >= min_term_occurance  & apply(posts.TmDoc, 1, sum) < max_term_occurance))

but runs nearly instantaneously.
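If findFreqTerms's lowfreq/highfreq arguments (both bounds inclusive) are used together, the setdiff can likely be folded into a single call, something like:

findFreqTerms(posts.TmDoc, min_term_occurance, max_term_occurance - 1)

Here max_term_occurance - 1 accounts for highfreq being inclusive, where the original condition was a strict < max_term_occurance.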

Matt Bucci