RWeka remove Sparse terms

Question

I am creating trigram and quadgram models using RWeka, and I have noticed some odd behavior. For the trigram model:

TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
tdm <- TermDocumentMatrix(docs, control = list(tokenize = TrigramTokenizer))

> dim(tdm)
[1] 1540099       3

> tdm
<<TermDocumentMatrix (terms: 1540099, documents: 3)>>
 Non-/sparse entries: 1548629/3071668
 Sparsity           : 66%
 Maximal term length: 180
Weighting          : term frequency (tf)

When I remove sparse terms, the ~1.5 million rows above shrink to 8307:

 > b <- removeSparseTerms(tdm, 0.66) 
 > dim(b)
 [1] 8307    3

For the quadgram model, removing sparse terms has no effect at all:

 quadgramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 4, max = 4))
  tdm <- TermDocumentMatrix(docs, control = list(tokenize = QuadgramTokenizer))

 <<TermDocumentMatrix (terms: 1427403, documents: 3)>>
 Non-/sparse entries: 1427936/2854273
 Sparsity           : 67%
 Maximal term length: 185
 Weighting          : term frequency (tf)
> dim(tdm)
[1] 1427403       3
> tdm <- removeSparseTerms(tdm, 0.67)
> dim(tdm)
[1] 1427403       3

The matrix still has ~1.4 million terms after removing sparse terms.

This does not look right.

Please let me know if I am doing something wrong.

Regards, Ganesh

score 0 · Answer 1 · answered Aug 11 '15 at 14:00

This is weird. The expected behaviour is that removing sparse terms removes a lot in both cases, since trigrams and quadgrams are less common than single-word terms. Do you have another QuadgramTokenizer object in your session? Your function is defined with a lowercase "q" (quadgramTokenizer), but the TermDocumentMatrix call uses QuadgramTokenizer. I am wondering why it did not throw an error; perhaps it was treated as empty? I think it must be something as simple as this. Check this first, and if that is not it, I'll run it with a data sample and see what could be wrong here.
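
A minimal base-R sketch of two things worth checking. The first part shows that `quadgramTokenizer` and `QuadgramTokenizer` are different names as far as R is concerned (the function body here is a stand-in, not the RWeka tokenizer). The second part uses a hypothetical helper, `keep_term`, that mimics the rule I believe `removeSparseTerms` applies: a term is kept when it occurs in more than `n_docs * (1 - sparse)` documents. That rule is an assumption from memory of the tm source, so please verify it against your installed version. If it holds, the two calls in the question are not equivalent: with 3 documents, 0.66 requires a term to appear in at least 2 documents, while 0.67 keeps every term.

```r
# Part 1: R names are case-sensitive.
# Stand-in body for illustration only; the real one calls NGramTokenizer().
quadgramTokenizer <- function(x) x

exists("quadgramTokenizer")  # TRUE: this is the function defined above
exists("QuadgramTokenizer")  # TRUE only if an older object with this name
                             # is still in the session; otherwise FALSE

# Part 2: hypothetical helper mimicking the assumed removeSparseTerms() rule.
# doc_freq: number of documents each term occurs in
# n_docs:   total number of documents
# sparse:   the sparsity threshold passed to removeSparseTerms()
keep_term <- function(doc_freq, n_docs, sparse) {
  doc_freq > n_docs * (1 - sparse)
}

doc_freq <- c(1, 2, 3)  # terms found in 1, 2, or 3 of the 3 documents

keep_term(doc_freq, n_docs = 3, sparse = 0.66)  # cutoff is about 1.02
keep_term(doc_freq, n_docs = 3, sparse = 0.67)  # cutoff is about 0.99
```

Under that assumption the first call returns FALSE TRUE TRUE and the second returns TRUE TRUE TRUE, so rerunning the quadgram case with 0.66 instead of 0.67 would be a quick way to see whether the threshold, rather than the tokenizer name, explains the difference.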
