8

I have the following two DTMs:

dtm <- DocumentTermMatrix(t)

dtmImproved <- DocumentTermMatrix(t, 
               control=list(minWordLength = 4, minDocFreq=5))

When I run this, the two DTMs are identical, and when I inspect dtmImproved it still contains words of 3 characters. Why doesn't the minWordLength parameter work? Thank you!

> dtm
A document-term matrix (591 documents, 10533 terms)

Non-/sparse entries: 43058/6181945
Sparsity           : 99%
Maximal term length: 135 
Weighting          : term frequency (tf)
> dtmImproved
A document-term matrix (591 documents, 10533 terms)

Non-/sparse entries: 43058/6181945
Sparsity           : 99%
Maximal term length: 135 
Weighting          : term frequency (tf)
Artem Sultan

2 Answers

25
dtmImproved <- DocumentTermMatrix(t, control=list(wordLengths=c(4, 15), 
                                   bounds = list(global = c(5,Inf))))

This solves the problem! The lack of proper documentation really gets me down (:
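For a quick check that the control options are actually applied, here is a minimal sketch on a toy corpus. The three documents and the object names are made up for illustration, and the thresholds are lowered so that the tiny corpus still has terms left. It assumes a `tm` version in which `wordLengths` defaults to `c(3, Inf)`, which would explain the three-character terms in the unfiltered matrix.

```r
# Toy example; documents and object names are invented for illustration.
# The block is skipped when the tm package is not installed.
if (requireNamespace("tm", quietly = TRUE)) {
  library(tm)

  docs <- c("the dog chased the cat",
            "the cat chased a mouse",
            "a mouse ate the cheese")
  corpus <- VCorpus(VectorSource(docs))

  # No control list: the package defaults apply.
  dtm_default <- DocumentTermMatrix(corpus)

  # wordLengths = c(min, max): keep terms with at least 4 characters.
  # bounds$global = c(min, max): keep terms found in at least 2 documents.
  dtm_filtered <- DocumentTermMatrix(
    corpus,
    control = list(wordLengths = c(4, Inf),
                   bounds = list(global = c(2, Inf)))
  )

  # Compare the two vocabularies side by side.
  print(Terms(dtm_default))
  print(Terms(dtm_filtered))
}
```

If both calls print the same terms, the control list was most likely not recognised, so comparing `Terms()` before and after is a cheap sanity check.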

mnel
Artem Sultan
  • Which version of `tm` are you using? The help for `TermDocumentMatrix` sets out the global options and gives a link to the local options. `minWordLength` is never listed as an option, but `wordLengths` is described in detail. The documentation appears well written and easy to follow. – mnel Nov 13 '12 at 23:55
  • Yep, that was the one that helped. Unfortunately I couldn't find it by googling, but that's more my fault :) – Artem Sultan Nov 14 '12 at 17:34
  • @mnel: it silently ignores any parameter it doesn't recognize, even e.g. `(control=list( bounds=list(c(0,Inf))) )` instead of `(control=list( bounds=list(global=c(0,Inf))) )`. This is a big pain. Did you spot the missing label 'global'? I didn't... – smci Jun 27 '15 at 01:53
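Since unrecognised control entries are dropped without a warning, one possible safeguard is to validate the list before passing it on. The helper `check_control` below is hypothetical and not part of `tm`, and the vector of known option names is an assumption based on the help pages for `termFreq` and `TermDocumentMatrix`; adjust it to your installed version.

```r
# Option names assumed from the tm help pages; this list is NOT read from the
# package itself and may need adjusting for your version.
known_control <- c("tokenize", "tolower", "language", "removePunctuation",
                   "removeNumbers", "stopwords", "stemming", "dictionary",
                   "bounds", "wordLengths", "weighting")

# Hypothetical helper: stop on entries that would be silently ignored.
check_control <- function(control, known = known_control) {
  nms <- names(control)
  if (is.null(nms)) nms <- rep("", length(control))
  unknown <- nms[!(nms %in% known)]
  if (length(unknown) > 0) {
    stop("unrecognised control option(s): ",
         paste0("'", unknown, "'", collapse = ", "))
  }
  # 'bounds' must itself be a list whose elements are labelled local/global.
  b <- control[["bounds"]]  # [[ ]] matches names exactly, unlike $
  if (!is.null(b)) {
    bn <- names(b)
    if (is.null(bn) || !all(bn %in% c("local", "global"))) {
      stop("'bounds' elements must be named 'local' or 'global'")
    }
  }
  invisible(control)
}

# The control list from the question is rejected ...
tryCatch(check_control(list(minWordLength = 4, minDocFreq = 5)),
         error = function(e) message(conditionMessage(e)))

# ... and so is the unlabelled bounds entry from the comment above.
tryCatch(check_control(list(bounds = list(c(0, Inf)))),
         error = function(e) message(conditionMessage(e)))

# The corrected list from the accepted answer passes.
check_control(list(wordLengths = c(4, 15), bounds = list(global = c(5, Inf))))
```

Wrapping the call as `DocumentTermMatrix(t, control = check_control(ctrl))` would then fail early instead of returning an unfiltered matrix.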
0

It is always a good idea to read the source code when it is available. The source of the wordcloud function is on GitHub; here is what it says:
# Author: ianfellows
.....
if(min.freq > max(freq))
    min.freq <- 0

So your DocumentTermMatrix returned frequencies whose max(freq) is below the min.freq bound that you set, i.e. none of the terms occurred at least min.freq times, and the bound was reset to 0.
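The effect of that check can be reproduced in isolation. This is only a sketch: the frequencies are invented, and the final filtering step `freq >= min.freq` is an assumption about how the threshold is applied afterwards, not a quote from the wordcloud source.

```r
# Invented term frequencies for illustration.
freq <- c(apple = 3, banana = 2, cherry = 1)
min.freq <- 5

# Same check as in the quoted snippet: a threshold that no term can reach
# is reset to 0 instead of producing an empty result.
if (min.freq > max(freq))
    min.freq <- 0

# Assumed follow-up step: keep terms at or above the (possibly reset) threshold.
kept <- freq[freq >= min.freq]
print(min.freq)
print(names(kept))
```

With a reachable threshold such as `min.freq <- 2`, the check leaves the value alone and only the terms at or above it are kept.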

Hope this helps.

MJJ