I'm trying to feed a corpus into DocumentTermMatrix (I shorthand as DTM) to get term frequencies, but I noticed that DTM doesn't keep all terms and I don't know why! Check it out:

A<-c(" 95 94 89 91 90 102 103 100 101 98 99 97 110 108 109 106 107")
B<-c(" 95 94 89 91 90 102 103 100 101 98 99 97 110 108 109 106 107")
C<-Corpus(VectorSource(c(A,B)))
inspect(C)

>A corpus with 2 text documents
>
>The metadata consists of 2 tag-value pairs and a data frame
>Available tags are:
>  create_date creator 
>Available variables in the data frame are:
>  MetaID 
>
>[[1]]
> 95 94 89 91 90 102 103 100 101 98 99 97 110 108 109 106 107
>
>[[2]]
> 95 94 89 91 90 102 103 100 101 98 99 97 110 108 109 106 107

So far so good.

But now, I try to feed C into the DTM and it doesn't come out the other end! See:

> dtm<-DocumentTermMatrix(C)
> colnames(dtm)
>[1] "100" "101" "102" "103" "106" "107" "108" "109" "110"

Where are all the results less than 100? Or is it somehow a 2 character thing? I also tried:

dtm<-DocumentTermMatrix(C,control=list(c(1,Inf)))

and

dtm<-TermDocumentMatrix(C,control=list(c(1,Inf)))

to no avail. What gives?

Amit Kohli

1 Answer

If you read the ?TermDocumentMatrix help page, you can see that the additional control= options are listed on the ?termFreq help page.

There is a wordLengths parameter that filters terms by length before they go into the matrix. It defaults to c(3,Inf), so it excludes two-character words. Try setting control=list(wordLengths=c(2,Inf)) to include those short words. (Note that when passing control parameters, you should name the parameters in the list; an unnamed entry such as list(c(1,Inf)) is not matched to any option.)
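As a minimal sketch of the fix, assuming the tm package is installed and reusing the two documents from the question, the default call can be compared with one that names wordLengths in the control list. The object names dtm_default and dtm_short are just illustrative, and the exact behaviour may vary slightly between tm versions:

```r
library(tm)

# Same two documents as in the question
A <- " 95 94 89 91 90 102 103 100 101 98 99 97 110 108 109 106 107"
B <- A
C <- Corpus(VectorSource(c(A, B)))

# Default control: wordLengths = c(3, Inf), so two-character terms are dropped
dtm_default <- DocumentTermMatrix(C)

# Named control entry: lower the minimum term length to 2 characters
dtm_short <- DocumentTermMatrix(C, control = list(wordLengths = c(2, Inf)))

# Compare which terms survive in each matrix
colnames(dtm_default)
colnames(dtm_short)
```

With the minimum length lowered to 2, terms such as "95" and "89" should appear among the column names alongside the three-digit ones.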

MrFlick
  • Yup... that solved it. I HAD checked ?DocumentTermMatrix, but in RStudio it says nothing about wordlengths at all! Is there some way to get more complete info about a command? – Amit Kohli Jun 24 '14 at 14:17
  • @AmitKohli Like I said, the ?DocumentTermMatrix page has a description which points to the ?termFreq page. It is common in R, when one top-level function calls a lower-level function, not to repeat all the parameters for that function on the help page but just to point you to its page. You just need to read all the sections and follow the links. The fact that you were setting a control= value at all tells me you were at least close. – MrFlick Jun 24 '14 at 14:40
  • I see it now. Indeed, thank you... I didn't know that extra info was tucked away in the extra links! – Amit Kohli Jun 24 '14 at 14:48