The TermDocumentMatrix function of the tm package does not behave the way I understand the documentation to describe. It seems to process the terms in ways I have not requested.

Here is an example:

require(tm)
sentence <- "Astrology: I am a Capricorn Sun Cap moon and cap rising...what does that say about me?"
corpus <- Corpus(VectorSource(sentence))
tdm <- TermDocumentMatrix(corpus, control = list(wordLengths = c(1, Inf), 
                                                 removePunctuation = FALSE))
rownames(tdm)

We can see from the output that the punctuation has been removed, and the expression "rising...what" has been split:

 [1] "a"         "about"     "am"        "and"       "astrology" "cap"       "capricorn" "does"      "i"         "me"        "moon"      "rising"    "say"       "sun"       "that"     
[16] "what"  

In the related SO question, the issue was with the tokenizer which was removing the punctuation. However, I am using the default words tokenizer, which I don't believe does this:

> sapply(corpus, words)
      [,1]           
 [1,] "Astrology:"   
 [2,] "I"            
 [3,] "am"           
 [4,] "a"            
 [5,] "Capricorn"    
 [6,] "Sun"          
 [7,] "Cap"          
 [8,] "moon"         
 [9,] "and"          
[10,] "cap"          
[11,] "rising...what"
[12,] "does"         
[13,] "that"         
[14,] "say"          
[15,] "about"        
[16,] "me?" 

Is the observed behaviour incorrect, or am I misunderstanding something?

James Hirschorn
    Try VCorpus (volatile corpus); for some reason it keeps the punctuation. Also drop removePunctuation = FALSE from the TermDocumentMatrix call, and it will work as you want. – n1tk May 07 '17 at 03:52

1 Answer

You have a SimpleCorpus object, a class that came with tm package version 0.7 and which, according to ?SimpleCorpus,

takes internally various shortcuts to boost performance and minimize memory pressure

class(corpus)
# [1] "SimpleCorpus" "Corpus"  

Now, as help(TermDocumentMatrix) states:

Available local options are documented in termFreq and are internally delegated to a termFreq call. This is different for a SimpleCorpus. In this case all options are processed in a fixed order in one pass to improve performance. It always uses the Boost Tokenizer (via Rcpp)...

So you are not using words as the tokenizer, which would indeed give you:

words(sentence)
 [1] "Astrology:"    "I"             "am"            "a"             "Capricorn"     "Sun"           "Cap"          
 [8] "moon"          "and"           "cap"           "rising...what" "does"          "that"          "say"          
[15] "about"         "me?"  
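
To see the difference for yourself, you can call the tokenizers directly on the raw string. This is only a sketch: scan_tokenizer is the whitespace-based tokenizer, and the Boost_tokenizer call is guarded because I am assuming, not guaranteeing, that your installed tm version exports it. The exact Boost output may also differ between tm versions, so none is shown here.

```r
library(tm)

sentence <- "Astrology: I am a Capricorn Sun Cap moon and cap rising...what does that say about me?"

# Whitespace-based tokenization: punctuation stays attached to the tokens
tokens_scan <- scan_tokenizer(sentence)
print(tokens_scan)

# Boost-based tokenization, which is what a SimpleCorpus always uses.
# Guarded, since availability depends on the installed tm version (assumption).
if (exists("Boost_tokenizer")) {
  tokens_boost <- Boost_tokenizer(sentence)
  print(tokens_boost)
}
```

Comparing the two printed vectors shows whether tokens such as "rising...what" survive intact under each tokenizer.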

As suggested in the comments, you could explicitly create a volatile corpus (see ?VCorpus) to regain full flexibility:

A volatile corpus is fully kept in memory and thus all changes only affect the corresponding R object

corpus <- VCorpus(VectorSource(sentence)) 
Terms(TermDocumentMatrix(corpus, control = list(tokenize = "words")))
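
Putting this together with the options from the question, a minimal sketch might look like the following. The variable names vcorpus, tdm and terms are just illustrative, and I am assuming that with a VCorpus the control list is passed through to termFreq as the help page describes.

```r
library(tm)

sentence <- "Astrology: I am a Capricorn Sun Cap moon and cap rising...what does that say about me?"

# VCorpus() rather than Corpus(), so the result is not a SimpleCorpus
vcorpus <- VCorpus(VectorSource(sentence))
class(vcorpus)

# Same control options as in the question, plus an explicit tokenizer
tdm <- TermDocumentMatrix(
  vcorpus,
  control = list(
    tokenize = "words",
    wordLengths = c(1, Inf),
    removePunctuation = FALSE
  )
)

terms <- Terms(tdm)
print(terms)
```

Note that terms are still lower-cased, because tolower is on by default; add tolower = FALSE to the control list if you want to keep the original case.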
lukeA
    Now `VCorpus` gives me the desired behaviour, as you explained. However, with a "real" corpus it is huge: 1.4 GB versus 36 MB for a `SimpleCorpus`, and it is also about 50% slower to work with. – James Hirschorn May 08 '17 at 02:57