TermDocumentMatrix doing unrequested cleaning (e.g. removing punctuation)

Question

The TermDocumentMatrix function of the tm package is not functioning according to my understanding of the documentation. It seems to be doing processing on the terms that I have not requested.

Here is an example:

require(tm)
sentence <- "Astrology: I am a Capricorn Sun Cap moon and cap rising...what does that say about me?"
corpus <- Corpus(VectorSource(sentence))
tdm <- TermDocumentMatrix(corpus, control = list(wordLengths = c(1, Inf), 
                                                 removePunctuation = FALSE))
rownames(tdm)

We can see from the output that the punctuation has been removed, and the expression "rising...what" has been split:

 [1] "a"         "about"     "am"        "and"       "astrology" "cap"       "capricorn" "does"      "i"         "me"        "moon"      "rising"    "say"       "sun"       "that"     
[16] "what"

In the related SO question, the issue was with the tokenizer which was removing the punctuation. However, I am using the default words tokenizer, which I don't believe does this:

> sapply(corpus, words)
      [,1]           
 [1,] "Astrology:"   
 [2,] "I"            
 [3,] "am"           
 [4,] "a"            
 [5,] "Capricorn"    
 [6,] "Sun"          
 [7,] "Cap"          
 [8,] "moon"         
 [9,] "and"          
[10,] "cap"          
[11,] "rising...what"
[12,] "does"         
[13,] "that"         
[14,] "say"          
[15,] "about"        
[16,] "me?"

Is the observed behaviour incorrect, or what is my misunderstanding?

Try VCorpus (volatile corpus); for some reason it does keep the punctuation. You can also drop removePunctuation = FALSE from the TermDocumentMatrix call and it will work as you want. — n1tk, May 07 '17 at 03:52

lukeA · Accepted Answer · 2017-05-07T14:25:38.077

You got a SimpleCorpus object, a class introduced in tm version 0.7, which, according to ?SimpleCorpus,

takes internally various shortcuts to boost performance and minimize memory pressure

class(corpus)
# [1] "SimpleCorpus" "Corpus"

Now, as help(TermDocumentMatrix) states:

Available local options are documented in termFreq and are internally delegated to a termFreq call. This is different for a SimpleCorpus. In this case all options are processed in a fixed order in one pass to improve performance. It always uses the Boost Tokenizer (via Rcpp)...

So you are not using words as the tokenizer. Calling words directly would indeed give you

words(sentence)
 [1] "Astrology:"    "I"             "am"            "a"             "Capricorn"     "Sun"           "Cap"          
 [8] "moon"          "and"           "cap"           "rising...what" "does"          "that"          "say"          
[15] "about"         "me?"
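
To see the difference between the two tokenizers directly, they can be run side by side on the same string. This is only a sketch: it assumes tm 0.7 or later, where Boost_tokenizer is exported, and it assumes that calling Boost_tokenizer by hand is representative of what the SimpleCorpus code path does internally (the help page only says that path "uses the Boost Tokenizer").

```r
library(tm)  # assumes tm >= 0.7, which exports Boost_tokenizer()

sentence <- "Astrology: I am a Capricorn Sun Cap moon and cap rising...what does that say about me?"

# words() splits on whitespace, so punctuation stays attached to the tokens.
words_tokens <- words(sentence)

# Boost_tokenizer() is the tokenizer that help(TermDocumentMatrix) names
# for the SimpleCorpus case.
boost_tokens <- Boost_tokenizer(sentence)

# Tokens produced by words() that the Boost tokenizer does not produce.
setdiff(words_tokens, boost_tokens)
```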

As stated in the comments, you could explicitly create a volatile corpus (see ?VCorpus) to gain back full flexibility:

A volatile corpus is fully kept in memory and thus all changes only affect the corresponding R object

corpus <- VCorpus(VectorSource(sentence)) 
Terms(TermDocumentMatrix(corpus, control = list(tokenize = "words")))
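
Combined with the control options from the question, a complete version might look like the sketch below. The variable names vcorpus, tdm and terms are only illustrative. Note that termFreq lower-cases terms by default, so the terms come back in lower case unless tolower = FALSE is added to the control list.

```r
library(tm)

sentence <- "Astrology: I am a Capricorn Sun Cap moon and cap rising...what does that say about me?"

# A VCorpus is processed through termFreq(), so the control options
# documented there (tokenize, wordLengths, removePunctuation, ...) apply.
vcorpus <- VCorpus(VectorSource(sentence))

tdm <- TermDocumentMatrix(
  vcorpus,
  control = list(
    tokenize = "words",           # whitespace-based tokenizer
    wordLengths = c(1, Inf),      # keep one- and two-character terms
    removePunctuation = FALSE     # leave punctuation attached
  )
)

terms <- Terms(tdm)
print(terms)
```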

`VCorpus` now gives me the desired behaviour, as you explained. However, with a "real" corpus it is huge: 1.4 GB versus 36 MB for a `SimpleCorpus`, and it is also about 50% slower to work with. — James Hirschorn, May 08 '17 at 02:57
