The TermDocumentMatrix function of the tm package is not behaving the way I understand it from the documentation. It seems to be processing the terms in ways I have not requested.
Here is an example:
require(tm)
sentence <- "Astrology: I am a Capricorn Sun Cap moon and cap rising...what does that say about me?"
corpus <- Corpus(VectorSource(sentence))
tdm <- TermDocumentMatrix(corpus, control = list(wordLengths = c(1, Inf),
removePunctuation = FALSE))
rownames(tdm)
We can see from the output that the punctuation has been removed, and the expression "rising...what" has been split:
[1] "a" "about" "am" "and" "astrology" "cap" "capricorn" "does" "i" "me" "moon" "rising" "say" "sun" "that"
[16] "what"
In the related SO question, the issue was that the tokenizer was removing the punctuation. However, I am using the default words tokenizer, which I don't believe does this:
> sapply(corpus, words)
[,1]
[1,] "Astrology:"
[2,] "I"
[3,] "am"
[4,] "a"
[5,] "Capricorn"
[6,] "Sun"
[7,] "Cap"
[8,] "moon"
[9,] "and"
[10,] "cap"
[11,] "rising...what"
[12,] "does"
[13,] "that"
[14,] "say"
[15,] "about"
[16,] "me?"
Is the observed behaviour incorrect, or am I misunderstanding something?