
I am new to R and am trying to create a term-document matrix from a CSV file. However, the results show that some of the words are missing the final letter "e". How can I make the term-document matrix show the full words? I would also appreciate it if you could point out anything else in the code that doesn't look right. Thank you!

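library(tm)
library(SnowballC)  # stemming backend used by stemDocument

# Read the CSV; keep text columns as character strings rather than factors
posts <- read.csv("/abcd.csv", header = TRUE, stringsAsFactors = FALSE)

# Note: passing the whole data frame makes each column one document
corpus <- Corpus(VectorSource(posts))

# Clean the text; every step must operate on the corpus object created above
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stemDocument)
inspect(corpus[9])

# Build the term-document matrix, keeping terms of any length
tdm <- TermDocumentMatrix(corpus, control = list(wordLengths = c(1, Inf)))
tdm
tdm <- as.matrix(tdm)
rowSums(tdm)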

Below are some of the terms that appear in the results:

caus
downtim
failur
outag
unreachabl

Has QUIT--Anony-Mousse
Amelia
  • Run the script line by line. Let us know when you find the line that is dropping the last letter. It's difficult to say much more without a sample data set. – manotheshark Apr 06 '17 at 18:12
  • Try skipping the line `Corpus<-tm_map(Corpus,stemDocument)` and then run the rest of your script. Do you see a difference? ;P – alvas Apr 07 '17 at 09:20

1 Answer

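Because you are using stemming.

The `stemDocument` step reduces each word to its stem, which usually means the last few characters are removed (for example, "cause" becomes "caus"). If you want the full words in the term-document matrix, leave that step out.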
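
To see the effect in isolation, here is a minimal sketch that runs the same cleaning steps with and without `stemDocument`. It assumes the tm and SnowballC packages are installed. The `texts` vector is hypothetical sample data standing in for the CSV contents, which were not posted, and the variable names are illustrative only.

```r
library(tm)  # stemDocument also needs the SnowballC package installed

# Hypothetical sample text (assumption: the real CSV was not posted)
texts <- c("The outage was a failure that caused downtime",
           "The server became unreachable because of the outage")

# Shared cleaning steps, without stemming
corpus <- VCorpus(VectorSource(texts))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)

# Without stemming: terms keep their full spelling, e.g. "outage"
tdm_full <- TermDocumentMatrix(corpus,
                               control = list(wordLengths = c(1, Inf)))
print(Terms(tdm_full))

# With stemming: word endings are stripped, e.g. "outage" becomes "outag"
stemmed <- tm_map(corpus, stemDocument)
tdm_stem <- TermDocumentMatrix(stemmed,
                               control = list(wordLengths = c(1, Inf)))
print(Terms(tdm_stem))
```

If stemming is still wanted so that different forms of a word are counted together, tm also provides `stemCompletion`, which maps stems back to full words taken from a dictionary. Whether it recovers the exact original spelling depends on the dictionary supplied, so the results are worth checking by hand.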

Has QUIT--Anony-Mousse