Update:

Thanks for the help; see the comments. The cause was my package version: once I removed the tolower step, stemming works. I just need to find another way to convert the text to lowercase.

============

I am doing basic text mining on a list of documents, and everything works fine until I try to use stemDocument.

The tm_map steps I have already applied, with library(tm), are as follows:

fbVec<-VectorSource(data[,1])
fbCorpus<-Corpus(fbVec)
fbCorpus <- tm_map(fbCorpus, tolower)
fbCorpus <- tm_map(fbCorpus, removePunctuation)
fbCorpus <- tm_map(fbCorpus, removeNumbers)
fbCorpus <- tm_map(fbCorpus, removeWords, stopwords("english"))
fbCorpus <- tm_map(fbCorpus, removeWords, "pr")
fbCorpus <- tm_map(fbCorpus, stripWhitespace)

The result is as follows:

[[1]]
[1]  easy post position search resumes improvement searching resumes

[[2]]
[1]  easy use good candidiates improvement allow multiple emails sent 

[[3]]
[1]  applicants young kids absolutely sales experience waste time looking improvement applicants apply experience looking dont need kids just high school

[[4]]
[1]  abundance resumes

Then I tried to stem the corpus:

library(SnowballC)    
fbCorpus <- tm_map(fbCorpus, stemDocument)

But the result is not what I expected: it looks like only the last word of each document is stemmed. The result is as follows:

[[1]]
[1]  easy post position search resumes improvement searching resum

[[2]]
[1]  easy use good candidiates improvement allow multiple emails sent 

[[3]]
[1]  applicants young kids absolutely sales experience waste time looking improvement applicants apply experience looking dont need kids just high school

[[4]]
[1]  abundance resum

Can anyone help?

Has QUIT--Anony-Mousse
user3754216
  • I cannot replicate your result. Stemming with your data and code above works for me. What versions of the `tm` and `SnowballC` libraries do you have installed? `sessionInfo()` should tell you. – MrFlick Jun 19 '14 at 17:44
  • @MrFlick [1] SnowballC_0.5 textcat_1.0-2 RTextTools_1.4.2 SparseM_1.03 tm_0.6 NLP_0.1-3 it is so strange.... – user3754216 Jun 19 '14 at 17:55
  • I ran on tm 0.5.10. I helped someone before with tm 0.6 and it changed some things. I think the problem may be `tolower`. Can you try without that? – MrFlick Jun 19 '14 at 18:07
  • @MrFlick Oh, yes! It is tolower! I deleted it and it works! I don't know why. Thanks! I suppose now I just need another way to make it all lowercase :) – user3754216 Jun 19 '14 at 18:17
  • I've posted a workaround as an answer. Hopefully that should work. (Not sure, since I'm not running 0.6 and can't test.) – MrFlick Jun 19 '14 at 18:26

3 Answers

This problem appears in tm 0.6 and has to do with using functions that are not in the list returned by getTransformations() from tm. The problem is that tolower just returns a character vector, not a "PlainTextDocument" as tm_map expects. The tm package provides the content_transformer function to take care of managing the PlainTextDocument:

fbCorpus  <- tm_map(fbCorpus, content_transformer(tolower))
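To make the fix concrete, here is a minimal sketch of a pipeline that lowercases through content_transformer and then stems. The sample strings and the names docs, corpus and stemmed are made up for illustration (they stand in for data[,1] from the question), and VCorpus is used explicitly on the assumption that you want list-of-document behaviour regardless of tm version. I have not verified this against every tm release.

```r
library(tm)         # Corpus classes, tm_map, content_transformer
library(SnowballC)  # stemming backend used by stemDocument

# Hypothetical sample data standing in for data[,1] in the question
docs <- c("Easy to post a position and search resumes",
          "Abundance of resumes")

corpus <- VCorpus(VectorSource(docs))

# Wrap the base function so each element stays a PlainTextDocument
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, stemDocument)

# Pull the text back out as a plain character vector for inspection
stemmed <- unname(sapply(corpus, content))
print(stemmed)
```

If every word (not only the last one) comes back stemmed, the documents survived the tolower step as PlainTextDocument objects, which is what the later transformations rely on.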
MrFlick

You are not loading your documents correctly. If your data is in a file x.csv, use the following:

    > x <- read.csv(file_loc, header = TRUE)  # file_loc is the path to the csv file
    > x <- data.frame(lapply(x, as.character), stringsAsFactors = FALSE)

    > require(tm)
    Loading required package: tm

    > dd <- Corpus(DataframeSource(x))

    > inspect(dd)

Then simply use stemDocument like below:

  fbCorpus <- tm_map(fbCorpus, stemDocument)
user2481422

I had the same problem.

If you look at the arguments of stemDocument, you can specify the language used for stemming. I found that specifying "english" solved the problem for me.

fbCorpus <- tm_map(fbCorpus, stemDocument, language = "english")
Darren Tsai