
Machine: Windows 7, 64-bit. R version: R version 3.1.2 (2014-10-31) -- "Pumpkin Helmet"

I am stemming some text for an analysis. Everything works up to the 'stemCompletion' step. For more context, please see below.

Packages:

  1. tm
  2. SnowballC
  3. rJava
  4. RWeka
  5. RWekajars
  6. NLP

Sample list of words

test <- as.vector(c('win', 'winner', 'wins', 'wins', 'winning'))

Convert to Corpus

Test_Corpus <- Corpus(VectorSource(test))

Text manipulations

Test_Corpus <- tm_map(Test_Corpus, content_transformer(tolower))
Test_Corpus <- tm_map(Test_Corpus, removePunctuation)
Test_Corpus <- tm_map(Test_Corpus, removeNumbers)

Stemming using tm_map under the tm package

Test_stem <- tm_map(Test_Corpus, stemDocument, language = 'english')

Below is the result from stemming above, which is all correct so far:

  1. win
  2. winner
  3. win
  4. win
  5. win

Now comes the issue. When I try to use Test_Corpus as a dictionary to complete the stems back into full words, using the following code:

Test_complete <- tm_map(Test_stem, stemCompletion, Test_Corpus)

I get the following warnings (one per document):

Warning messages:

1: In grep(sprintf("^%s", w), dictionary, value = TRUE) :
argument 'pattern' has length > 1 and only the first element will be used
2: In grep(sprintf("^%s", w), dictionary, value = TRUE) :
argument 'pattern' has length > 1 and only the first element will be used
3: In grep(sprintf("^%s", w), dictionary, value = TRUE) :
argument 'pattern' has length > 1 and only the first element will be used
4: In grep(sprintf("^%s", w), dictionary, value = TRUE) :
argument 'pattern' has length > 1 and only the first element will be used
5: In grep(sprintf("^%s", w), dictionary, value = TRUE) :
argument 'pattern' has length > 1 and only the first element will be used
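
For reference, the same warning can be reproduced in base R without tm. This is only a sketch of the mechanism: the grep() call is copied from the warning text, while the values of w and dictionary are made up for illustration. It assumes that, inside stemCompletion, w ends up holding more than one string when a whole document object is passed in by tm_map; that is a guess about the cause, not something confirmed here.

```r
# Illustrative values only; not taken from tm internals
dictionary <- c("win", "winner", "wins", "winning")

# One stem at a time: the pattern has length 1, so grep() stays quiet
single <- grep(sprintf("^%s", "win"), dictionary, value = TRUE)

# Several strings at once: sprintf() returns a length-2 pattern and
# grep() warns that only the first element will be used
w <- c("win", "winner")
multi <- withCallingHandlers(
  grep(sprintf("^%s", w), dictionary, value = TRUE),
  warning = function(cond) {
    message("caught: ", conditionMessage(cond))
    invokeRestart("muffleWarning")
  }
)
```

Because only the first pattern is used, both calls match against "^win", which suggests the completion step is not seeing the stems one word at a time.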

I have tried several things suggested in previous posts, and I have seen that other people with the same problem tried them with no luck either. Here is what I tried:

  1. Updated Java
  2. Used content_transformer
  3. Used PlainTextDocument
Jeffrey Bosboom
Jacob Johnston
  • I'm not sure your formatting is doing what you think it is. Indent for code blocks (including comments) and try to avoid overuse of headers. – Nathan Tuggy Feb 20 '15 at 01:19

1 Answer


I think you need to save your Test_Corpus as a dictionary before the stemming step. You could try something like Test_Corpus <- corpus, then run the stemming and use corpus later on in Test_complete <- tm_map(corpus, stemCompletion).
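
One way this idea might look in practice is sketched below. It keeps the unstemmed words in a plain character vector as the dictionary and calls stemCompletion on individual words rather than on whole documents. The helper complete_text is hypothetical (it is not part of tm), VCorpus is used here in place of Corpus so that each text is handled as its own document, and the sketch has not been checked against tm as shipped for R 3.1.2.

```r
library(tm)  # stemDocument() also relies on the SnowballC package

# The unstemmed words double as the completion dictionary
test <- c("win", "winner", "wins", "wins", "winning")

Test_Corpus <- VCorpus(VectorSource(test))
Test_stem <- tm_map(Test_Corpus, stemDocument, language = "english")

# Hypothetical helper: split each text into words, complete every word
# against the dictionary, then join the words back into a single string
complete_text <- function(text, dictionary) {
  vapply(text, function(one_text) {
    words <- unlist(strsplit(one_text, " ", fixed = TRUE))
    words <- words[nzchar(words)]
    paste(stemCompletion(words, dictionary = dictionary), collapse = " ")
  }, character(1), USE.NAMES = FALSE)
}

# content_transformer() hands the document text to the helper as a
# character vector; tm_map() forwards the dictionary argument
Test_complete <- tm_map(Test_stem, content_transformer(complete_text),
                        dictionary = test)
```

The completed text of each document can then be inspected with content(Test_complete[[1]]) and so on. Which full word is chosen for a given stem depends on the type argument of stemCompletion, so the result may not match the original words exactly.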