I am trying to stem a corpus using stemDocument from the R tm package, which calls Java. I tried the example from the tm manual:

library("tm")
data("crude")
crude[[1]]
stemDocument(crude[[1]])

and get the following error:

Could not initialize the GenericPropertiesCreator. This exception was produced:
java.lang.NullPointerException

Any help appreciated. I know nothing about Java.

Thanks

user974490

3 Answers

The Snowball stemmer (snowball.jar) cannot find the weka.jar file.

Search your computer for a file called weka.jar. On my Linux system, it is located at

/usr/local/lib/R/site-library/RWekajars/java/weka.jar

Then, in your R code, add lines similar to these at the top:

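# Path to weka.jar (adjust for your system)
wekajar <- "/usr/local/lib/R/site-library/RWekajars/java/weka.jar"
# Keep a copy of the existing classpath in case you need to restore it
oldcp <- Sys.getenv("CLASSPATH")
# Replace the classpath with weka.jar only
Sys.setenv(CLASSPATH = wekajar)

library("tm")
data("crude")
stemDocument(crude[[1]], language = "english")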

This sets the Java CLASSPATH for the R session to the weka.jar file above. Your existing classpath will be overwritten, though. If it held entries that you still need, you can add them back.
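If you would rather keep the old entries, something along these lines might work. It is only a sketch: `prepend_classpath` is a hypothetical helper name, and it assumes weka.jar sits in the java directory of an installed RWekajars package, as in the path above. Like the snippet above, it would need to run before tm is loaded.

```r
# Hypothetical helper: put `entries` in front of the existing CLASSPATH entries.
prepend_classpath <- function(entries, old = Sys.getenv("CLASSPATH")) {
  # Split the old value on the platform separator (":" on Unix, ";" on Windows)
  old_entries <- if (nzchar(old)) {
    strsplit(old, .Platform$path.sep, fixed = TRUE)[[1]]
  } else {
    character(0)
  }
  # New entries first, duplicates dropped
  combined <- unique(c(entries, old_entries))
  paste(combined, collapse = .Platform$path.sep)
}

# Look for weka.jar inside RWekajars; system.file() returns "" if it is not found
wekajar <- system.file("java", "weka.jar", package = "RWekajars")

if (nzchar(wekajar)) {
  Sys.setenv(CLASSPATH = prepend_classpath(wekajar))
}
```

Using .Platform$path.sep instead of a hard-coded ":" should let the same lines work on Windows as well.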

knb

Good question, did you work it out?

I get the same error with only the code that you have. But if you follow the example from the start (i.e. at the heading 'Transformations' on p. 1), creating a corpus and converting it to a Plain Text Document, then you avoid the Java error. I guess the code example in the manual assumes you've already done those two steps.

That said, when I inspect the results, there's no actual stemming... I can't even get @user813966's simple example of stemDocument to do any stemming. I'm looking at the Rstem and Snowball packages instead.

In the meantime, the Python package NLTK is my stemming tool.

Update: I got the stemDocument function working by adding language = "english" as follows:

a <- tm_map(a, stemDocument, language = "english") 
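For context, here is roughly how the corpus `a` in the line above could be built and stemmed end to end. This is a sketch under assumptions: the sample sentences in `docs` are made up, and it assumes a recent tm release, where stemDocument relies on the SnowballC package rather than the Java-based Snowball discussed in this thread.

```r
# Made-up sample text; any character vector should do
docs <- c("The engines were running smoothly",
          "She runs and he is running")

# Only run if both packages are installed (assumption: recent tm + SnowballC)
if (requireNamespace("tm", quietly = TRUE) &&
    requireNamespace("SnowballC", quietly = TRUE)) {
  library("tm")
  a <- VCorpus(VectorSource(docs))                    # step 1: create the corpus
  a <- tm_map(a, stemDocument, language = "english")  # step 2: stem every document
  print(as.character(a[[1]]))                         # inspect the first stemmed document
}
```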

So the complete answer to your question is to follow all the steps for getting your text into R as described in the tm documentation. To make stemDocument work you'll also need rJava (and, on Windows, the JAVA_HOME environment variable set to the directory that contains the jre directory).
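A quick way to check the JAVA_HOME part from within R might look like this. It is a sketch only: `has_jre_dir` is a hypothetical helper, and it assumes the directory layout described above (a jre directory directly under JAVA_HOME), which may not match every Java installation.

```r
# Hypothetical helper: TRUE if `path` is a directory containing a "jre" subdirectory
has_jre_dir <- function(path) {
  nzchar(path) && dir.exists(path) && dir.exists(file.path(path, "jre"))
}

java_home <- Sys.getenv("JAVA_HOME")
if (has_jre_dir(java_home)) {
  message("JAVA_HOME looks usable: ", java_home)
} else {
  message("JAVA_HOME is unset or has no jre directory: '", java_home, "'")
}
```

If the check fails, Sys.setenv(JAVA_HOME = ...) can set the variable for the current session before rJava is loaded.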

Ben

I had the same error. I solved it by adding the Snowball .jar and the corresponding /words repository of stem words to my classpath: C:\Users\xxx.xxx\Documents\R\win-library\2.12\Snowball\java

This was recommended here: http://weka.wikispaces.com/Stemmers

I still get the following messages, but stemming works fine now:

Trying to add database driver (JDBC): RmiJdbc.RJDriver - Warning, not in CLASSPATH?
Trying to add database driver (JDBC): jdbc.idbDriver - Warning, not in CLASSPATH?
Trying to add database driver (JDBC): org.gjt.mm.mysql.Driver - Warning, not in CLASSPATH?
Trying to add database driver (JDBC): com.mckoi.JDBCDriver - Warning, not in CLASSPATH?
Trying to add database driver (JDBC): org.hsqldb.jdbcDriver - Warning, not in CLASSPATH?
[KnowledgeFlow] Loading properties and plugins...
[KnowledgeFlow] Initializing KF...
SnRf
    Thanks for posting this. I tried to follow the same instructions, and I get the same messages that you posted. However, the text is not stemmed. I think I might be missing a step. When you say that you added the corresponding /words repository, what is that? Is it included in \Snowball\java as well? – exl Dec 10 '11 at 04:46