I am using RapidMiner 5 and want to build a text preprocessing module for use with a categorization system. I created a process in RapidMiner with these steps:
- Tokenize
- Transform Case
- Stemming
- Filtering stopwords
- Generating n-grams
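To make the steps above concrete, here is a rough plain-Java sketch of the same kind of pipeline outside RapidMiner. It is only an illustration: the class name `PreprocessSketch`, the tiny stopword list, and the "_" separator for bigrams are assumptions, stemming is left out, and none of this is RapidMiner's own implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical stand-in for the RapidMiner process; not RapidMiner code.
final class PreprocessSketch {
    // Assumed, deliberately tiny stopword list for illustration only.
    private static final Set<String> STOPWORDS =
            new HashSet<>(Arrays.asList("the", "a", "is", "of"));

    private PreprocessSketch() {}

    // Tokenize: split on anything that is not a letter and drop empty pieces.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.split("[^\\p{L}]+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    // Transform case: lower-case every token.
    static List<String> lowerCase(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            out.add(token.toLowerCase(Locale.ROOT));
        }
        return out;
    }

    // Filter stopwords: keep only tokens that are not in STOPWORDS.
    static List<String> removeStopwords(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            if (!STOPWORDS.contains(token)) {
                out.add(token);
            }
        }
        return out;
    }

    // Generate n-grams (n = 2): keep the unigrams and append adjacent pairs.
    static List<String> withBigrams(List<String> tokens) {
        List<String> out = new ArrayList<>(tokens);
        for (int i = 0; i + 1 < tokens.size(); i++) {
            out.add(tokens.get(i) + "_" + tokens.get(i + 1));
        }
        return out;
    }

    // Stemming is omitted here; it would sit between lowerCase and removeStopwords.
    static List<String> preprocess(String text) {
        return withBigrams(removeStopwords(lowerCase(tokenize(text))));
    }

    public static void main(String[] args) {
        System.out.println(preprocess("The quick fox is quick"));
    }
}
```

A spell-correction step would slot in after tokenization, which is where the script below is meant to go.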
I want to spell-correct these tokens, so I added an 'Execute Script' operator and wrote a Groovy script for it (adapted from raelcunha's implementation). This is the code, written with help from the RapidMiner community, that I put in the Execute Script operator:
    Document doc = input[0]
    List<Token> newTokens = new LinkedList<Token>();
    nWords = train("set2.txt")
    for (Token token : doc.getTokenSequence()) {
        //String output=correct((String)token.getToken(),nWords)
        println token.getToken();
        Token nToken = new Token(correct("garbge",nWords), token);
        newTokens.add(nToken);
    }
    doc.setTokenSequence(newTokens);
    return doc;
This is the spell-correction code (thanks to Norvig):
    import com.rapidminer.operator.text.Document;
    import com.rapidminer.operator.text.Token;
    import java.util.List;
    import java.util.LinkedList;

    def train(f){
        def n = [:]
        new File(f).eachLine{it.toLowerCase().eachMatch(/\w+/){n[it]=n[it]?n[it]+1:1}}
        n
    }

    def edits(word) {
        def result = [], n = word.length()-1
        for(i in 0..n) result.add(word[0..<i] + word.substring(i+1))
        for(i in 0..n-1) result.add(word[0..<i] + word[i+1] + word[i, i+1] + word.substring(i+2))
        for(i in 0..n) for(c in 'a'..'z') result.add(word[0..<i] + c + word.substring(i+1))
        for(i in 0..n) for(c in 'a'..'z') result.add(word[0..<i] + c + word.substring(i))
        result
    }

    def correct(word, nWords) {
        if(nWords[word]) return word
        def list = edits(word), candidates = [:]
        for(s in list) if(nWords[s]) candidates[nWords[s]] = s
        if(candidates.size() > 0) return candidates[candidates.keySet().max()]
        for(s in list) for(w in edits(s)) if(nWords[w]) candidates[nWords[w]] = w
        return candidates.size() > 0 ? candidates[candidates.keySet().max()] : word
    }
I am getting a "String index out of bounds" exception when the edits method is called.
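For comparison, the four edit operations that edits is meant to produce (deletions, transpositions, replacements, insertions) can be sketched in Java with ordinary exclusive loop bounds. This is an illustrative sketch, not a port of the script: the class name `EditsSketch` is made up, and unlike the script it also inserts a letter after the last character, as Norvig's description does. One detail that may be relevant to the exception: a Groovy inclusive range such as `0..n-1` counts downward (0, then -1) when `n` is 0, so it still runs for a one-character word, whereas the `for` loops below are simply skipped for one-character or empty words.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of RapidMiner or the original script.
final class EditsSketch {
    private EditsSketch() {}

    /** Returns every string one edit away from word (duplicates are kept). */
    static List<String> edits(String word) {
        List<String> result = new ArrayList<>();
        int len = word.length();

        // Deletions: drop the character at position i.
        for (int i = 0; i < len; i++) {
            result.add(word.substring(0, i) + word.substring(i + 1));
        }
        // Transpositions: swap characters i and i+1 (skipped when len < 2).
        for (int i = 0; i + 1 < len; i++) {
            result.add(word.substring(0, i) + word.charAt(i + 1) + word.charAt(i)
                    + word.substring(i + 2));
        }
        // Replacements: put each letter a..z in place of the character at position i.
        for (int i = 0; i < len; i++) {
            for (char c = 'a'; c <= 'z'; c++) {
                result.add(word.substring(0, i) + c + word.substring(i + 1));
            }
        }
        // Insertions: put each letter a..z before position i, including i == len.
        for (int i = 0; i <= len; i++) {
            for (char c = 'a'; c <= 'z'; c++) {
                result.add(word.substring(0, i) + c + word.substring(i));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Short and empty inputs are handled without any index errors.
        System.out.println(edits("").size());
        System.out.println(edits("a").size());
        System.out.println(edits("garbge").contains("garbage"));
    }
}
```

Every index passed to `substring` or `charAt` stays within `0..len`, so very short tokens, including the shorter strings produced when edits is applied a second time, cannot push an index out of range.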
I also do not know how to debug this, because RapidMiner only tells me that there is an issue in the Execute Script operator, not which line of the script caused it.
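One general way to recover the failing line is to catch the exception close to where it is thrown and print its full stack trace. The Java sketch below shows the idea in isolation; the class and method names (`TraceSketch`, `stackTraceOf`, `traceOf`) are hypothetical, and whether the same try/catch approach surfaces script line numbers inside the Execute Script operator is an assumption I have not verified.

```java
import java.io.PrintWriter;
import java.io.StringWriter;

// Hypothetical helper for illustration; not a RapidMiner API.
final class TraceSketch {
    private TraceSketch() {}

    /** Renders the full stack trace of t, including line numbers, as text. */
    static String stackTraceOf(Throwable t) {
        StringWriter buffer = new StringWriter();
        t.printStackTrace(new PrintWriter(buffer, true));
        return buffer.toString();
    }

    /** Runs body and returns the stack trace if it throws, or null if it succeeds. */
    static String traceOf(Runnable body) {
        try {
            body.run();
            return null;
        } catch (RuntimeException e) {
            return stackTraceOf(e);
        }
    }

    public static void main(String[] args) {
        // Deliberately out-of-range call, similar in kind to the reported error.
        String trace = traceOf(() -> "a".substring(2));
        System.err.println(trace);
    }
}
```

The printed trace lists one frame per call with its source line where available, which narrows the failure down to a single statement.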
So I am planning to do the same thing by creating an operator in Java, as described in the guide How to extend RapidMiner.
What I have done so far:
- Added all jar files from the RapidMiner lib folder (C:\Program Files (x86)\Rapid-I\RapidMiner5\lib) to the build path of my Java project.
- Started coding by following the guide linked above.
- Made the input for my operator a Document (com.rapidminer.operator.text.Document), as in the script.
However, I am not able to use the Document class in this code. Can you tell me why? Where are the text processing jars located?
To use the plugin jars, do I need to add other locations to the build path?