Using RapidMiner Textprocessing plugin in Java - Not able to use 'Document' object in the code

Question

I am using RapidMiner 5. I want to build a text preprocessing module to use with a categorization system. I created a process in RapidMiner with these steps:

Tokenize
Transform Case
Stemming
Filtering stopwords
Generating n-grams

I want to add spell correction for these words, so I used the 'Execute Script' operator and wrote a Groovy script for it (adapted from raelcunha's version of the spelling corrector). This is the code I wrote in the Execute Script operator, with help from the RapidMiner community.

Document doc=input[0]

List<Token> newTokens = new LinkedList<Token>();
nWords=train("set2.txt")
for (Token token : doc.getTokenSequence()) {

    //String output=correct((String)token.getToken(),nWords) 
    println token.getToken();
    Token nToken = new Token(correct("garbge",nWords), token);
    newTokens.add(nToken);

}
doc.setTokenSequence(newTokens);
return doc;

This is the code for spell correction (thanks to Norvig).

import com.rapidminer.operator.text.Document;
import com.rapidminer.operator.text.Token;

import java.util.List;
import java.util.LinkedList;

def train(f){
    def n = [:]
    new File(f).eachLine{it.toLowerCase().eachMatch(/\w+/){n[it]=n[it]?n[it]+1:1}}
    n
}

def edits(word) {
    def result = [], n = word.length()-1
    for(i in 0..n) result.add(word[0..<i] + word.substring(i+1))
    for(i in 0..n-1) result.add(word[0..<i] + word[i+1] + word[i, i+1] + word.substring(i+2))
    for(i in 0..n) for(c in 'a'..'z') result.add(word[0..<i] + c + word.substring(i+1))
    for(i in 0..n) for(c in 'a'..'z') result.add(word[0..<i] + c + word.substring(i))
    result
}

def correct(word, nWords) {
    if(nWords[word]) return word
    def list = edits(word), candidates = [:]
    for(s in list) if(nWords[s]) candidates[nWords[s]] = s
    if(candidates.size() > 0) return candidates[candidates.keySet().max()]
    for(s in list) for(w in edits(s)) if(nWords[w]) candidates[nWords[w]] = w
    return candidates.size() > 0 ? candidates[candidates.keySet().max()] : word
}

I am getting a "String index out of bounds" exception when the edits method is called. I do not know how to debug this, because RapidMiner only reports that there is a problem in the Execute Script operator and does not say which line of the script caused it.
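One possible cause, offered as a guess rather than a confirmed diagnosis: in Groovy, a range such as 0..n-1 counts downward when n is 0, so for a one-character word the transposition loop still runs with i = 0 and reads word[i+1], which is past the end of the string. An empty word fails the same way in the first loop. The sketch below shows the same four edit types in Java, which is the language planned for the custom operator, with explicit loop bounds so that short words are safe. The class name SpellEdits is hypothetical and not part of the original script.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class; the name SpellEdits is an assumption for illustration.
class SpellEdits {

    // Returns every string one edit away from word: deletions, adjacent
    // transpositions, replacements, and insertions over the letters 'a'..'z'.
    static List<String> edits(String word) {
        List<String> result = new ArrayList<>();
        int len = word.length();

        // Deletions: drop the character at position i.
        for (int i = 0; i < len; i++) {
            result.add(word.substring(0, i) + word.substring(i + 1));
        }
        // Transpositions: swap the characters at i and i + 1. The bound
        // i < len - 1 means the body never runs for words shorter than two characters.
        for (int i = 0; i < len - 1; i++) {
            result.add(word.substring(0, i) + word.charAt(i + 1) + word.charAt(i)
                    + word.substring(i + 2));
        }
        // Replacements: overwrite position i with each letter.
        for (int i = 0; i < len; i++) {
            for (char c = 'a'; c <= 'z'; c++) {
                result.add(word.substring(0, i) + c + word.substring(i + 1));
            }
        }
        // Insertions: insert each letter at positions 0..len, including the end.
        for (int i = 0; i <= len; i++) {
            for (char c = 'a'; c <= 'z'; c++) {
                result.add(word.substring(0, i) + c + word.substring(i));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // A one-character word no longer throws.
        System.out.println(edits("a").size());
        // "garbage" is one insertion away from "garbge".
        System.out.println(edits("garbge").contains("garbage"));
    }
}
```

If the script stays in Groovy, the equivalent guard would be to skip words shorter than two characters before the transposition loop, or to use counting loops instead of ranges.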

So I am planning to do the same thing by creating an operator in Java, as described in the guide "How to extend RapidMiner".

The things I did:

Included all jar files from the RapidMiner lib folder (C:\Program Files (x86)\Rapid-I\RapidMiner5\lib) in the build path of my Java project.
Started coding by following the guide mentioned above.
The input for my operator is a Document (com.rapidminer.operator.text.Document), as in the script.

But I am not able to use this Document object in my Java code. Can you tell me why? Where are the text processing jars located?

To use the plugin jars, do I need to add other locations to the build path?
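For context, com.rapidminer.operator.text.Document belongs to the Text Processing extension, which is packaged as its own jar separate from the core RapidMiner jars. Where that jar sits on disk depends on how the extension was installed, so its location is an assumption to verify locally. A small check like the following, run with the same build path as the project, can at least confirm whether the class is visible. The class name ClasspathCheck is hypothetical.

```java
// Hypothetical diagnostic class; the name ClasspathCheck is an assumption for illustration.
class ClasspathCheck {

    // Returns true if the named class can be located on the current classpath.
    static boolean isOnClasspath(String className) {
        try {
            // initialize = false: only locate the class, do not run its static initializers.
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        } catch (LinkageError e) {
            // The class could not be linked, for example because a class it depends on is missing.
            return false;
        }
    }

    public static void main(String[] args) {
        String name = "com.rapidminer.operator.text.Document";
        System.out.println(name + " visible: " + isOnClasspath(name));
    }
}
```

If the check reports that the class is not visible, the jar containing the Text Processing extension is probably missing from the build path and would need to be added alongside the jars from the lib folder.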

Do we have to guess both the Errors you are getting AND the code you have written? — tim_yates, Mar 11 '15 at 10:18
Ok. I think I asked too many things..I am rephrasing the question. — pnv, Mar 11 '15 at 10:59
I can't see an `edits` method in the code you posted? Can you also post the full error you get? — tim_yates, Mar 11 '15 at 16:43
