Is there a way to drop stop words (e.g., 'of', 'a', 'the') before passing documents to a Java-based document classifier such as OpenNLP? Or, if you do it yourself in Java, what is the most efficient way, given that string comparison is expensive? Each document is small, around 100 words on average, but the number of documents is assumed to be large.
For example:
import java.util.ArrayList;
import java.util.List;

// Populate the stop words into a list
List<String> stopWordsList = new ArrayList<>();
// For each document in the collection:
String currentDoc = getCurrentDoc();
String[] wordsArray = currentDoc.split(" ");
for (String word : wordsArray) {
    if (stopWordsList.contains(word)) { // linear scan of the list for every word
        // Drop it
    }
}
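For comparison, here is a rough sketch of the same loop with the stop words held in a HashSet, so that each lookup is a hash lookup rather than a scan over the whole list. The class name StopWordFilter, the method removeStopWords, the three-word stop list, and the lowercase/whitespace tokenization are all placeholders I made up for illustration; none of this is OpenNLP's API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class StopWordFilter {
    // Hypothetical, deliberately tiny stop-word list; a real one would be loaded from a file.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("of", "a", "the"));

    // Returns the words of doc, in their original order, with stop words removed.
    public static List<String> removeStopWords(String doc) {
        List<String> kept = new ArrayList<>();
        // Assumption: words are separated by runs of whitespace.
        for (String word : doc.trim().split("\\s+")) {
            // Compare in lowercase so "The" and "the" are treated alike.
            if (!word.isEmpty() && !STOP_WORDS.contains(word.toLowerCase(Locale.ROOT))) {
                kept.add(word);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        System.out.println(removeStopWords("The name of the game")); // prints [name, game]
    }
}
```

With roughly 100 words per document, the per-document cost is then dominated by splitting the string, and the set is built once and shared across all documents.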