What is an efficient way to drop stop words (e.g., "of", "a", "the") from a text using Java?

Question

Is there a way to drop stop words (e.g., 'of', 'a', 'the') before using a Java-based document classifier (such as OpenNLP)? Or, if you are doing it yourself in Java, what would be the most efficient way to do it, given that string comparison is inefficient? Each document is not that big (around 100 words on average), but the number of documents is assumed to be large.

E.g.,
import java.util.ArrayList;
import java.util.List;

// Populate the stop words into a list
List<String> stopWordsList = new ArrayList<>();

// Iterate through a list of documents
String currentDoc = getCurrentDoc();

// Split the current document into words on single spaces
String[] wordsArray = currentDoc.split(" ");

for (String word : wordsArray) {
    if (stopWordsList.contains(word)) {
        // Drop it
    }
}

score 0 · Answer 1 · answered Jul 17 '14 at 03:41

Your technique is fine. However, you should make your stopWordsList a Set, not a List, so that you can look things up in constant time instead of linear time. In other words, you don't want to have to look through the whole stopWordsList to see if word is in there; you want to just see if it's in the set right away.
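
As a rough illustration of the Set-based lookup described above, here is a minimal sketch. The class name `StopWordFilter`, the method `removeStopWords`, and the five-word stop list are all made up for the example. Lowercasing each word before the lookup and splitting on any whitespace are extra assumptions that the question's code does not make.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class StopWordFilter {
    // Illustrative stop words only (an assumption); a real list would be much longer.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("of", "a", "the", "is", "for"));

    // Returns the document with stop words removed.
    // HashSet.contains runs in constant time on average, unlike List.contains.
    public static String removeStopWords(String doc) {
        List<String> kept = new ArrayList<>();
        for (String word : doc.split("\\s+")) {
            // Lowercasing is an assumption so that "The" matches "the".
            if (!word.isEmpty() && !STOP_WORDS.contains(word.toLowerCase(Locale.ROOT))) {
                kept.add(word);
            }
        }
        return String.join(" ", kept);
    }

    public static void main(String[] args) {
        // prints "price ticket high"
        System.out.println(removeStopWords("The price of a ticket is high"));
    }
}
```

Because the set is built once and shared, the per-document cost stays proportional to the number of words in that document, which suits many small documents.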

maheeka · Answer 2 · 2014-10-07T09:19:16.867

-1

You can try the following code:

    String sentence = "This is a sample sentence for testing stop word deletion";

    String pattern = " a | the | for | is ";
    sentence = sentence.replaceAll(pattern, " ");

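Result : This a sample sentence testing stop word deletion

Note that "a" survives here: the match on " is " consumes the space in front of "a", so " a " can no longer match at that position.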

The pattern contains all the stop words separated by the pipe character (|), meaning that any one of them may match. Remember to put spaces around the stop words so that they match only as whole words. Without the spaces, the pattern would replace every occurrence of a stop word's character sequence, even inside other words.

edited Oct 07 '14 at 09:19

answered Oct 07 '14 at 09:12

maheeka


Stop words can be upper or lower case, and your pattern doesn't work if the stop word is not wrapped between two white spaces, which is the case if the sentence starts or ends with a stop word, or if the stop word is followed by a comma, etc. – cheseaux Oct 15 '14 at 09:13
Noted. But most of the above could be solved with a proper regex pattern, or a series of regex patterns for that matter. – maheeka Oct 15 '14 at 23:55
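
One possible shape for such a pattern, offered only as a sketch and not as the answerer's actual intent, uses `\b` word boundaries instead of literal spaces and the `Pattern.CASE_INSENSITIVE` flag. The class name `RegexStopWords`, the method name, and the four-word list are assumptions made for the example.

```java
import java.util.regex.Pattern;

public class RegexStopWords {
    // \b matches at word boundaries, so a stop word is found at the start or end
    // of the sentence and next to punctuation. The trailing \s* removes the
    // whitespace after the word so no double spaces are left behind.
    private static final Pattern STOP_WORDS =
            Pattern.compile("\\b(?:a|the|for|is)\\b\\s*", Pattern.CASE_INSENSITIVE);

    public static String removeStopWords(String sentence) {
        // trim() drops whitespace left over when the last word was a stop word
        return STOP_WORDS.matcher(sentence).replaceAll("").trim();
    }

    public static void main(String[] args) {
        // prints "This sample sentence testing stop word deletion"
        System.out.println(removeStopWords(
                "This is a sample sentence for testing stop word deletion"));
        // prints "apple red, start"
        System.out.println(removeStopWords("The apple is red, for a start"));
    }
}
```

Compiling the pattern once and reusing it avoids recompiling the regex for every document, which `String.replaceAll` would otherwise do on each call.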

score -2 · Answer 3 · answered Jul 16 '14 at 20:41


No need to split; simply replace the target string with an empty string:

String currentDoc = getCurrentDoc();
currentDoc = currentDoc.replace(stringToReplace, "");

Or use a regex with replaceAll if you have multiple words to replace.

answered Jul 16 '14 at 20:41

C.B.


Um, no, this is not a good idea. "a" is a stopword. Simply doing a replace would turn "apple" into "pple", which is clearly not what you want. – dhg Jul 17 '14 at 03:37
