2

How do I remove these stopwords in the most efficient way? The approach below doesn't remove the stopwords. What am I missing?

Is there any other way to do this?

I want to accomplish this in the most time-efficient way in Java.

public static HashSet<String> hs = new HashSet<String>();


public static String[] stopwords = {"a", "able", "about",
        "across", "after", "all", "almost", "also", "am", "among", "an",
        "and", "any", "are", "as", "at", "b", "be", "because", "been",
        "but", "by", "c", "can", "cannot", "could", "d", "dear", "did",
        "do", "does", "e", "either", "else", "ever", "every", "f", "for",
        "from", "g", "get", "got", "h", "had", "has", "have", "he", "her",
        "hers", "him", "his", "how", "however", "i", "if", "in", "into",
        "is", "it", "its", "j", "just", "k", "l", "least", "let", "like",
        "likely", "m", "may", "me", "might", "most", "must", "my",
        "neither", "n", "no", "nor", "not", "o", "of", "off", "often",
        "on", "only", "or", "other", "our", "own", "p", "q", "r", "rather",
        "s", "said", "say", "says", "she", "should", "since", "so", "some",
        "t", "than", "that", "the", "their", "them", "then", "there",
        "these", "they", "this", "tis", "to", "too", "twas", "u", "us",
        "v", "w", "wants", "was", "we", "were", "what", "when", "where",
        "which", "while", "who", "whom", "why", "will", "with", "would",
        "x", "y", "yet", "you", "your", "z"};
public StopWords()
{
    int len= stopwords.length;
    for(int i=0;i<len;i++)
    {
        hs.add(stopwords[i]);
    }
    System.out.println(hs);
}

public List<String> removedText(List<String> S)
{
    Iterator<String> text = S.iterator();

    while(text.hasNext())
    {
        String token = text.next();
        if(hs.contains(token))
        {

                S.remove(text.next());
        }
        text = S.iterator();
    }
    return S;
}
Mureinik
Shorbhaja
  • looks good to me. how big is list S going to be? if it's especially large the solution might be to not load words into the list to begin with and do the processing on an Input/Output stream level. But I would only do that if you actually had a performance or memory problem with the current implementation. – slipperyseal Jan 20 '16 at 06:15
  • instead of removing the strings from the list (causing an internal copy down), you could set nulls where the stop words are. then when you output the list, ignore the nulls, or copy the list at the end, and exclude the nulls at that point. – slipperyseal Jan 20 '16 at 06:17
  • It's not removing the stopwords from the List. – Shorbhaja Jan 20 '16 at 07:47
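The copy-instead-of-remove idea from the comments above could look roughly like the sketch below. The class name CopyWithoutStopWords, the abbreviated stop-word set, and the sample tokens are assumptions made for illustration; they are not part of the original code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical demo class; the stop-word set is a short excerpt for illustration.
public class CopyWithoutStopWords {
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("a", "the", "is"));

    // Builds a new list and leaves the input untouched, so no elements are shifted.
    static List<String> withoutStopWords(List<String> tokens) {
        List<String> kept = new ArrayList<>(tokens.size());
        for (String token : tokens) {
            if (!STOP_WORDS.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        System.out.println(withoutStopWords(Arrays.asList("the", "dog", "is", "a", "friend")));
        // [dog, friend]
    }
}
```

Because the input list is never modified, this also works when the caller passes an unmodifiable or fixed-size list.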

4 Answers

2

You shouldn't modify the list directly while iterating over it. Moreover, you're calling next() twice in each pass of the loop, although hasNext() is only checked once per pass. Instead, you should use the iterator to remove the item:

public static List<String> removedText(List<String> s) {
    Iterator<String> text = s.iterator();

    while (text.hasNext()) {
        String token = text.next();
        if (hs.contains(token)) {
            text.remove();
        }
    }
    return s;
}

But that's a bit of "reinventing the wheel". Instead, you could just use the removeAll(Collection) method:

s.removeAll(hs);
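
As a rough, self-contained sketch of both options, the following compares Iterator.remove() with removeAll(). The class name StopWordDemo, the abbreviated stop-word set, and the sample tokens are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

// Hypothetical demo class; the stop-word set is a small excerpt for illustration.
public class StopWordDemo {
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("a", "the", "is", "of"));

    // Removes stop words in place through the iterator, which is safe during iteration.
    static List<String> removeWithIterator(List<String> tokens) {
        Iterator<String> it = tokens.iterator();
        while (it.hasNext()) {
            if (STOP_WORDS.contains(it.next())) {
                it.remove();
            }
        }
        return tokens;
    }

    // Same result using the built-in bulk operation.
    static List<String> removeWithRemoveAll(List<String> tokens) {
        tokens.removeAll(STOP_WORDS);
        return tokens;
    }

    public static void main(String[] args) {
        List<String> a = new ArrayList<>(Arrays.asList("the", "cat", "is", "a", "pet"));
        List<String> b = new ArrayList<>(a);
        System.out.println(removeWithIterator(a));   // [cat, pet]
        System.out.println(removeWithRemoveAll(b));  // [cat, pet]
    }
}
```

Both variants need a mutable list, so the sample data is wrapped in an ArrayList rather than passed as the fixed-size list returned by Arrays.asList.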
Mureinik
0

Maybe you can use org.apache.commons.lang.ArrayUtils inside the loop.

stopwords = (String[]) ArrayUtils.removeElement(stopwords, element);

https://commons.apache.org/proper/commons-lang/javadocs/api-2.6/org/apache/commons/lang/ArrayUtils.html

ZaoTaoBao
0

I think the most efficient way is to use the binarySearch method on a sorted list of terms:

int indexStop = Collections.binarySearch(EncyclopediaHelper.getStopWords(), string, String::compareToIgnoreCase);

boolean stop = indexStop >= 0;

More information here: What is the performance of Collections.binarySearch over manually searching a list?
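
A minimal sketch of this lookup follows, assuming the list is sorted with the same comparator that is used for the search. EncyclopediaHelper is not shown in the answer, so the class BinarySearchStopWords and its abbreviated word list are hypothetical stand-ins.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for the helper class mentioned in the answer.
public class BinarySearchStopWords {
    // Abbreviated list; it must be sorted with the comparator used for searching.
    static final List<String> SORTED_STOP_WORDS = sorted("the", "a", "is", "of");

    static List<String> sorted(String... words) {
        List<String> list = Arrays.asList(words);
        list.sort(String::compareToIgnoreCase);
        return list;
    }

    static boolean isStopWord(String token) {
        // binarySearch returns the index (>= 0) when the key is found,
        // or (-(insertion point) - 1) when it is not.
        int index = Collections.binarySearch(SORTED_STOP_WORDS, token, String::compareToIgnoreCase);
        return index >= 0;
    }

    public static void main(String[] args) {
        System.out.println(isStopWord("A"));    // true ("a" sits at index 0)
        System.out.println(isStopWord("cat"));  // false
    }
}
```

The first call shows why the check has to accept index 0: the first element of the sorted list is a valid match.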

Martin
-1

Try the changes suggested below:

public static List<String> removedText(List<String> S)
{
    Iterator<String> text = S.iterator();

    while(text.hasNext())
    {
        String token = text.next();
        if(hs.contains(token))
        {

                S.remove(token); ////Changed text.next() --> token
        }
       // text = S.iterator(); why the need to re-assign?
    }
    return S;
}
LChukka
  • Tried. It's not removing the token from S :( Also, I was previously getting an error related to comodification, which was mostly due to the list being modified during iteration, leaving the iterator in an inconsistent state. – Shorbhaja Jan 20 '16 at 07:41
  • 1
    java.util.ConcurrentModificationException – Shorbhaja Jan 20 '16 at 07:53
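
The exception mentioned in these comments can be reproduced with a small sketch like the one below, assuming the list is an ArrayList. The class name ComodificationDemo and the sample tokens are invented for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.ConcurrentModificationException;
import java.util.List;

// Hypothetical demo class showing why removing through the list fails mid-iteration.
public class ComodificationDemo {
    // Returns true if removing through the list (not the iterator) breaks the iteration.
    static boolean failsWhenRemovingViaList(List<String> tokens, String stopWord) {
        try {
            for (String token : tokens) {
                if (token.equals(stopWord)) {
                    tokens.remove(token); // structural change behind the iterator's back
                }
            }
            return false;
        } catch (ConcurrentModificationException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        List<String> tokens = new ArrayList<>(Arrays.asList("the", "cat", "sat"));
        System.out.println(failsWhenRemovingViaList(tokens, "the")); // true
    }
}
```

The iterator notices the change on its next call to next(), which is why removal has to go through Iterator.remove() or a bulk method such as removeAll().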