4

I'm thinking of adding stop-word removal to my similarity program, followed by a stemmer (Porter 1 or 2, depending on which is easier to implement).

I read my text from files as whole lines and store them as one long string. So suppose I have two strings, e.g.:

String one = "I decided buy something from the shop.";
String two = "Nevertheless I decidedly bought something from a shop.";

Now that I have those strings:

Stemming: Can I just run the stemmer algorithm directly on the string, save the result as a String, and then continue with the similarity computation as I did before adding the stemmer, with something like one.stem()?

Stop words: How does this work? O.o Do I just use one.replaceAll("I", ""), or is there a specific way to do this? I want to keep working with a string, and end up with a string, before running the similarity algorithms on it. Wikipedia doesn't say a lot.

Hope you can help me out! Thanks.

Edit: This is for a school project where I'm writing a paper on similarity between different algorithms, so I don't think I'm allowed to use Lucene or other libraries that do the work for me. Also, I would like to understand how it works before I start using libraries like Lucene. Hope it's not too much of a bother ^^

skaffman
N00programmer

3 Answers

11

If you're not implementing this for academic reasons you should consider using the Lucene library. In either case it might be good for reference. It has classes for tokenization, stop word filtering, stemming and similarity. Here's a quick example using Lucene 3.0 to remove stop words and stem an input string:

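import java.io.IOException;
import java.io.StringReader;
import java.util.HashSet;
import java.util.Set;
import org.apache.lucene.analysis.PorterStemFilter;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.lucene.util.Version;

public static String removeStopWordsAndStem(String input) throws IOException {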
    Set<String> stopWords = new HashSet<String>();
    stopWords.add("a");
    stopWords.add("I");
    stopWords.add("the");

    TokenStream tokenStream = new StandardTokenizer(
            Version.LUCENE_30, new StringReader(input));
    tokenStream = new StopFilter(true, tokenStream, stopWords);
    tokenStream = new PorterStemFilter(tokenStream);

    StringBuilder sb = new StringBuilder();
    TermAttribute termAttr = tokenStream.getAttribute(TermAttribute.class);
    while (tokenStream.incrementToken()) {
        if (sb.length() > 0) {
            sb.append(" ");
        }
        sb.append(termAttr.term());
    }
    return sb.toString();
}

Which if used on your strings like this:

public static void main(String[] args) throws IOException {
    String one = "I decided buy something from the shop.";
    String two = "Nevertheless I decidedly bought something from a shop.";
    System.out.println(removeStopWordsAndStem(one));
    System.out.println(removeStopWordsAndStem(two));
}

Yields this output:

decid bui someth from shop
Nevertheless decidedli bought someth from shop
WhiteFang34
0

Yes, you can wrap any stemmer so that you can write something like

String stemmedString = stemmer.stemAndRemoveStopwords(inputString, stopWordList);

Internally, your stemAndRemoveStopwords would

  • place all stop words in a hash-based Set (or Map) for fast lookup
  • initialize an empty StringBuilder to hold the output string
  • iterate over all words in the input string, and for each word
    • look it up in the stop-word collection; if found, continue to the top of the loop
    • otherwise, stem it using your preferred stemmer, and add it to the output string
  • return the output string
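
The steps above can be sketched roughly as follows. This is only an illustration: the class name StemPipeline, the three-argument signature, and the toy "stemmer" that strips a trailing "ed" are all assumptions made for the example, standing in for your own Porter implementation. The tokenization (lowercasing and splitting on anything that is not a letter or digit) is also a simplification.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.function.UnaryOperator;

public class StemPipeline {

    // Hypothetical helper: 'stemmer' stands in for your own stemmer implementation.
    public static String stemAndRemoveStopwords(String input,
                                                Set<String> stopWords,
                                                UnaryOperator<String> stemmer) {
        StringBuilder out = new StringBuilder(input.length());
        // Lowercase, then split on any run of characters that are not letters or digits.
        for (String word : input.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (word.isEmpty() || stopWords.contains(word)) {
                continue; // Skip empty tokens and stop words.
            }
            if (out.length() > 0) {
                out.append(' '); // Separate output words with a single space.
            }
            out.append(stemmer.apply(word));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Stop words are stored lowercased because the input is lowercased above.
        Set<String> stopWords = new HashSet<>(Arrays.asList("a", "i", "the"));
        // Toy stand-in for a real stemmer: it only strips a trailing "ed".
        UnaryOperator<String> toyStemmer =
                w -> w.endsWith("ed") ? w.substring(0, w.length() - 2) : w;
        System.out.println(stemAndRemoveStopwords(
                "I decided buy something from the shop.", stopWords, toyStemmer));
        // prints "decid buy something from shop"
    }
}
```

Passing the stemmer in as a function keeps the stop-word logic independent of whichever stemming algorithm you end up implementing.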
tucuxi
  • Wait, so what you are saying is that there's already a stop-word function in Porter's stemmer? O.o Sorry, I think I'm not getting it. Could you explain it a little more? I was wondering whether Porter's stemmer already has a function like that or not. If it does, it would be easier to use ;) – N00programmer May 25 '11 at 17:06
  • @N00 - a stemmer is just an algorithm that trims words down to their stems. It has no notion of stop words, but removing them is really easy with a simple hash map (or hash set): put all your stop words in it, and before you stem an input word, check whether it is in there; if it is, discard it instead of stemming it. – tucuxi May 25 '11 at 17:21
  • Yes, it seems that I'm making a bigger deal out of it than it is. Thanks for answering. – N00programmer May 26 '11 at 09:36
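
As a small illustration of why the lookup described in these comments works on whole words rather than on the raw string, here is a sketch comparing it with the replaceAll idea from the question. The class name StopWordDemo and the method removeStopWords are made up for this example, and splitting on whitespace is a simplification that leaves punctuation attached to words.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class StopWordDemo {

    // Token-based removal: a word is dropped only when the whole token matches.
    public static String removeStopWords(String text, Set<String> stopWords) {
        List<String> kept = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty() && !stopWords.contains(token)) {
                kept.add(token);
            }
        }
        return String.join(" ", kept);
    }

    public static void main(String[] args) {
        String text = "Nevertheless the shop";

        // Plain replaceAll also matches inside other words and leaves a double space.
        System.out.println(text.replaceAll("the", ""));
        // prints "Neverless  shop"

        Set<String> stopWords = new HashSet<>(Arrays.asList("a", "I", "the"));
        System.out.println(removeStopWords(text, stopWords));
        // prints "Nevertheless shop"
    }
}
```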
0

You don't have to process the whole text at once. Just split it, apply your stop-word filter and stemming algorithm, then build the string again using a StringBuilder:

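// stopwordFilter and stemmer stand for your own implementations.
StringBuilder builder = new StringBuilder(text.length());
String[] words = text.split("\\s+");
for (String word : words) {
    if (stopwordFilter.check(word)) { // Keep only words that pass the stop-word filter.
        word = stemmer.stem(word); // Apply stemming algorithm.
        if (builder.length() > 0) {
            builder.append(' '); // Separate words with a single space.
        }
        builder.append(word);
    }
}
text = builder.toString();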
Eser Aygün
  • @Eser Aygün : Ahh, but the problem here is that I'm running Levenshtein as one of the algorithms, and it works best on the text as a whole string rather than as tokens. That's why I want to run the preprocessing on the whole string and end up with a string that I can throw into the similarity machine, instead of rewriting the Levenshtein algorithm to compare tokens. – N00programmer May 25 '11 at 12:27
  • Oh, ok. Then why not just join the tokens using a StringBuilder? It's still easier than dealing with the whole text. – Eser Aygün May 25 '11 at 14:09
  • @Eser Aygün : hmmm... you mean first split it into tokens, remove the stop words, stem them, and then build the string again before running Levenshtein on it? :o – N00programmer May 25 '11 at 17:01
  • @N00programmer Exactly. Why does this scare you? – Eser Aygün May 26 '11 at 06:30
  • @Eser Aygün : lol, no, it doesn't scare me. I'm fairly new to programming, so I don't know that much about it. The reason I'm asking so much is to be sure that I'm not misunderstanding anything. ;) Oh, one little question: is there a big difference between StringBuilder and StringBuffer? I used StringBuffer and it does the job, but both you and WhiteFang use the other one, so I'm wondering whether it's bad to use StringBuffer, since I will be using big strings later on too. – N00programmer May 26 '11 at 09:32
  • @N00programmer Ok. Sorry :) StringBuffer is synchronized, which means that multiple threads can work on a StringBuffer safely. StringBuilder, on the other hand, is not synchronized, and therefore it is a little bit faster. In your case, StringBuilder would be the right choice. – Eser Aygün May 26 '11 at 10:57
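
To tie the comment thread together: once the tokens have been filtered, stemmed, and joined back into a single string, a string-based Levenshtein routine can be run on the result unchanged. Below is a minimal sketch of the standard dynamic-programming edit distance; the class name LevenshteinDemo is made up for the example, and the two inputs in main are simply the preprocessed strings shown in the Lucene answer above.

```java
public class LevenshteinDemo {

    // Classic edit distance (insert, delete, substitute, each costing 1),
    // computed with two rolling rows instead of a full matrix.
    public static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // Distance from the empty prefix of a to b[0..j).
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // Distance from a[0..i) to the empty prefix of b.
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(
                        Math.min(curr[j - 1] + 1, prev[j] + 1),
                        prev[j - 1] + cost);
            }
            int[] tmp = prev; // Swap rows for the next iteration.
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        // Strings after stop-word removal and stemming, rejoined with spaces.
        String one = "decid bui someth from shop";
        String two = "Nevertheless decidedli bought someth from shop";
        System.out.println(levenshtein(one, two));
    }
}
```

Because only two rows are kept in memory, the extra space grows with the length of one string rather than with the product of both lengths, which helps with the larger inputs mentioned in the comments.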