I want to extract relevant keywords from an HTML page.
I have already stripped all the HTML markup, split the text into words, applied a stemmer, and removed every word that appears in a stop word list from Lucene.
But I still have a lot of basic verbs and pronouns among the most common words.
Is there some method or word list in Lucene, Snowball, or anywhere else to filter out words like "I, is, go, went, am, it, were, we, you, us, ..."?
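For reference, here is a rough sketch of the pipeline described above, written in plain Python rather than with Lucene or Snowball. The names `extract_keywords`, `stop_words`, and `stem` are made up for illustration; the default `stem` does nothing and only stands in for a real stemmer, and the stop word set is whatever the caller passes in, not Lucene's list.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def extract_keywords(html, stop_words, stem=lambda w: w, top_n=10):
    """Return the top_n (word, count) pairs after the four steps above.

    stop_words and stem are placeholders (assumptions): pass in the real
    stop word list and a real stemming function to mirror the actual setup.
    """
    # 1. Strip the HTML markup.
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.parts).lower()

    # 2. Split the text into words (ASCII letters only, for simplicity).
    words = re.findall(r"[a-z]+", text)

    # 3. Stem each word.
    stems = [stem(w) for w in words]

    # 4. Drop everything that appears in the stop word list.
    kept = [s for s in stems if s not in stop_words]

    return Counter(kept).most_common(top_n)
```

With a small stop word set such as `{"the", "and"}`, a page containing "I went to the lake and we saw the lake birds" still yields "i", "went", and "we" alongside "lake", which is the problem I am running into.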