It seems my Google-fu is failing me.

Does anyone know of a freely available dictionary that contains only the base forms of words? For example, for strawberries it would have strawberry. It should NOT contain abbreviations, misspellings, or alternate spellings (such as UK versus US). Anything quickly usable in Java would be good, but even a text file of mappings, or anything else that could be read in, would be helpful.

Fred Foo
AHungerArtist

3 Answers

This is called lemmatization, and what you call the "base of a word" is called a lemma. morpha and its reimplementation in the Stanford POS tagger do this. Both, however, require POS-tagged input to resolve the ambiguity inherent in natural language.

(POS tagging means determining the word categories, e.g. noun, verb. I've been assuming you want a tool that handles English.)

Edit: since you're going to use this for search, here are a few tips:

  • Simple stemming for English has a mixed reputation in the search engine world. Sometimes it works, often it doesn't.
  • Automatic spelling correction may work better. This is what Google does. It's expensive in terms of computing time, though, if you want to do it right.
  • Lemmatization may provide benefits, but probably only if you index and search for both the words and the lemmas. (Same advice goes for stemming.)
  • Here's a plugin for Lucene that does lemmatization.

(Preceding remarks are based on my own research; I wrote my master's thesis about lemmatization in search engines for very noisy data.)
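
The advice above about indexing and searching for both the words and the lemmas can be sketched roughly as follows. This is a toy inverted index, not the Lucene plugin mentioned above; the class `DualFormIndex` and its lemma table are hypothetical names made up for illustration, and a real system would get its lemmas from a lemmatizer or a curated word list.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy inverted index that files each token under its surface form and its lemma. */
final class DualFormIndex {
    // Hypothetical lemma table (inflected form -> lemma), supplied by the caller.
    private final Map<String, String> lemmas;
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    DualFormIndex(Map<String, String> lemmas) {
        this.lemmas = lemmas;
    }

    /** Returns the lowercased token plus its lemma, if the table knows one. */
    private Set<String> forms(String token) {
        String t = token.toLowerCase(Locale.ROOT);
        Set<String> out = new LinkedHashSet<>();
        out.add(t);
        String lemma = lemmas.get(t);
        if (lemma != null) {
            out.add(lemma);
        }
        return out;
    }

    /** Indexes every token of the text under both of its forms. */
    void add(int docId, String text) {
        for (String token : text.split("\\W+")) {
            if (token.isEmpty()) {
                continue;
            }
            for (String form : forms(token)) {
                postings.computeIfAbsent(form, k -> new TreeSet<>()).add(docId);
            }
        }
    }

    /** A single search call that looks up both forms of the query term. */
    Set<Integer> search(String term) {
        Set<Integer> hits = new TreeSet<>();
        for (String form : forms(term)) {
            hits.addAll(postings.getOrDefault(form, Set.of()));
        }
        return hits;
    }

    public static void main(String[] args) {
        DualFormIndex index = new DualFormIndex(Map.of("strawberries", "strawberry"));
        index.add(1, "Fresh strawberries for sale");
        index.add(2, "A strawberry tart");
        System.out.println(index.search("strawberries")); // [1, 2]
        System.out.println(index.search("strawberry"));   // [1, 2]
    }
}
```

Because the original token is always kept, a word the lemma table does not know still matches exactly; the lemma only adds extra matches.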

Fred Foo
  • I want something that is always accurate (though not necessarily complete), which that doesn't seem like it can provide (nor can I possibly categorize all the potential words). I'd rather have some words not be appropriately lemmatized (?) than have any incorrect ones. – AHungerArtist Oct 26 '10 at 15:40
  • Then you need a simple word list, since these programs represent the state-of-the-art in POS tagging and lemmatization. (Categorizing the words is by the way exactly what the Stanford POS tagger does. It's not exactly plug-and-play, though.) – Fred Foo Oct 26 '10 at 15:51
  • Right, that is what I'm looking for, a simple word list. I'm using a dictionary now that has what I'm looking for, but it's also full of alternate spellings, abbreviations, and other such things so that it's not as useful as it could be. – AHungerArtist Oct 26 '10 at 16:06
  • In any case, thanks for the input. If I don't find anything else, I will look into this work a little more closely and just see exactly what kind of results I can get from it. – AHungerArtist Oct 26 '10 at 16:12
  • I find that stemming works pretty well for searching so long as you run the data through the stemmer when you index it **and** run the query string through the same stemmer. Have done this with Lucene with excellent results. – Qwerky Oct 26 '10 at 16:19
  • @Qwerky: yes, it may work, but it doesn't always, depending on document set and query quality. It's something to try, though. (Indexing and searching for both stemmer output and the original terms may work even better.) – Fred Foo Oct 26 '10 at 16:25
  • I can't really afford to do two searches as speed is of the essence, though that almost certainly would give me optimal results. And currently I am running both the index (in my case, a trie) and the input through the substitution but it only works best when the full word is given as input. If there's only a partial string, it can end up not returning any results depending on how a word is substituted (or stemmed if I went that route). – AHungerArtist Oct 26 '10 at 16:45
  • One search for double the number of keywords may be faster. Lemmatizing may be slow, though. – Fred Foo Oct 26 '10 at 18:40

This isn't exactly what you're asking for, but the Wikipedia article on stemming was enlightening and contains a number of links to free stemming programs, which presumably should include lists of word stems.

The Archetypal Paul
  • The problem with stemmers is that they tend to produce bogus output such as "strawberri". – Fred Foo Oct 26 '10 at 15:34
  • @larsmans: eh, but since 'strawberri' is not a correct English word, isn't it trivial to run the stemmer's output through a spellchecker that would then return 'strawberry' as a suggestion? – SyntaxT3rr0r Oct 26 '10 at 17:59
  • True, but stemmers can give far worse results than that. Might work, though. Might. (Paul's reasoning that stemmers "should include lists of word stems" is not generally true btw., as many stemmers are just simple string algorithms.) – Fred Foo Oct 26 '10 at 18:28
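
To illustrate the point in the comments that many stemmers are just simple string algorithms, here is a deliberately naive suffix stripper. It is a sketch made up for this thread (the class `NaiveStemmer` and its suffix list are hypothetical), not the Porter algorithm or any published stemmer, but it produces the same kind of bogus output.

```java
/** Deliberately naive suffix stripping; not the Porter algorithm or any published stemmer. */
final class NaiveStemmer {
    // Suffixes are tried in this order; the first one that matches is removed.
    private static final String[] SUFFIXES = {"ing", "es", "s"};

    static String stem(String word) {
        for (String suffix : SUFFIXES) {
            int stemLength = word.length() - suffix.length();
            // Keep at least three characters so very short words are left alone.
            if (word.endsWith(suffix) && stemLength >= 3) {
                return word.substring(0, stemLength);
            }
        }
        return word;
    }

    public static void main(String[] args) {
        System.out.println(stem("strawberries")); // strawberri
        System.out.println(stem("adding"));       // add
        System.out.println(stem("news"));         // new
    }
}
```

"strawberri" is the sort of non-word a spellchecker could flag, but "news" becoming "new" is a wrong stem that is itself a valid word, so a spellcheck pass would not catch it.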

http://www.puzzlers.org/dokuwiki/doku.php?id=solving:wordlists:about:start

The Merriam-Webster's Collegiate 9th Edition link on this page contains a word file of only the root forms of words. "Strawberry" is in there; "strawberries" is not. Likewise, "add" is in there; "adding" is not. Not sure if this is what you are after, but it was helpful for me.
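
A word file like that could be used from Java along the following lines. This is only a sketch: it assumes the file has one root form per line (I have not checked the actual format), and the class `BaseFormLookup` and its handful of rewrite rules are hypothetical and far from complete for English. A candidate base form is accepted only if it appears in the list, so unknown words return nothing rather than a guess.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Optional;
import java.util.Set;

/** Maps an inflected word to a base form only when that base is in a trusted word list. */
final class BaseFormLookup {
    private final Set<String> baseForms;

    private BaseFormLookup(Set<String> baseForms) {
        this.baseForms = baseForms;
    }

    /** Assumes one base form per line; blank lines are skipped. */
    static BaseFormLookup fromReader(Reader wordList) {
        Set<String> words = new HashSet<>();
        try (BufferedReader in = new BufferedReader(wordList)) {
            String line;
            while ((line = in.readLine()) != null) {
                String word = line.trim().toLowerCase(Locale.ROOT);
                if (!word.isEmpty()) {
                    words.add(word);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new BaseFormLookup(words);
    }

    /** Returns a listed base form, or empty if no candidate is in the list. */
    Optional<String> baseOf(String word) {
        for (String candidate : candidates(word.toLowerCase(Locale.ROOT))) {
            if (baseForms.contains(candidate)) {
                return Optional.of(candidate);
            }
        }
        return Optional.empty();
    }

    // Illustrative rewrite rules only; the word itself is always tried first.
    private static List<String> candidates(String w) {
        List<String> out = new ArrayList<>();
        out.add(w);
        if (w.endsWith("ies")) out.add(w.substring(0, w.length() - 3) + "y");
        if (w.endsWith("es")) out.add(w.substring(0, w.length() - 2));
        if (w.endsWith("s")) out.add(w.substring(0, w.length() - 1));
        if (w.endsWith("ing")) out.add(w.substring(0, w.length() - 3));
        return out;
    }

    public static void main(String[] args) {
        BaseFormLookup lookup = fromReader(new StringReader("strawberry\nadd\n"));
        System.out.println(lookup.baseOf("strawberries")); // Optional[strawberry]
        System.out.println(lookup.baseOf("adding"));       // Optional[add]
        System.out.println(lookup.baseOf("colours"));      // Optional.empty
    }
}
```

For a real file, the reader could come from `Files.newBufferedReader` instead of the `StringReader` used here. Note that this is only as accurate as the list: if "new" is listed but "news" is not, "news" would still be mapped to "new".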