6

Is there a large CSV/XML (or similar) file somewhere that contains a list of English verbs and their variations (e.g. sell -> sold, sale, selling, seller, sellee)?

I imagine this will be useful for NLP systems, but there doesn't seem to be a listing anywhere, or it could be my terrible googling skills. Does anybody have a clue otherwise?

kamziro

3 Answers

4

Consider Catvar:

A Categorial-Variation Database (or Catvar) is a database of clusters of uninflected words (lexemes) and their categorial (i.e. part-of-speech) variants. For example, the words hunger(V), hunger(N), hungry(AJ) and hungriness(N) are different English variants of some underlying concept describing the state of being hungry. Another example is the developing cluster: (develop(V), developer(N), developed(AJ), developing(N), developing(AJ), development(N)).
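
To make the cluster idea concrete, here is a minimal sketch of how such clusters could be held in memory and queried. The two clusters are copied from the description quoted above; the tuple layout and the helper names `build_index` and `variants` are assumptions for illustration and do not reflect Catvar's actual file format.

```python
from collections import defaultdict

# Clusters copied from the two examples in the quoted description.
# The (word, part-of-speech) tuple layout is an assumption for illustration,
# not Catvar's real on-disk format.
CLUSTERS = [
    [("hunger", "V"), ("hunger", "N"), ("hungry", "AJ"), ("hungriness", "N")],
    [("develop", "V"), ("developer", "N"), ("developed", "AJ"),
     ("developing", "N"), ("developing", "AJ"), ("development", "N")],
]


def build_index(clusters):
    """Map each word to the indices of the clusters that contain it."""
    index = defaultdict(set)
    for i, cluster in enumerate(clusters):
        for word, _pos in cluster:
            index[word].add(i)
    return index


def variants(word, clusters=CLUSTERS, index=None):
    """Return the sorted (word, pos) pairs that share a cluster with `word`."""
    if index is None:
        index = build_index(clusters)
    found = set()
    for i in index.get(word, ()):
        found.update(clusters[i])
    return sorted(found)


print(variants("developer"))
```

Looking up any member of a cluster returns the whole cluster, so `variants("hunger")` includes `("hungry", "AJ")`, and an unknown word yields an empty list.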

Kenston Choi
  • CatVar doesn't seem to be available anymore; the link is broken. Do you know where else I can find it? – Ogaday Jul 03 '16 at 20:26
  • 1
    You could try emailing the authors of the paper to ask for the official version. I found an unofficial copy on GitHub (https://github.com/bolei/trigram-classifier/tree/master/src/main/resources/script/catvar21). – Kenston Choi Jul 06 '16 at 06:45
3

I am not sure what you are looking for but I think WordNet -- a lexical database for the English language -- would be a good place to start. Read more at http://wordnet.princeton.edu/

The page linked above says:

WordNet's structure makes it a useful tool for computational linguistics and natural language processing.

One
1

Consider getting a dump of Wiktionary and extracting this information from it.
For example, http://en.wiktionary.org/wiki/sell lists many of the forms of the word (sells, selling, sold).
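
As a rough sketch of what the extraction could look like, the snippet below pulls verb forms out of a piece of entry wikitext. It assumes the entry carries a headword template of the form `{{en-verb|sells|selling|sold}}`; the sample text and the helper name `verb_forms` are made up for illustration, and real entries may use other parameter layouts (or no parameters at all for regular verbs), which this does not handle.

```python
import re

# Assumed headword template form: {{en-verb|sells|selling|sold}}
# Templates without arguments (e.g. {{en-verb}}) are intentionally not matched.
EN_VERB = re.compile(r"\{\{en-verb\|([^{}]*)\}\}")


def verb_forms(wikitext):
    """Return the positional arguments of the first en-verb template, or []."""
    match = EN_VERB.search(wikitext)
    if match is None:
        return []
    parts = [p.strip() for p in match.group(1).split("|")]
    # Skip named parameters such as key=value; keep positional forms only.
    return [p for p in parts if p and "=" not in p]


# Hypothetical excerpt of an entry's wikitext, for illustration only.
sample = (
    "===Verb===\n"
    "{{en-verb|sells|selling|sold}}\n"
    "# To transfer goods in exchange for money."
)
print(verb_forms(sample))  # ['sells', 'selling', 'sold']
```

A full dump is XML with the wikitext embedded per page, so in practice this function would be applied to each page's text after parsing the XML.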

If your aim is simply to normalize words to a canonical base form, consider using a lemmatizer or stemmer. Try playing with morpha, which is a really good English lemmatizer.
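
To illustrate what normalizing to a base form involves, here is a deliberately crude toy lemmatizer. It is not morpha and does not reproduce its rules; the irregular-form table, the suffix rules, and the function name `lemmatize` are all assumptions chosen for the example.

```python
# Irregular forms need an explicit table; these entries are illustrative only.
IRREGULAR = {"sold": "sell", "went": "go", "bought": "buy"}

# (suffix, replacement) rules tried in order; deliberately crude.
SUFFIX_RULES = [("ies", "y"), ("ing", ""), ("ed", ""), ("s", "")]


def lemmatize(word):
    """Return a guessed base form for an English verb form."""
    w = word.lower()
    if w in IRREGULAR:
        return IRREGULAR[w]
    for suffix, replacement in SUFFIX_RULES:
        # Require a stem of at least 3 letters so short words are left alone.
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[: len(w) - len(suffix)] + replacement
    return w


for form in ["sells", "selling", "sold", "carries"]:
    print(form, "->", lemmatize(form))
```

The sketch maps sells, selling and sold to "sell", but it also turns "making" into "mak", which is exactly the kind of error a real lemmatizer with proper morphological rules avoids.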

Aditya Mukherji