
I have a WordNet database set up, and I'm trying to generate synonyms for various words.

Take the word "greatest", for example. When I look it up I find several different synonyms, but none of them really fit the definition; one of them is "superlative".

I'm guessing that I need to do some sort of check, either by word frequency in a given language or by stemming a word to get its base word (for example, greatest -> great, great -> best).

What table should I be using to ensure my words make some modicum of sense?

alvas
Steven Matthews
  • Lemmatize, don't stem. Also, could you elaborate on "what table ... sense?" – Chthonic Project Dec 01 '14 at 00:46
  • Greatest to great can probably be handled by a part-of-speech tagger; see JJ, JJR, JJS here: https://gate.ac.uk/sale/tao/splitap7.html#x39-802000G. As a really far-fetched suggestion, you can look into word embeddings: https://code.google.com/p/word2vec/ Close words are not synonyms, but perhaps adjusting the model and training on the right data could generate synonyms. Or take the intersection between thesaurus results and word clusters. – Yasen Jan 05 '15 at 09:27

1 Answer


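Out of the box, neither the stemmer nor the lemmatizer gets you from greatest -> great (note that lemmatize() treats the word as a noun unless you pass a pos argument):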

>>> from nltk.stem import WordNetLemmatizer, PorterStemmer
>>> porter = PorterStemmer()
>>> wnl = WordNetLemmatizer()
>>> greatest = 'greatest'
>>> porter.stem(greatest)
u'greatest'
>>> wnl.lemmatize(greatest)
'greatest'
>>> greater = 'greater'
>>> wnl.lemmatize(greater)
'greater'
>>> porter.stem(greater)
u'greater'
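
One likely reason the lemmatizer leaves greatest untouched above is the part of speech: WordNet-style lemmatization only tries the "er"/"est" suffix rules when the word is treated as an adjective. Below is a minimal pure-Python sketch of that idea. It is not NLTK's implementation; the tiny LEXICON, the shortened RULES table, and the name detach_suffix are all made up for illustration.

```python
# Simplified, illustrative version of WordNet-style suffix detachment.
# LEXICON is a tiny stand-in for the real WordNet word list (assumption).
LEXICON = {
    "n": {"great", "dog"},            # 'great' can also be used as a noun
    "a": {"great", "large", "nice"},
}

# A few (suffix, replacement) pairs per part of speech; deliberately not exhaustive.
RULES = {
    "n": [("s", ""), ("ses", "s"), ("ies", "y")],
    "a": [("er", ""), ("est", ""), ("er", "e"), ("est", "e")],
}

def detach_suffix(word, pos="n"):
    """Return the first rule-derived form found in LEXICON, else the word unchanged."""
    if word in LEXICON[pos]:
        return word
    for old, new in RULES[pos]:
        if word.endswith(old):
            candidate = word[: -len(old)] + new
            if candidate in LEXICON[pos]:
                return candidate
    return word

print(detach_suffix("greatest"))           # noun rules do not match, word is unchanged
print(detach_suffix("greatest", pos="a"))  # adjective rule strips 'est'
print(detach_suffix("largest", pos="a"))   # needs the 'est' -> 'e' rule
```

With NLTK itself, the analogous switch is the pos argument of lemmatize (for example pos='a'), which is worth trying before writing custom rules.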

But it seems you can make use of some nice properties of the Penn Treebank tagset to get from greatest -> great:

>>> from nltk import pos_tag
>>> pos_tag(['greatest'])
[('greatest', 'JJS')]
>>> pos_tag(['greater'])
[('greater', 'JJR')]
>>> pos_tag(['great'])
[('great', 'JJ')]

Let's try a crazy rule-based system, starting from greatest:

>>> import re
>>> word1 = 'greatest'
>>> re.sub('est$', '', word1) 
'great'
>>> re.sub('est$', 'er', word1) 
'greater'
>>> pos_tag([re.sub('est$', '', word1)])[0][1]
'JJ'
>>> pos_tag([re.sub('est$', 'er', word1)])[0][1]
'JJR'
>>> word1
'greatest'

Now that we know we can build our own little superlative stemmer/lemmatizer/tail_substituter, let's write a rule: if a word gets the superlative POS tag (JJS), and our tail_substituter yields a word tagged JJ when we strip "est" and a word tagged JJR when we swap "est" for "er", then we can safely say that the base and comparative forms of the word can be obtained with our tail_substituter:

>>> if pos_tag([word1])[0][1] == 'JJS' \
... and pos_tag([re.sub('est$', '', word1)])[0][1] == 'JJ' \
... and pos_tag([re.sub('est$', 'er', word1)])[0][1] == 'JJR':
...     comparative = re.sub('est$', 'er', word1)
...     adjective = re.sub('est$', '', word1)
... 
>>> adjective
'great'
>>> comparative
'greater'
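
The check above can be wrapped in a small reusable function. This is only a sketch under stated assumptions: the name superlative_forms and the tag_of parameter are hypothetical, and the lookup-table tagger below exists purely so the example runs without NLTK.

```python
import re

def superlative_forms(word, tag_of):
    """Return (adjective, comparative) for a regular '-est' superlative, else None.

    tag_of is any callable that maps a single word to a Penn Treebank tag.
    """
    if tag_of(word) != "JJS" or not word.endswith("est"):
        return None
    adjective = re.sub(r"est$", "", word)
    comparative = re.sub(r"est$", "er", word)
    # Accept the guess only if the tagger agrees with both derived forms.
    if tag_of(adjective) == "JJ" and tag_of(comparative) == "JJR":
        return adjective, comparative
    return None

# Stand-in tagger for demonstration only (hypothetical lookup table).
FAKE_TAGS = {"greatest": "JJS", "greater": "JJR", "great": "JJ", "best": "JJS"}

def fake_tag(word):
    return FAKE_TAGS.get(word, "NN")

print(superlative_forms("greatest", fake_tag))  # ('great', 'greater')
print(superlative_forms("best", fake_tag))      # None: 'b' and 'ber' are not tagged JJ/JJR
```

With NLTK, the tagger argument could be something like lambda w: pos_tag([w])[0][1]. Returning None when the check fails avoids the undefined adjective and comparative variables you would get from the bare if statement.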

That gets you from greatest -> greater -> great. Going from great -> best is sort of weird, since lexically the two words are not related, although semantically they seem to be.

So I think it would be subjective to say that great -> best is a valid transformation.
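
To make the lexical point concrete: "best" is the irregular superlative of "good", so no suffix rule can reach it from "great". If irregular forms matter, one option is a small hand-written exception table. The sketch below is illustrative only; the table is not exhaustive and the names IRREGULAR_SUPERLATIVES and irregular_base are made up.

```python
# Hand-written exception table (illustrative, not exhaustive).
# Maps an irregular superlative to its (base, comparative) forms.
IRREGULAR_SUPERLATIVES = {
    "best": ("good", "better"),
    "worst": ("bad", "worse"),
    "most": ("much", "more"),
    "least": ("little", "less"),
}

def irregular_base(word):
    """Return (base, comparative) for a known irregular superlative, else None."""
    return IRREGULAR_SUPERLATIVES.get(word.lower())

print(irregular_base("best"))      # ('good', 'better')
print(irregular_base("greatest"))  # None: regular forms are left to the '-est' rule
```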

alvas
  • I'm not actually using NLTK, but Wordnet converted into a MySQL database. But I'll look into this - this seems like a reasonable solution. – Steven Matthews Dec 09 '14 at 02:47