
Given a set of words V, I would like to group the synonyms in V together. I am wondering whether there is any built-in function in NLTK/WordNet that takes V as input and automatically clusters the words by synonymy.

I already know how to extract the synonyms of each word, but that is not what I am looking for. If I do that, the problem becomes complicated when the synonym sets intersect each other, or are subsets/supersets of each other, which would require writing a function to resolve the conflicts.

As an example, let's consider

V = ["good","constipate","bad","nice","defective","right","respectable","powerful"]

What I want to get as output is:

[('constipate',), ('nice',), ('bad', 'defective'), ('good', 'powerful', 'respectable', 'right')]

Depending on the desired size/number of clusters, some sets might split into several sets or be merged together. Here I only care about the words in V and their synonyms within V.
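Since WordNet has no built-in clustering call, one way to get exactly this grouping is to treat "shares a synonym" as an edge and take connected components, e.g. with union-find. This is a minimal sketch, not an NLTK API: `cluster_by_synonymy` and the `get_synonyms` callback are my own illustrative names, and the WordNet-backed lookup is only indicated in the comment.

```python
from collections import defaultdict

def cluster_by_synonymy(words, get_synonyms):
    """Group words into connected components, joining two words
    whenever one occurs in the other's synonym set."""
    parent = {w: w for w in words}

    def find(w):
        # Union-find lookup with path halving.
        while parent[w] != w:
            parent[w] = parent[parent[w]]
            w = parent[w]
        return w

    vocab = set(words)
    for w in words:
        for s in get_synonyms(w):
            if s in vocab:
                parent[find(w)] = find(s)  # union the two groups

    groups = defaultdict(list)
    for w in words:
        groups[find(w)].append(w)
    return sorted(tuple(sorted(g)) for g in groups.values())

# With NLTK available, get_synonyms could be backed by WordNet, e.g.:
#   from nltk.corpus import wordnet
#   def wn_synonyms(word):
#       return {l.name() for s in wordnet.synsets(word)
#               for l in s.lemmas()}
```

Because every word starts as its own singleton, intersecting or nested synonym sets merge automatically, so no separate conflict-resolution pass is needed.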

  • If there's no defined number of clusters you want, it's a harder problem. – alvas Dec 12 '17 at 01:27
  • @alvas OK, if I set the number of clusters, is there any function doing this clustering? – Mila Dec 12 '17 at 07:14
  •
    Yes you can use k-means but first you have to get from word -> synsets -> synset distance -> cluster based on synset-lemma distance. Which isn't trivial. It's easier to do word2vec or LDA in gensim given a large corpus. – alvas Dec 12 '17 at 07:28
  • @alvas thank you for reply. I did it in word2vec and using k-means clustering. I will give a try using synset distance to see how results are different from word2vec... – Mila Dec 12 '17 at 07:39

1 Answer


Yes, there is a way to do it using nltk and WordNet. The following is an example: I am using the built-in synsets and looking up the synonyms of 'book'.

from nltk.corpus import wordnet

synonyms = []

# Collect the lemma names from every synset of 'book'.
for syn in wordnet.synsets('book'):
    for lemma in syn.lemmas():
        synonyms.append(lemma.name())

The resulting synonyms for 'book' are:

print(synonyms)
>>['book', 'book', 'volume', 'record', 'record_book', 'book', 'script', 'book', 'playscript', 'ledger', 'leger', 'account_book', 'book_of_account', 'book', 'book', 'book', 'rule_book', 'Koran', 'Quran', "al-Qur'an", 'Book', 'Bible', 'Christian_Bible', ..]

The number of synonyms:

len(synonyms)
>>38

Note: some synonyms are verb forms, and many are just different senses of 'book'. If instead we take the set of synonyms, there are fewer unique words, as the following shows:

len(set(synonyms))
>>25

After applying the set operation, the unique synonyms are:

{'record', 'Quran', 'Holy_Scripture', 'Koran', 'Good_Book', 'playscript', 'book', 'Word_of_God', 'hold', 'Holy_Writ', 'script', 'leger', 'book_of_account', 'Scripture', 'ledger', 'reserve', 'volume', 'record_book', "al-Qur'an", 'Christian_Bible', 'Word', 'rule_book', 'Bible', 'Book', 'account_book'}
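To connect this back to the question, each word's collected synonyms can be intersected with V so only in-vocabulary synonyms remain. A small sketch — `restrict_to_vocab` is a hypothetical helper name, and the commented lines indicate how it might be fed from WordNet:

```python
def restrict_to_vocab(synonyms, word, vocab):
    # Keep only synonyms that are other words of the vocabulary V,
    # dropping duplicates and the word itself.
    return set(synonyms) & (set(vocab) - {word})

# Fed from WordNet, e.g.:
#   from nltk.corpus import wordnet
#   names = {l.name() for s in wordnet.synsets('good')
#            for l in s.lemmas()}
#   restrict_to_vocab(names, 'good', V)
```

Because the result is a set, repetitions inside one word's synonym list disappear, but overlaps *between* different words' synonym sets still have to be merged separately, as the question points out.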
  • Thank you for the answer, but this is not exactly what I am looking for; I added an example. In your case, do you think that if I find all the synonyms of the words in V and then use the `set` function, repetitions will be removed between the synonym sets, so there are no intersections or conflicts between sets? – Mila Dec 11 '17 at 17:16
  • @user5996916 yes, you can give it a try using the `set` function. It will only return unique words from the list of synonyms! – i.n.n.m Dec 11 '17 at 17:18
  • @user5996916 I think when you use `set` it gives only the unique synonyms. For example, I tried `good` and got 65 synonyms, and when I used `set` I got only 37. – i.n.n.m Dec 11 '17 at 17:25