Which similarity function in nltk.corpus.wordnet is appropriate for finding the similarity of two words?

Question

Which similarity function in nltk.corpus.wordnet is appropriate for finding the similarity of two words?

    path_similarity()?
    lch_similarity()?
    wup_similarity()?
    res_similarity()?
    jcn_similarity()?
    lin_similarity()?

I want to use the function for word clustering and for the Yarowsky algorithm, to find similar collocations in a large text.

score 7 · Answer 1 · answered Sep 13 '11 at 17:50

These measures are actually for word senses (or concepts), not words. That distinction might matter. For example, the word "train" can mean "locomotive" or "being taught to do something". To use these measures you'd need to know which sense was intended.
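To make the sense/word distinction concrete, here is a minimal sketch of one common workaround when the intended sense is unknown: score every pair of senses and keep the best score. This max-over-senses heuristic is an assumption added for illustration, not something the measures do for you. The names `best_sense_similarity` and `wordnet_similarity` are hypothetical helpers; the NLTK calls used are `wn.synsets()` and `Synset.wup_similarity()`, and any of the other measures could be swapped in.

```python
from itertools import product


def best_sense_similarity(word_a, word_b, senses_of, similarity):
    """Return the highest similarity over all sense pairs, or None.

    senses_of(word)   -> iterable of senses for that word
    similarity(a, b)  -> a number, or None when the pair is not comparable
    """
    scores = [
        similarity(a, b)
        for a, b in product(senses_of(word_a), senses_of(word_b))
    ]
    # Some measures return None for pairs with no connecting path.
    scores = [s for s in scores if s is not None]
    return max(scores) if scores else None


def wordnet_similarity(word_a, word_b):
    """Hypothetical wrapper that plugs WordNet noun senses into the helper."""
    # Imported here so the helper above works without NLTK installed.
    # Requires the WordNet data: nltk.download("wordnet")
    from nltk.corpus import wordnet as wn

    return best_sense_similarity(
        word_a,
        word_b,
        senses_of=lambda w: wn.synsets(w, pos=wn.NOUN),
        similarity=lambda a, b: a.wup_similarity(b),
    )
```

Calling `wordnet_similarity("train", "bus")` would compare every noun sense of "train" with every noun sense of "bus". Note that taking the maximum silently picks whichever senses happen to be closest, which may not be the senses used in your text.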

If you want to do word clustering, these measures might not be exactly what you want...

score 3 · Answer 2 · answered Sep 22 '11 at 20:38

I've been playing with NLTK/wordnet myself for the purposes of trying to match up some texts in some automatic way. As Ted Pedersen's answer notes, it pretty quickly becomes clear that the similarity functions in nltk.corpus.wordnet only produce non-zero similarities for quite closely related terms with a solid IS-A pedigree.

What I ended up doing was taking the vocabulary in my texts and using lemma->synset->lemmas and lemma->similar_tos to grow my own word linkage graph (graph_tool is fantastic for this). I then counted the minimum number of hops needed to link two words to get some sort of (dis-)similarity measure between them. It is quite entertaining to print these out; it's like watching a very bizarre word-association game. This did actually work well enough for my purposes, even without any attempt to take POS/sense into account.
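A rough sketch of the hop-counting idea, for anyone who wants to try it. It uses a plain breadth-first search from the standard library rather than graph_tool, and expands the graph lazily instead of building it up front, so it is not the exact code I used. `hop_distance`, `wordnet_neighbours` and the `max_hops` cutoff are hypothetical names and parameters chosen for illustration. The sketch calls `similar_tos()` on the synset, since in WordNet "similar to" is a relation between synsets.

```python
from collections import deque


def hop_distance(start, goal, neighbours, max_hops=6):
    """Minimum number of links between two words, or None if not reached.

    neighbours(word) -> iterable of words directly linked to that word.
    max_hops is an assumed cutoff to keep the search bounded.
    """
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        word, hops = queue.popleft()
        if hops == max_hops:
            continue  # do not expand beyond the cutoff
        for nxt in neighbours(word):
            if nxt == goal:
                return hops + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, hops + 1))
    return None


def wordnet_neighbours(word):
    """Words linked to `word` via shared synsets or similar-to synsets."""
    # Requires the WordNet data: nltk.download("wordnet")
    from nltk.corpus import wordnet as wn

    linked = set()
    for synset in wn.synsets(word):  # all parts of speech, all senses
        linked.update(lemma.name() for lemma in synset.lemmas())
        for similar in synset.similar_tos():
            linked.update(lemma.name() for lemma in similar.lemmas())
    linked.discard(word)
    return linked
```

With the WordNet data installed, `hop_distance("happy", "glad", wordnet_neighbours)` returns the number of links on the shortest chain found, or `None` if the two words are not connected within `max_hops`. A smaller count means the words are more closely associated.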
