5

Which similarity function in nltk.corpus.wordnet is appropriate for finding the similarity of two words?

    path_similarity()?
    lch_similarity()?
    wup_similarity()?
    res_similarity()?
    jcn_similarity()?
    lin_similarity()?

I want to use such a function for word clustering, and for the Yarowsky algorithm to find similar collocations in a large text.

Fred Foo
Masoud Abasian

2 Answers

7

These measures are actually for word senses (or concepts), not words, and that distinction might matter. For example, the word "train" can mean "locomotive" or "being taught to do something". To use these measures you'd need to know which sense was intended.

If you want to do word clustering, these measures might not be exactly what you want...
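If you don't know the intended sense, one common workaround is to compare every sense of one word with every sense of the other and keep the best score. Here is a minimal sketch of that idea. The names `max_sense_similarity` and `wordnet_word_similarity` are made up for illustration, and taking the maximum is an assumption: it effectively lets the measure pick the senses for you, which may or may not suit clustering.

```python
from itertools import product


def max_sense_similarity(word1, word2, senses_of, sense_similarity):
    """Best score over every pair of senses; None if no pair is comparable."""
    best = None
    for s1, s2 in product(senses_of(word1), senses_of(word2)):
        score = sense_similarity(s1, s2)
        # Some measures return None when two senses cannot be compared.
        if score is not None and (best is None or score > best):
            best = score
    return best


def wordnet_word_similarity(word1, word2):
    """The same idea wired to WordNet (needs NLTK and its wordnet corpus)."""
    from nltk.corpus import wordnet as wn  # imported lazily on purpose

    return max_sense_similarity(
        word1, word2, wn.synsets, lambda a, b: a.wup_similarity(b)
    )
```

With the WordNet data installed, `wordnet_word_similarity("train", "locomotive")` would score the closest pair of senses. Swapping in `path_similarity` or another measure only changes the lambda.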

Ted Pedersen
3

I've been playing with NLTK/wordnet myself, trying to match up some texts automatically. As Ted Pedersen's answer notes, it pretty quickly becomes clear that the similarity functions in nltk.corpus.wordnet only produce non-zero similarities for quite closely related terms with a solid IS-A pedigree.

What I ended up doing was taking the vocabulary in my texts and using lemma->synset->lemmas and lemma->similar_tos links to grow my own word linkage graph (graph_tool is fantastic for this). I then counted the minimum number of hops needed to link 2 words to get some sort of (dis-)similarity measure between them. It is quite entertaining to print these out; it is like watching a very bizarre word-association game. This did actually work well enough for my purposes, even without any attempt to take POS/sense into account.
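To give a rough idea of the hop counting, here is a minimal sketch that uses a plain breadth-first search rather than graph_tool. The names `hop_distance` and `wordnet_neighbours`, and the `max_hops` cut-off, are assumptions for illustration and not the code I actually ran. `wordnet_neighbours` links a word to the lemmas of its synsets and of their similar_tos synsets, without restricting to a text's vocabulary.

```python
from collections import deque


def hop_distance(start, goal, neighbours, max_hops=None):
    """Minimum number of links between two words, or None if not linked."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        word, hops = queue.popleft()
        if max_hops is not None and hops >= max_hops:
            continue  # do not expand beyond the cut-off
        for nxt in neighbours(word):
            if nxt == goal:
                return hops + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, hops + 1))
    return None


def wordnet_neighbours(word):
    """Words one link away via WordNet (needs NLTK and its wordnet corpus)."""
    from nltk.corpus import wordnet as wn  # imported lazily on purpose

    linked = set()
    for synset in wn.synsets(word):
        linked.update(synset.lemma_names())       # lemma -> synset -> lemmas
        for similar in synset.similar_tos():      # synset -> similar_tos
            linked.update(similar.lemma_names())
    linked.discard(word)
    return linked
```

Something like `hop_distance("train", "teach", wordnet_neighbours, max_hops=4)` would then give the hop count, with a larger count meaning less similar. The cut-off keeps the search from wandering across most of WordNet.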

timday