
According to the documentation, I can load a sense-tagged corpus in NLTK like this:

>>> from nltk.corpus import wordnet_ic
>>> brown_ic = wordnet_ic.ic('ic-brown.dat')
>>> semcor_ic = wordnet_ic.ic('ic-semcor.dat')

I can also get the definition, POS, offset, and examples like this:

>>> wn.synset('dog.n.01').examples()
>>> wn.synset('dog.n.01').definition()

But how can I get the frequency of a synset from a corpus? To break the question down:

  1. First, how do I count how many times a synset occurs in a sense-tagged corpus?
  2. Then, how do I divide that count by the total number of occurrences of all synsets for the given lemma?
  • In the lemma section of the documentation it shows some counts, but I'm not sure what they are: http://nltk.googlecode.com/svn/trunk/doc/howto/wordnet.html – alvas Mar 21 '13 at 15:10

2 Answers


I managed to do it this way.

from nltk.corpus import wordnet as wn

word = "dog"
synsets = wn.synsets(word)

sense2freq = {}
for s in synsets:
  freq = 0
  for lemma in s.lemmas():   # lemmas() is a method in NLTK 3
    freq += lemma.count()    # sense-tagged frequency from the corpus
  sense2freq["%s-%s" % (s.offset(), s.pos())] = freq

for s in sense2freq:
  print(s, sense2freq[s])
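Step 2 of the question (turning raw counts into relative frequencies) can then be done without touching NLTK again. A minimal sketch, where the `sense2freq` dict is hardcoded with hypothetical values standing in for the output of the loop above:

```python
# Hypothetical counts for three senses of a lemma, standing in for the
# sense2freq dict built above (real values would come from lemma.count()).
sense2freq = {"2084071-n": 42, "9886220-n": 1, "7692347-n": 0}

total = sum(sense2freq.values())
# Relative frequency of each sense given the lemma.
sense2relfreq = {sense: count / total for sense, count in sense2freq.items()}

print(sense2relfreq)
```

Note that any sense with a raw count of 0 stays at probability 0 here, which is exactly the issue raised in the comments below.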
    I would not rely on `lemma.count()`: many entries are zero, and there is no information about which corpus the frequency data was taken from. See also [this related question](http://stackoverflow.com/questions/5928704/how-do-i-find-the-frequency-count-of-a-word-in-english-using-wordnet/12376620#12376620) – Suzana Mar 21 '13 at 15:44
  • Thanks for the note on the zero counts. It's crude smoothing, but I smoothed them with Laplace; at least getting 0.001 is better than getting 0 and breaking the other subsystems in the pipeline =) – alvas Mar 21 '13 at 16:07
  • Unfortunately the sum is not the same as the displayed sense frequency when using the WordNet online to look up senses. The latter is the useful number in my opinion. – Radio Controlled Nov 22 '21 at 08:54
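The add-one (Laplace) smoothing mentioned in the comments above can be sketched on the same kind of counts dict; the sense keys and counts here are hypothetical:

```python
# Hypothetical raw counts; some senses have a zero count, as noted above.
sense2freq = {"2084071-n": 42, "9886220-n": 1, "7692347-n": 0}

# Add-one (Laplace) smoothing: add 1 to every count before normalising,
# so no sense ends up with probability exactly 0.
smoothed_total = sum(sense2freq.values()) + len(sense2freq)
sense2prob = {sense: (count + 1) / smoothed_total
              for sense, count in sense2freq.items()}

print(sense2prob)  # every probability is now > 0
```

This trades a little probability mass from the frequent senses to the unseen ones, which is usually an acceptable cost for keeping downstream components from choking on zeros.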

If you only need the most frequent sense, you can use wn.synsets(word)[0], since WordNet generally ranks synsets from most frequent to least frequent.

(source: Daniel Jurafsky's Speech and Language Processing 2nd edition)
