
According to the documentation, I can load a sense-tagged corpus in NLTK like this:

>>> from nltk.corpus import wordnet_ic
>>> brown_ic = wordnet_ic.ic('ic-brown.dat')
>>> semcor_ic = wordnet_ic.ic('ic-semcor.dat')

I can also get the definition, POS, offset, and examples like this:

>>> wn.synset('dog.n.01').examples()
>>> wn.synset('dog.n.01').definition()

But how can I get the frequency of a synset from a corpus? To break the question down:

  1. First, how do I count how many times a synset occurs in a sense-tagged corpus?
  2. Then, how do I divide that count by the total number of occurrences of all synsets for the given lemma?
  • In the lemma section of the documentation it shows some counts, but I'm not sure what they are: http://nltk.googlecode.com/svn/trunk/doc/howto/wordnet.html – alvas Mar 21 '13 at 15:10

2 Answers


I managed to do it this way.

from nltk.corpus import wordnet as wn

word = "dog"
synsets = wn.synsets(word)

sense2freq = {}
for s in synsets:
  freq = 0
  for lemma in s.lemmas():   # lemmas() is a method in NLTK 3
    freq += lemma.count()    # sense-tagged frequency from the corpus
  sense2freq["%s-%s" % (s.offset(), s.pos())] = freq

for s in sense2freq:
  print(s, sense2freq[s])
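Step 2 of the question (turning raw counts into relative frequencies) can then be done without touching NLTK again. A minimal sketch, where the `sense2freq` dict is hardcoded with hypothetical values standing in for the output of the loop above:

```python
# Hypothetical counts for three senses of a lemma, standing in for the
# sense2freq dict built above (real values would come from lemma.count()).
sense2freq = {"2084071-n": 42, "9886220-n": 1, "7692347-n": 0}

total = sum(sense2freq.values())
# Relative frequency of each sense given the lemma.
sense2relfreq = {sense: count / total for sense, count in sense2freq.items()}

print(sense2relfreq)
```

Note that any sense with a raw count of 0 stays at probability 0 here, which is exactly the issue raised in the comments below.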
    I would not rely on `lemma.count()`: many entries are zero, and there is no information about which corpus the frequency data was taken from. See also [this related question](http://stackoverflow.com/questions/5928704/how-do-i-find-the-frequency-count-of-a-word-in-english-using-wordnet/12376620#12376620) – Suzana Mar 21 '13 at 15:44
  • Thanks for the note on the zero counts. It's crude smoothing, but I smoothed them with Laplace; at least getting 0.001 is better than getting 0 and breaking the other subsystems in the pipeline =) – alvas Mar 21 '13 at 16:07
  • Unfortunately the sum is not the same as the displayed sense frequency when using the WordNet online to look up senses. The latter is the useful number in my opinion. – Radio Controlled Nov 22 '21 at 08:54
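The add-one (Laplace) smoothing mentioned in the comments above can be sketched on the same kind of counts dict; the sense keys and counts here are hypothetical:

```python
# Hypothetical raw counts; some senses have a zero count, as noted above.
sense2freq = {"2084071-n": 42, "9886220-n": 1, "7692347-n": 0}

# Add-one (Laplace) smoothing: add 1 to every count before normalising,
# so no sense ends up with probability exactly 0.
smoothed_total = sum(sense2freq.values()) + len(sense2freq)
sense2prob = {sense: (count + 1) / smoothed_total
              for sense, count in sense2freq.items()}

print(sense2prob)  # every probability is now > 0
```

This trades a little probability mass from the frequent senses to the unseen ones, which is usually an acceptable cost for keeping downstream components from choking on zeros.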

If you only need the most frequent sense, you can use wn.synsets(word)[0], since WordNet generally ranks synsets from most frequent to least frequent.

(source: Daniel Jurafsky's Speech and Language Processing 2nd edition)
