Wordnet Information Content (IC) Files Python

Question

Is there any documentation anywhere on the main differences between the IC files in NLTK Wordnet?

Specifically, looking for the differences between brown_ic, semcor_ic, genesis_ic, etc. so I can determine which one is best for my corpus of words in similarity efforts.

Additional question: do all aforementioned similarity measures require all words be in the same POS?

Found some details on brown_ic here: https://stackoverflow.com/questions/18705778/what-is-the-use-of-brown-corpus-in-measuring-semantic-similarity-based-on-wordne — Ksofiac, Aug 07 '17 at 16:31

score 0 · Answer 1 · answered Aug 10 '17 at 12:13

I think you need to look up each corpus separately. The list at http://www.nltk.org/nltk_data/ only gives the sizes and licenses.

The Brown corpus is American English from 1961, a mix of fact and fiction. See https://en.wikipedia.org/wiki/Brown_Corpus

semcor is a subset of the Brown corpus, tagged with WordNet senses.

genesis is Bible text according to http://nlpforhackers.io/corpora/ (which looks to have useful information on some of the others, too)
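To see why the choice of corpus matters at all, here is a rough sketch of how information content is derived from corpus counts, IC(c) = -log P(c), and how it feeds a Resnik-style similarity. The taxonomy, the counts, and the function names are all made up for illustration. This is not NLTK's code and it does not read the real IC files.

```python
import math

# Hypothetical toy taxonomy: child -> parent (None marks the root).
PARENT = {
    "entity": None,
    "animal": "entity",
    "dog": "animal",
    "cat": "animal",
    "boat": "entity",
}

def ancestors(concept):
    """Return the concept followed by its ancestors up to the root."""
    chain = []
    while concept is not None:
        chain.append(concept)
        concept = PARENT[concept]
    return chain

def information_content(word_counts):
    """IC(c) = -log P(c); a concept's count includes its descendants' counts."""
    totals = {c: 0 for c in PARENT}
    for word, n in word_counts.items():
        for c in ancestors(word):
            totals[c] += n
    root_total = totals["entity"]
    return {c: -math.log(t / root_total) for c, t in totals.items() if t > 0}

def resnik(a, b, ic):
    """Resnik similarity: IC of the most informative shared ancestor."""
    shared = set(ancestors(a)) & set(ancestors(b))
    return max(ic.get(c, 0.0) for c in shared)

# Made-up counts standing in for two different corpora.
corpus_a = {"dog": 40, "cat": 40, "boat": 20}  # animals are common
corpus_b = {"dog": 5, "cat": 5, "boat": 90}    # animals are rare

sim_a = resnik("dog", "cat", information_content(corpus_a))
sim_b = resnik("dog", "cat", information_content(corpus_b))
print(sim_a, sim_b)
```

The same pair of words scores higher under corpus_b, because "animal" is rarer there and therefore more informative. That is the reason to pick the IC file whose source text is closest to your own domain. In NLTK the real files are loaded with wordnet_ic.ic('ic-brown.dat') and passed to res_similarity, jcn_similarity or lin_similarity. As far as I know, those IC-based measures need both synsets to have the same POS, and the IC files only cover nouns and verbs.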
