Is there any documentation anywhere on the main differences between the IC files in NLTK Wordnet?

Specifically, looking for the differences between brown_ic, semcor_ic, genesis_ic, etc. so I can determine which one is best for my corpus of words in similarity efforts.

Additional question: do all of the aforementioned similarity measures require the words being compared to have the same part of speech (POS)?

Ksofiac
Found some details on brown_ic here: https://stackoverflow.com/questions/18705778/what-is-the-use-of-brown-corpus-in-measuring-semantic-similarity-based-on-wordne – Ksofiac Aug 07 '17 at 16:31

1 Answer

I think you need to look up each corpus separately. The list at http://www.nltk.org/nltk_data/ really only gives the size and license of each one.

The Brown corpus is American English text from 1961, a mix of fact and fiction. See https://en.wikipedia.org/wiki/Brown_Corpus

semcor is a subset of the Brown corpus, annotated with WordNet senses.

genesis is Bible text according to http://nlpforhackers.io/corpora/ (which looks like it has useful information on some of the others, too)
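To see why the choice of source corpus matters for the IC files, here is a rough sketch in plain Python. It does not use NLTK or the real IC files: the tiny taxonomy, the word counts, and the helper names (`ancestors`, `information_content`, `resnik`) are all made up for illustration. It assumes the usual definition of information content as -log(p(concept)), with each word's count also credited to its ancestors, and Resnik similarity as the IC of the most informative shared ancestor.

```python
import math

# Hypothetical toy is-a taxonomy (NOT WordNet): child -> parent.
PARENT = {
    "dog": "animal",
    "cat": "animal",
    "animal": "entity",
    "car": "entity",
}
ROOT = "entity"


def ancestors(concept):
    """Return the concept followed by its ancestors up to the root."""
    chain = [concept]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def information_content(word_counts):
    """Map each concept to -log(p); a word's count also counts for every ancestor."""
    totals = {}
    for word, count in word_counts.items():
        for concept in ancestors(word):
            totals[concept] = totals.get(concept, 0) + count
    root_total = totals[ROOT]
    return {c: -math.log(n / root_total) for c, n in totals.items()}


def resnik(a, b, ic):
    """Resnik similarity: IC of the most informative common ancestor."""
    common = set(ancestors(a)) & set(ancestors(b))
    return max(ic[c] for c in common)


# Two made-up "corpora" with different word frequencies.
news_ic = information_content({"dog": 10, "cat": 10, "car": 80})
pets_ic = information_content({"dog": 45, "cat": 45, "car": 10})

# Same word pair, same taxonomy, different score depending on the corpus.
print(resnik("dog", "cat", news_ic))
print(resnik("dog", "cat", pets_ic))
```

In the corpus where animals are rare, "animal" is more informative, so dog and cat score as more similar than in the corpus where animals dominate. The same effect is why brown_ic, semcor_ic and genesis_ic can give different scores for the same pair of synsets, so an IC file built from text that resembles your own corpus is probably the safer pick.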

Darren Cook