
Does NLTK or any other NLP tool provide a library to measure the "ordinary level" of vocabulary?

By "ordinary level" I mean that certain words are simple and frequently used, like "and, age, yes, this, those, kind", which any elementary school student should know. Similarly, the Longman English Dictionary (usually for ESL learners) defines a 3000-word basic vocabulary that it uses to explain all of its entries.

At the other end, there is a set of rarely used words that only pedants use, like Agastopia, Impignorate, Gobbledygook, etc.

There are surely some levels in between these two extremes. Of course, any such level definition is subjective, and I expect different organizations or people to have different views. At the very least, it could vary from region to region.

My purpose is to measure the difficulty/complexity of some passages, for now naively, by just checking their vocabulary.

"Ordinary level" might not be a good description, but I was not able to find a proper, formal expression :). I hope my explanation clarifies my purpose.

David

1 Answer


An empirical approach to this problem is to use term frequencies from a large corpus of documents. Using most of English Wikipedia, I have created a dictionary of term frequencies (which can be downloaded here).

import pickle

# Load the term-count dictionary (word -> number of occurrences in Wikipedia)
with open('/home/user/data/enWikipediaDictTermCounts.pickle', 'rb') as handle:
    d = pickle.load(handle)

# Common words have high counts (they appear many times in Wikipedia):
d.get('age', 0)
# 207669
d.get('kind', 0)
# 62302

# Rare words have low counts:
d.get('agastopia', 0)
# 1
d.get('gobbledygook', 0)
# 39
d.get('serendipitous', 0)
# 186

As a rule of thumb, rare words appear fewer than 500 times and common words appear more than 10K times. You can play with these thresholds to find the right level of rarity (or commonness) for your application.
Note: all words in the dictionary have been converted to lowercase, so lowercase your input before looking it up.
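
To make the thresholds concrete, here is a minimal sketch of how they could be turned into a per-word level and a rough passage score. The names `word_level` and `rare_word_ratio`, the regex tokenizer, and the small `counts` dictionary (a stand-in for the pickled one, filled with the counts quoted above) are my own assumptions for illustration, not part of the downloadable data.

```python
import re

# Stand-in for the pickled dictionary; the counts are the ones quoted above.
counts = {
    "age": 207669,
    "kind": 62302,
    "serendipitous": 186,
    "gobbledygook": 39,
    "agastopia": 1,
}

RARE_BELOW = 500        # threshold suggested above; tune for your application
COMMON_ABOVE = 10_000   # threshold suggested above; tune for your application


def word_level(word, counts):
    """Classify a word as 'rare', 'common' or 'intermediate' by its count."""
    n = counts.get(word.lower(), 0)  # unseen words count as 0, i.e. rare
    if n < RARE_BELOW:
        return "rare"
    if n > COMMON_ABOVE:
        return "common"
    return "intermediate"


def rare_word_ratio(text, counts):
    """Fraction of the tokens in `text` that are classified as rare."""
    # Naive tokenizer (assumption): runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    rare = sum(1 for t in tokens if word_level(t, counts) == "rare")
    return rare / len(tokens)
```

A higher `rare_word_ratio` would then suggest a harder passage. Note that any word missing from the dictionary is treated as rare, which may overcount typos and proper names.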

DBaker
  • That's a great solution! Thanks DBaker! What does the 2nd arg to get() mean? – David Feb 24 '20 at 05:31
  • One thing that may be missing is handling word inflection, e.g. travel, travels, travelling, travelled. But I guess this can be solved by other NLP tools. – David Feb 24 '20 at 05:41
  • The 2nd arg to get() is the default, which is returned if the word is not in the Python dictionary. Conjugated verbs such as traveling and traveled should also be present in the dictionary. Btw please upvote and accept my answer if you like it :) – DBaker Feb 24 '20 at 13:33
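
If you would rather treat all inflected forms as one word, one possible sketch is to merge their counts under a common lemma. Everything here is hypothetical: `merge_inflections` is a name I made up, the counts are invented for illustration (they are not from the Wikipedia dictionary), and `toy_lemmatize` is a tiny lookup table standing in for a real lemmatizer from an NLP library.

```python
from collections import Counter


def merge_inflections(counts, lemmatize):
    """Sum the counts of all forms that `lemmatize` maps to the same lemma."""
    merged = Counter()
    for word, n in counts.items():
        merged[lemmatize(word)] += n
    return merged


# Made-up counts, for illustration only.
toy_counts = {
    "travel": 50,
    "travels": 20,
    "travelling": 10,
    "travelled": 15,
    "age": 100,
}

# Toy lemma table standing in for a real lemmatizer (assumption).
toy_lemmas = {"travels": "travel", "travelling": "travel", "travelled": "travel"}


def toy_lemmatize(word):
    # The 2nd argument to get() is the default: unknown words map to themselves.
    return toy_lemmas.get(word, word)


merged = merge_inflections(toy_counts, toy_lemmatize)
```

With these toy numbers, `merged["travel"]` is the sum over all four forms, and the same rare/common thresholds can then be applied to the merged counts (they would probably need retuning, since merged counts are larger).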