22

Is there a way to find the frequency of a word's usage in the English language, using WordNet or NLTK, with Python?

NOTE: I do not want the frequency count of a word in a given input file. I want the frequency count of a word in general, based on how it is used today.

Peter Mortensen
Apps

8 Answers

21

In WordNet, every Lemma has a frequency count, returned by the method lemma.count() and stored in the file nltk_data/corpora/wordnet/cntlist.rev.

Code example:

from nltk.corpus import wordnet
syns = wordnet.synsets('stack')
for s in syns:
    for l in s.lemmas():
        print l.name + " " + str(l.count())

Result:

stack 2
batch 0
deal 1
flock 1
good_deal 13
great_deal 10
hatful 0
heap 2
lot 13
mass 14
mess 0
...

However, many counts are zero, and neither the source file nor the documentation says which corpus was used to create this data. According to the book Speech and Language Processing by Daniel Jurafsky and James H. Martin, the sense frequencies come from the SemCor corpus, which is a subset of the already small and outdated Brown Corpus.

So it's probably best to choose the corpus that fits your application best and create the data yourself, as Christopher suggested.

To make this Python 3.x compatible, just do:

Code example:

from nltk.corpus import wordnet
syns = wordnet.synsets('stack')
for s in syns:
    for l in s.lemmas():
        print(l.name() + " " + str(l.count()))
Suzana
  • Just to echo @Suzana_K's point, I am finding wordnet's `lemma.count()` not very useful given the number of 0's in the counts and the overall lack of frequency distinctions among words. – Ram Narasimhan Oct 07 '12 at 06:19
  • Based on the description for frequency counts in the [official WordNet documentation](https://wordnet.princeton.edu/wordnet/frequently-asked-questions/for-linguists/), I'm not sure it means what we think it means: > "Frequency counts are based on the number of senses a word has." – anana Nov 07 '15 at 22:40
  • Then why do most have a frequency count of zero? A word with zero senses makes no sense. – Suzana Nov 08 '15 at 22:27
  • The counts are induced over sense tagged texts, which are expensive to generate. Many of the senses in WordNet are extremely distinct (and fickle) which means that finding an example of them in a random sentence is fairly unlikely. Takeaway message: tagged data is hard to find, the synsets only count sense-tagged instances of words. If you don't care about senses, use raw corpus counts instead (not wordnet) – Ritwik Bose Nov 13 '15 at 02:47
  • According to the book 'Speech and Language Processing' from Daniel Jurafsky, James H. Martin, WordNet gets their sense frequencies from a 'SemCor' corpus. (page 742 of the second edition) – alvitawa May 14 '19 at 13:12
  • Thanks for the information. The [SemCor Corpus](https://www.sketchengine.eu/semcor-annotated-corpus/) is just a semantically annotated subset of the brown corpus, which is already small and outdated (1960s). That would explain the bad quality of the frequency data. – Suzana May 14 '19 at 14:37
12

You can sort of do it using the Brown Corpus, though it's out of date (last revised in 1979), so it's missing lots of current words.

import nltk
from nltk.corpus import brown
from nltk.probability import *

words = FreqDist()

for sentence in brown.sents():
    for word in sentence:
        words.inc(word.lower())  # NLTK 2 only; in NLTK 3 use: words[word.lower()] += 1

print words["and"]
print words.freq("and")

You could then cPickle the FreqDist to a file for faster loading later.

A corpus is basically just a file full of sentences, one per line, and there are lots of other corpora out there, so you could probably find one that fits your purpose. A couple of other sources of more current corpora: Google, American National Corpus.
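For illustration, here is a minimal counting sketch along those lines; the filename my_corpus.txt is a placeholder for whatever one-sentence-per-line file you end up using, and the tokenization is just a naive whitespace split:

from nltk.probability import FreqDist

words = FreqDist()
with open("my_corpus.txt", encoding="utf-8") as f:  # placeholder filename
    for sentence in f:
        for word in sentence.split():  # naive whitespace tokenization
            words[word.lower()] += 1   # FreqDist.inc() no longer exists in NLTK 3

print(words["and"])       # absolute count
print(words.freq("and"))  # relative frequency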

You can also supposedly get a current list of the top 60,000 words and their frequencies from the Corpus of Contemporary American English.

Christopher Pickslay
  • Perfect solution for analysing older texts. The ``import nltk`` isn't necessarily needed and the ``from nltk.probability import *`` could be changed to only import ``FreqDist``. – davidjb Aug 02 '14 at 05:40
  • How do I print all the words of a corpus with their frequencies and cPickle the FreqDist to a file in Python? Please help, as I'm a newbie at Python pickling. – M S Nov 25 '18 at 20:05
  • The inc attribute is deprecated; see this post: https://stackoverflow.com/questions/25827058/attributeerror-freqdist-object-has-no-attribute-inc – Woden Mar 23 '21 at 09:30
3

Check out this site for word frequencies: http://corpus.byu.edu/coca/

Somebody compiled a list of words taken from opensubtitles.org (movie scripts). A free, simple text file in the format below is available for download, in many different languages.

you 6281002
i 5685306
the 4768490
to 3453407
a 3048287
it 2879962

http://invokeit.wordpress.com/frequency-word-lists/
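A minimal sketch for loading one of those lists into a dict; the filename en_full.txt is a placeholder for whichever file you download (each line is a word followed by its count):

freq = {}
with open("en_full.txt", encoding="utf-8") as f:  # placeholder filename
    for line in f:
        word, count = line.split()
        freq[word] = int(count)

print(freq.get("you"))  # 6281002 for the sample above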

live-love
2

You can't really do this, because it depends so much on the context. Not only that, for less frequent words the frequency will be wildly dependent on the sample.

Your best bet is probably to find a large corpus of text of the given genre (e.g. download a hundred books from Project Gutenberg) and count the words yourself.
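For a quick experiment, NLTK happens to bundle a small Project Gutenberg sample, so a minimal counting sketch could look like the following; for real use you would substitute your own, much larger download:

import nltk
nltk.download("gutenberg", quiet=True)
from nltk.corpus import gutenberg
from nltk.probability import FreqDist

# Count every alphabetic token, lowercased, across the bundled Gutenberg texts.
words = FreqDist(w.lower() for w in gutenberg.words() if w.isalpha())

print(words["whale"])       # absolute count
print(words.freq("whale"))  # relative frequency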

Katriel
  • Be wary though of the fact that Project Gutenberg only has literary books. If you are interested in more colloquial English then you might need a different source such as online blog posts/comment threads. Also, please be nice to any websites from which you might decide to scrape content :) – Mihai Oprea May 08 '11 at 17:36
2

Take a look at the Information Content section of the WordNet Similarity project at http://wn-similarity.sourceforge.net/. There you will find databases of word frequencies (or rather, information content, which is derived from word frequency) for WordNet lemmas, calculated from several different corpora. The source code is in Perl, but the databases are provided independently and can easily be used with NLTK.
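As a rough sketch (assuming I remember the NLTK interface correctly), NLTK itself also ships information-content databases derived from such counts via wordnet_ic; higher information content roughly corresponds to a rarer word:

import nltk
nltk.download("wordnet", quiet=True)
nltk.download("wordnet_ic", quiet=True)

from nltk.corpus import wordnet, wordnet_ic
from nltk.corpus.reader.wordnet import information_content

brown_ic = wordnet_ic.ic("ic-brown.dat")  # information content derived from the Brown Corpus
print(information_content(wordnet.synset("stack.n.01"), brown_ic))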

YKS
2

You can download the word vectors glove.6B.zip from https://github.com/stanfordnlp/GloVe, unzip them and look at the file glove.6B.50d.txt.

There, you will find 400,000 English words, one per line (plus 50 numbers per word on the same line), lowercased and sorted from most frequent (the) to least frequent. You can create a word rank by reading this file as raw text or with pandas.

It's not perfect, but I have used it in the past. The same website provides other files with up to 2.2 million English words, cased.
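For illustration, a minimal sketch that turns the file into a frequency rank; the path glove.6B.50d.txt is assumed to point at the unzipped file:

# Each line starts with the word, followed by 50 vector components.
rank = {}
with open("glove.6B.50d.txt", encoding="utf-8") as f:
    for i, line in enumerate(f):
        word = line.split(" ", 1)[0]
        rank[word] = i  # 0 = most frequent

print(rank["the"])        # should be 0
print(rank.get("stack"))  # rank of a less common word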

tyrex
1

The Wiktionary project has a few frequency lists based on TV scripts and Project Gutenberg, but their format is not particularly nice for parsing.

Don Kirkby
0

Python 3 version of Christopher Pickslay's solution (incl. saving frequencies to tempdir):

from pathlib import Path
from pickle import dump, load
from tempfile import gettempdir

from nltk.probability import FreqDist


def get_word_frequencies() -> FreqDist:
    # Cache the distribution in the temp directory so the Brown corpus
    # only has to be downloaded and processed once.
    tmp_path = Path(gettempdir()) / "word_freq.pkl"
    if tmp_path.exists():
        with tmp_path.open(mode="rb") as f:
            word_frequencies = load(f)
    else:
        from nltk import download
        download('brown', quiet=True)
        from nltk.corpus import brown
        word_frequencies = FreqDist(word.lower() for sentence in brown.sents()
                                    for word in sentence)
        with tmp_path.open(mode="wb") as f:
            dump(word_frequencies, f)

    return word_frequencies

Usage:

word_frequencies = get_word_frequencies()

print(word_frequencies["and"])
print(word_frequencies.freq("and"))

Output:

28853
0.02484774266443448
Stefan