2

My problem: I want to check whether a given word is a common English word. I'm currently using pyenchant to check whether a word is a real word, but I can't find a function in it that returns a word's frequency or tells me whether it's a common word.

Example code:

import enchant
eng_dict = enchant.Dict("en_US")

words = ['hello', 'world', 'thisisntaword', 'anachronism']
good_words = []

for word in words:
    if eng_dict.check(word): # currently this only checks that it's an English word, but I also want to check that it's a commonly used word
        good_words.append(word)
print(good_words)

What it currently returns: ['hello', 'world', 'anachronism']. What I want it to return: ['hello', 'world'], because anachronism is obviously not a common word.

Any solutions to my problem?

Brian C

2 Answers

3

You could use the Google Ngram API for this:

import requests

url = "https://books.google.com/ngrams/json"

# Placeholder: replace with your own word, or a comma-separated string of phrases.
content = "anachronism"

query_params = {
    "content": content,
    "year_start": 2017,
    "year_end": 2019,
    "corpus": 26,
    "smoothing": 1,
    "case_insensitive": True,
}
response = requests.get(url=url, params=query_params)

This API gives you access to v3 of the Google Ngram database, which is the most recent version available. Note, however, that the API is not officially documented, and because you run into rate limits quite easily, it is not suitable for production use. Alternative tools are PhraseFinder (https://phrasefinder.io/) and NGRAMS (https://ngrams.dev/). PhraseFinder is a wrapper around v2 of the Google Ngram database; NGRAMS is a wrapper around v3 of the same database. Both are free and can handle more traffic than the Google API.
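To turn an Ngram response into a "common word" decision, something like the sketch below might work. Since the API is undocumented, the response shape is an assumption here: a JSON list of objects with an "ngram" label and a "timeseries" list of yearly relative frequencies. The helper names, the sample payload, and the 1e-5 threshold are hypothetical, and the sketch assumes each label equals the queried word, which may not hold for every query option.

```python
from statistics import mean
from typing import Dict, List


def mean_ngram_frequencies(payload: List[dict]) -> Dict[str, float]:
    """Map each returned ngram to the mean of its yearly relative frequencies.

    Assumes each entry looks like {"ngram": str, "timeseries": [float, ...]}.
    """
    result: Dict[str, float] = {}
    for entry in payload:
        series = entry.get("timeseries") or []
        result[entry["ngram"]] = mean(series) if series else 0.0
    return result


def common_words(words: List[str], payload: List[dict],
                 threshold: float = 1e-5) -> List[str]:
    """Keep the words whose mean frequency exceeds the (arbitrary) threshold."""
    freqs = mean_ngram_frequencies(payload)
    # Words missing from the payload are treated as having frequency 0.
    return [w for w in words if freqs.get(w, 0.0) > threshold]


# Hypothetical payload written by hand for illustration, not real API output.
# In practice this would be response.json() from the request above.
sample_payload = [
    {"ngram": "hello", "timeseries": [4e-5, 5e-5, 6e-5]},
    {"ngram": "anachronism", "timeseries": [1e-6, 1e-6, 1e-6]},
]

print(common_words(["hello", "anachronism", "thisisntaword"], sample_payload))
```

As with the other approaches, you still have to pick the threshold yourself.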

mr_faulty
2

You can try the wordfreq package: https://pypi.org/project/wordfreq/. The only issue is that you have to choose a frequency limit yourself.

from wordfreq import word_frequency
lan = 'en'
words = ['hello', 'world', 'thisisntaword', 'anachronism']
good_words = []

def is_frequent(word, lan, limit):
    # word_frequency returns the word's frequency as a proportion between 0 and 1
    return word_frequency(word, lan) > limit

for word in words:
    if is_frequent(word, lan, 0.00001):
        good_words.append(word)
print(good_words)

output: ['hello', 'world']

Using zipf_frequency seems even more interesting, since its output, between 0 and 10, is on a logarithmic scale.

"The Zipf frequency of a word is the base-10 logarithm of the number of times it appears per one billion words. A word with Zipf value 6 appears once per one thousand words, for example, and a word with Zipf value 3 appears once per one million words."
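The relationship in that quote can be sketched in a few lines: a proportion f corresponds to a Zipf value of log10(f) + 9, so the 0.00001 limit used above is equivalent to a Zipf value of 4. The helper names and the default cutoff below are hypothetical, and mapping a zero frequency to 0.0 is an assumption made for this sketch.

```python
import math
from typing import Callable, Iterable, List


def frequency_to_zipf(frequency: float) -> float:
    """Convert a proportion (occurrences per word) to the Zipf scale.

    Zipf = log10(occurrences per billion words) = log10(frequency) + 9.
    """
    if frequency <= 0:
        return 0.0  # assumption: words with no recorded frequency map to 0.0
    return math.log10(frequency) + 9


def filter_common(words: Iterable[str],
                  zipf_lookup: Callable[[str], float],
                  min_zipf: float = 4.0) -> List[str]:
    """Keep the words whose Zipf value is at least min_zipf."""
    return [w for w in words if zipf_lookup(w) >= min_zipf]


# With wordfreq installed, the lookup could be its zipf_frequency function:
#   from wordfreq import zipf_frequency
#   good_words = filter_common(words, lambda w: zipf_frequency(w, "en"))
```

A cutoff on the Zipf scale is often easier to reason about than a raw proportion, because each step of 1 means a tenfold change in frequency.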

nullromo
tstx