I have a list of paragraphs, and I would like to check whether the words in them are valid English words. Sometimes, due to external issues, I might not get valid English words in these paragraphs. I am aware of libraries like pyenchant and nltk, which ship with dictionaries and provide some level of accuracy, but both of them have a few drawbacks. I wonder if there is another library or procedure that can give me what I am looking for with the highest accuracy possible.
- What drawbacks do you mean, let's say for `pyenchant`? – Kostas Charitidis Oct 16 '19 at 07:28
- Possible duplicate of [How to check if a word is an English word with Python?](https://stackoverflow.com/questions/3788870/how-to-check-if-a-word-is-an-english-word-with-python) – Kostas Charitidis Oct 16 '19 at 07:29
2 Answers
It depends greatly on what you mean by valid English words. Is ECG, Thor or Loki a valid English word? If your definition of valid words differs, you might need to create your own language model.

Anyway, besides the obvious use of pyEnchant or nltk, I would suggest the fasttext library. It has multiple pre-built word vector models, and you can check your paragraph for rare or out-of-vocabulary words. What you essentially want to check is that the embedding of a specific "non-valid" word is similar to only a small number (or zero) of other words. You can use fasttext directly from Python:
pip install fasttext
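For illustration, a minimal sketch with the fasttext bindings; it assumes you have downloaded a pre-trained English model (for example cc.en.300.bin from the fasttext website) into your working directory:

import fasttext

# Load a pre-trained model; the file name is an assumption, use your own
model = fasttext.load_model("cc.en.300.bin")
vocab = set(model.words)  # words seen during training

print("hello" in vocab)          # True for common English words
print("mybizarreword" in vocab)  # False: out-of-vocabulary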
Alternatively, you can use the gensim library (which also provides additional algorithms, such as Word2Vec, which can be useful for your case too):
pip install --upgrade gensim
Or, for conda:
conda install -c conda-forge gensim
Assuming you use gensim and a pre-trained fasttext model:

from gensim.models.fasttext import load_facebook_model
from gensim.test.utils import datapath

# "fasttext-model.bin" is a placeholder for your downloaded model file
cap_path = datapath("fasttext-model.bin")
fb_model = load_facebook_model(cap_path)
Now you can perform several tasks to achieve your goal:

1. Check out-of-vocabulary words:
'mybizarreword' in fb_model.wv.vocab  # in gensim >= 4.0 this becomes fb_model.wv.key_to_index
2. Check similarity:
fb_model.wv.most_similar("man")
For rare words you will get low scores, and by setting a threshold you can decide which words are not 'valid'.
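Here is a hedged sketch of that thresholding idea; the helper name looks_valid and the 0.5 cutoff are my own assumptions to tune on your data:

def looks_valid(model, word, threshold=0.5):
    # FastText builds vectors even for unseen words from their character
    # n-grams, so we can always ask for the nearest in-vocabulary
    # neighbour and inspect its similarity score.
    nearest = model.wv.most_similar(word, topn=1)
    return nearest[0][1] >= threshold

print(looks_valid(fb_model, "man"))            # likely True
print(looks_valid(fb_model, "mybizarreword"))  # likely False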

Linux and Mac OS X ship with a word list that you can use directly; otherwise, you can download a list of English words (see the sketch after the example below). You can use it as follows:
# Load the system word list into a dict for O(1) membership checks
d = {}
fname = "/usr/share/dict/words"
with open(fname) as f:
    content = f.readlines()
for w in content:
    d[w.strip()] = True
p ="""I have a list of paragraphs, I would like to check if these words are valid English words or not. Sometimes, due to some external issues, i might not get valid English words in these paragraphs. I am aware of libraries like pyenchant and nltk which have a set of dictionaries and provide accuracy of some level but both of these have few drawbacks. I wonder if there exists another library or procedure that can provide me with what I am looking for with at-most accuracy possible."""
lw = []
for w in p.split():
    if len(w) < 4:  # skip words shorter than four characters
        continue
    if d.get(w, False):  # keep only words found in the dictionary
        lw.append(w)
print(len(lw))
print(lw)
#43
#['have', 'list', 'would', 'like', 'check', 'these', 'words', 'valid', 'English', 'words', 'some', 'external', 'might', 'valid', 'English', 'words', 'these', 'aware', 'libraries', 'like', 'which', 'have', 'dictionaries', 'provide', 'accuracy', 'some', 'level', 'both', 'these', 'have', 'wonder', 'there', 'exists', 'another', 'library', 'procedure', 'that', 'provide', 'with', 'what', 'looking', 'with', 'accuracy']
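If /usr/share/dict/words is not available on your system, here is a minimal download sketch; the dwyl/english-words list used here is an assumption, and any plain-text word list will do:

import urllib.request

# Assumption: the dwyl/english-words list; substitute any list you trust
url = "https://raw.githubusercontent.com/dwyl/english-words/master/words_alpha.txt"
with urllib.request.urlopen(url) as resp:
    words = set(resp.read().decode("utf-8").splitlines())

print("paragraph" in words)  # True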
