I have a list of paragraphs, and I would like to check whether the words in them are valid English words. Sometimes, due to external issues, I might not get valid English words in these paragraphs. I am aware of libraries like pyenchant and nltk, which ship with dictionaries and provide some level of accuracy, but both of them have a few drawbacks. I wonder if there is another library or procedure that can give me what I am looking for with the highest accuracy possible.
- What drawbacks do you mean, let's say for `pyenchant`? – Kostas Charitidis Oct 16 '19 at 07:28
- Possible duplicate of [How to check if a word is an English word with Python?](https://stackoverflow.com/questions/3788870/how-to-check-if-a-word-is-an-english-word-with-python) – Kostas Charitidis Oct 16 '19 at 07:29
2 Answers
It depends greatly on what you mean by valid English words. Is ECG, Thor or Loki a valid English word? If your definition of valid words differs, you might need to create your own language model.

Anyway, besides the obvious use of pyEnchant or nltk, I would suggest the fasttext library. It has multiple pre-built word vector models, and you can check your paragraph for rare or out-of-vocabulary words. What you essentially want to check is that the embedding of a specific "non-valid" word is similar to only a small number (or zero) of other words. You can use fasttext directly from Python:
pip install fasttext
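For illustration, a minimal sketch with the fasttext bindings; it assumes you have downloaded a pre-trained English model (for example cc.en.300.bin from the fasttext website) into your working directory:

import fasttext

# Load a pre-trained model; the file name is an assumption, use your own
model = fasttext.load_model("cc.en.300.bin")
vocab = set(model.words)  # words seen during training

print("hello" in vocab)          # True for common English words
print("mybizarreword" in vocab)  # False: out-of-vocabulary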
Alternatively, you can use the gensim library (which also provides additional algorithms, such as Word2Vec, which can be useful for your case too):
pip install --upgrade gensim
Or, for conda:
conda install -c conda-forge gensim
Assuming you use gensim and a pre-trained fasttext model:

from gensim.models.fasttext import load_facebook_model
from gensim.test.utils import datapath

# "fasttext-model.bin" is a placeholder for your downloaded model file
cap_path = datapath("fasttext-model.bin")
fb_model = load_facebook_model(cap_path)
Now you can perform several tasks to achieve your goal:

1. Check out-of-vocabulary words:
'mybizarreword' in fb_model.wv.vocab  # in gensim >= 4.0 this becomes fb_model.wv.key_to_index
2. Check similarity:
fb_model.wv.most_similar("man")
For rare words you will get low scores, and by setting a threshold you can decide which words are not 'valid'.
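Here is a hedged sketch of that thresholding idea; the helper name looks_valid and the 0.5 cutoff are my own assumptions to tune on your data:

def looks_valid(model, word, threshold=0.5):
    # FastText builds vectors even for unseen words from their character
    # n-grams, so we can always ask for the nearest in-vocabulary
    # neighbour and inspect its similarity score.
    nearest = model.wv.most_similar(word, topn=1)
    return nearest[0][1] >= threshold

print(looks_valid(fb_model, "man"))            # likely True
print(looks_valid(fb_model, "mybizarreword"))  # likely False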

Linux and Mac OS X ship with a word list that you can use directly; otherwise, you can download a list of English words (see the sketch after the example below). You can use it as follows:
# Load the system word list into a dict for O(1) membership checks
d = {}
fname = "/usr/share/dict/words"
with open(fname) as f:
    content = f.readlines()
for w in content:
    d[w.strip()] = True
p ="""I have a list of paragraphs, I would like to check if these words are valid English words or not. Sometimes, due to some external issues, i might not get valid English words in these paragraphs. I am aware of libraries like pyenchant and nltk which have a set of dictionaries and provide accuracy of some level but both of these have few drawbacks. I wonder if there exists another library or procedure that can provide me with what I am looking for with at-most accuracy possible."""
lw = []
for w in p.split():
    if len(w) < 4:  # skip words shorter than four characters
        continue
    if d.get(w, False):  # keep only words found in the dictionary
        lw.append(w)
print(len(lw))
print(lw)
#43
#['have', 'list', 'would', 'like', 'check', 'these', 'words', 'valid', 'English', 'words', 'some', 'external', 'might', 'valid', 'English', 'words', 'these', 'aware', 'libraries', 'like', 'which', 'have', 'dictionaries', 'provide', 'accuracy', 'some', 'level', 'both', 'these', 'have', 'wonder', 'there', 'exists', 'another', 'library', 'procedure', 'that', 'provide', 'with', 'what', 'looking', 'with', 'accuracy']
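If /usr/share/dict/words is not available on your system, here is a minimal download sketch; the dwyl/english-words list used here is an assumption, and any plain-text word list will do:

import urllib.request

# Assumption: the dwyl/english-words list; substitute any list you trust
url = "https://raw.githubusercontent.com/dwyl/english-words/master/words_alpha.txt"
with urllib.request.urlopen(url) as resp:
    words = set(resp.read().decode("utf-8").splitlines())

print("paragraph" in words)  # True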
