In the Python NLTK library, you can tokenise a sentence into individual words and punctuation. However, it will also tokenise words that are not real English words. How can I remove these tokens so all I have left are the actual, grammatically correct English words?

Example:

import nltk

sentence = "The word hello is gramatically correct but henlo is not"
sentence_tokenised = nltk.tokenize.word_tokenize(sentence)
print(sentence_tokenised)

This generates:

['The', 'word', 'hello', 'is', 'gramatically', 'correct', 'but', 'henlo', 'is', 'not']

'henlo' is not an English word. Is there a function that can parse these tokens and remove invalid words like 'henlo'?


1 Answer

Based on the NLTK documentation:

A tokenizer that divides a string into substrings by splitting on the specified string (defined in subclasses).

So the tokenizer only divides a string into substrings; it does not check whether those substrings are valid words. If you want to filter out tokens that are not in the nltk.corpus.words wordlist, first download the wordlist (this only needs to be done once):

import nltk
nltk.download('words')

and then after that:

import nltk
from nltk.corpus import words

sentence = "The word hello is gramatically correct but henlo is not"
sentence_tokenised = nltk.tokenize.word_tokenize(sentence)

# Build the wordlist lookup once as a set; calling words.words() for every
# token re-reads the whole list and is very slow.
english_words = set(words.words())

# Keep only the tokens that appear in the wordlist.
output = [token for token in sentence_tokenised if token in english_words]

Output:

['The', 'word', 'hello', 'is', 'correct', 'but', 'is', 'not']
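
Note that the membership test is case-sensitive: as the output above shows, 'The' happens to pass because the wordlist contains it capitalised, but a capitalised word that is only listed in lowercase would be dropped. A minimal sketch of a case-insensitive variant, assuming you are happy to compare everything in lowercase:

# Lowercase the wordlist once so capitalisation does not affect the check.
english_words_lower = {w.lower() for w in words.words()}

output = [token for token in sentence_tokenised
          if token.lower() in english_words_lower]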