In the Python NLTK library, you can tokenise a sentence into individual words and punctuation. However, the tokeniser will also produce tokens for words that are not valid English (e.g. misspellings). How can I remove these tokens so that all I have left are the actual, correctly spelled English words?
Example:
import nltk
sentence = "The word hello is gramatically correct but henlo is not"
sentence_tokenised = nltk.tokenize.word_tokenize(sentence)
print(sentence_tokenised)
This generates:
['The', 'word', 'hello', 'is', 'grammatically', 'correct', 'but', 'henlo', 'is', 'not']
'henlo' is not an English word. Is there a function that can parse these tokens and remove the invalid words like 'henlo'?
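For context, the kind of filtering I'm imagining looks roughly like the sketch below. It assumes the nltk.corpus.words wordlist can stand in for an English dictionary (that's my assumption, not something I've confirmed is the recommended approach), and it only checks dictionary membership, not grammar:

import nltk
nltk.download('words')  # wordlist corpus, needed once
from nltk.corpus import words

# My assumption: this wordlist is a reasonable proxy for "valid English words"
english_vocab = set(w.lower() for w in words.words())

sentence = "The word hello is grammatically correct but henlo is not"
tokens = nltk.tokenize.word_tokenize(sentence)

# Keep only tokens found in the vocabulary; 'henlo' would be dropped
valid_tokens = [t for t in tokens if t.lower() in english_vocab]
print(valid_tokens)

Something like this would drop 'henlo', but it only tests whether each token appears in a wordlist, so I'm not sure it's the right approach, or whether NLTK has a built-in function for this.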