I have a list of words that I want to clean based on certain criteria. For example, remove a word if it:
- contains a dot
- contains a digit
- contains certain noisy keywords (http and https in this case, but the list can be extended)
- is equal to 's
- is shorter than 3 characters
- is a duplicate
- is punctuation
I wrote the following code, and it does the job; however, I don't think it's very clean, especially if I add a few more conditions to it.
import string

unique_words = []
[unique_words.append(word) for word in doc_new
 if word not in unique_words
 and word not in string.punctuation
 and not any(token.isdigit() for token in word)
 and word != "'s"
 and len(word) > 2
 and 'http' not in word and 'https' not in word
 and '.' not in word]
Test example:
['http:', 'edition.cnn.com', '2017', '10', '25', 'asia', 'xi', 'jinping', 'china', 'trump', 'index.html']
Output:
['asia', 'jinping', 'china', 'trump']
Is there a cleaner way to do this?
Note: Python 3.x
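For reference, one direction I considered is factoring the checks above into a named predicate and an explicit loop; this is only a sketch under the same criteria (the `is_clean` and `clean_words` names are my own), not necessarily the cleanest possible version:

```python
import string

NOISY_KEYWORDS = ('http', 'https')  # can be extended

def is_clean(word):
    """Return True if the word passes every filtering criterion."""
    return (
        '.' not in word                                 # no dots
        and not any(ch.isdigit() for ch in word)        # no digits
        and not any(k in word for k in NOISY_KEYWORDS)  # no noisy keywords
        and word != "'s"                                # not the 's token
        and len(word) > 2                               # at least 3 characters
        and word not in string.punctuation              # not punctuation
    )

def clean_words(words):
    """Keep clean words, dropping duplicates while preserving order."""
    seen = set()
    result = []
    for word in words:
        if word not in seen and is_clean(word):
            seen.add(word)
            result.append(word)
    return result

doc_new = ['http:', 'edition.cnn.com', '2017', '10', '25',
           'asia', 'xi', 'jinping', 'china', 'trump', 'index.html']
print(clean_words(doc_new))  # ['asia', 'jinping', 'china', 'trump']
```

Splitting the conditions into `is_clean` means adding a new rule only touches that one function, and using a `set` for the duplicate check avoids the O(n) `word not in unique_words` lookup on every iteration.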