A cleaner way for multiple conditions in a list comprehension

Question

I have a list of words that I want to clean based on certain criteria. For example, remove a word if it:

1. contains a dot
2. contains a number
3. contains certain noisy keywords (http, https in this case, but this can be extended)
4. is equal to 's
5. has a length of less than 3
6. is a duplicate
7. is punctuation

I wrote the following code and it does the job. However, I don't think it's very clean, especially if I add a few more conditions to it.

import string

unique_words = []
[unique_words.append(word) for word in doc_new if word not in unique_words
 and word not in string.punctuation and not any([token.isdigit() for token in word])
 and word != "'s" and len(word) > 2 and 'http' not in word and 'https' not in word
 and '.' not in word]

test example:

['http:', 'edition.cnn.com', '2017', '10', '25', 'asia', 'xi', 'jinping', 'china', 'trump', 'index.html']

output:

['asia', 'jinping', 'china', 'trump']

Is there a cleaner way to do this?

Note: Python 3.x

Maybe you could find a regular expression for that... And you don't need the "'s" part because it's shorter than 3 characters — Florian H, Oct 26 '17 at 09:31
@FlorianH that's true. I will remove it in the final version. — utengr, Oct 26 '17 at 09:49

Jakub Niwa · Answer 1 · 2017-10-26T10:01:10.367

You can move the filter criteria into a separate function:

import string


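def acceptance_function(word):
    # Reject words shorter than 3 characters, words containing a digit or a dot,
    # and words containing 'http' (which also covers 'https').
    if len(word) < 3 or any(character in "0123456789." for character in word) or 'http' in word:
        return False
    return True


items = ['http:', 'edition.cnn.com', '2017', '10', '25', 'asia', 'xi', 'jinping', 'china', 'trump', 'index.html']

# In Python 3, filter() returns a lazy iterator, which set() consumes below.
filtered_items = filter(acceptance_function, items)

# set() removes duplicates but does not preserve the original order.
unique_items = list(set(filtered_items))

print(unique_items)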
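
If the order of the words matters, the set can be swapped for dict.fromkeys. This is only a sketch, assuming Python 3.7+ (where dicts keep insertion order); is_acceptable is a hypothetical name used for illustration, and the duplicate 'asia' was added to the sample list to show the deduplication:

```python
def is_acceptable(word):
    # Hypothetical helper: keep words of 3+ characters that contain
    # no digit, no dot, and no 'http' substring.
    if len(word) < 3:
        return False
    if any(character in "0123456789." for character in word):
        return False
    return 'http' not in word


items = ['http:', 'edition.cnn.com', '2017', '10', '25', 'asia', 'xi',
         'jinping', 'china', 'trump', 'index.html', 'asia']

# dict.fromkeys keeps the first occurrence of each key, in insertion order.
unique_items = list(dict.fromkeys(filter(is_acceptable, items)))

print(unique_items)  # ['asia', 'jinping', 'china', 'trump']
```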

score 0 · Answer 2 · answered Oct 26 '17 at 09:45

I'd avoid chaining boolean operators and instead do something like this:

import string

ILLEGAL_CHARACTERS = "0123456789."
ILLEGAL_KEYWORDS = {"http", "https"}


def filter_words(iterable_of_words):  # named so it does not shadow the built-in filter()
    filtered_words = []
    for word in iterable_of_words:
        if len(word) < 3 or any(char in ILLEGAL_CHARACTERS for char in word):
            # Word is less than three characters long,
            # or contains illegal characters (numbers, period).
            continue
        elif any(keyword in word for keyword in ILLEGAL_KEYWORDS):
            # Word contains an illegal keyword.
            continue
        filtered_words.append(word)
    return filtered_words

We can skip checking whether the word is "'s" or a single punctuation character, because a word must be at least three characters long to be kept.
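
If the list of conditions keeps growing, one option is to keep each condition as its own small rule and combine them with all(). This is only a sketch, not part of the code above: the names RULES and clean are made up for illustration, and the punctuation rule assumes "is punctuation" means that every character of the word is in string.punctuation.

```python
import string

# Hypothetical rule list: each rule returns True when the word is acceptable.
RULES = [
    lambda word: len(word) >= 3,
    lambda word: not any(char.isdigit() for char in word),
    lambda word: '.' not in word,
    lambda word: 'http' not in word,  # also covers 'https'
    lambda word: not all(char in string.punctuation for char in word),
]


def clean(words):
    """Return the words that pass every rule, without duplicates, in order."""
    seen = set()
    result = []
    for word in words:
        if word in seen or not all(rule(word) for rule in RULES):
            continue
        seen.add(word)
        result.append(word)
    return result


words = ['http:', 'edition.cnn.com', '2017', '10', '25', 'asia', 'xi',
         'jinping', 'china', 'trump', 'index.html']
print(clean(words))  # ['asia', 'jinping', 'china', 'trump']
```

Adding a new condition then means appending one line to RULES instead of extending a long boolean expression.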

Piyush Patel · Answer 3 · 2017-10-26T11:14:14.387

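You can use a regex to simplify this. The regex below covers points 1, 2, 3, 4 and 7:

r"[0-9.:;?!]|http|https|'s"

You can append as many further alternatives as you want, but keep in mind that each one matches anywhere inside a word: a short keyword such as in would also reject china and jinping.

Point 5 (the length check) automatically takes care of point 4 as well.

import re

pattern = re.compile(r"[0-9.:;?!]|http|https|'s")
finallist = set()  # a set also removes duplicates (point 6)
for w in ['http:', 'edition.cnn.com', '10', '2017', 'k10', '25', 'europe', '\'s', 'xi', 'jinping', 'china', 'trump', 'index.html']:
    # Keep the word only if it is long enough and matches none of the patterns.
    if len(w) >= 3 and not pattern.search(w):
        finallist.add(w)
print(finallist)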
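
If the keyword list is likely to change, the pattern can also be built from a plain Python list. The following is a rough sketch under a few assumptions: NOISY_KEYWORDS, PATTERN and clean are hypothetical names, and a list is used instead of a set so that the original order is kept.

```python
import re

# Hypothetical keyword list; extend it as needed.
NOISY_KEYWORDS = ["http", "https", "'s"]

# re.escape makes each keyword match literally, even if it contains
# regex metacharacters such as '.' or '?'.
PATTERN = re.compile(
    "[0-9.:;?!]|" + "|".join(re.escape(keyword) for keyword in NOISY_KEYWORDS)
)


def clean(words):
    """Keep words of 3+ characters that match no pattern, without duplicates."""
    kept = []
    for w in words:
        if len(w) >= 3 and not PATTERN.search(w) and w not in kept:
            kept.append(w)
    return kept


sample = ['http:', 'edition.cnn.com', '10', '2017', 'k10', '25', 'europe',
          "'s", 'xi', 'jinping', 'china', 'trump', 'index.html']
print(clean(sample))  # ['europe', 'jinping', 'china', 'trump']
```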

Hope this helps
