I have a list of words which I want to clean based on certain criteria. For example, remove a word if it:

  1. contains a dot
  2. contains a number
  3. contains certain noisy keywords (http, https in this case but can be extended)
  4. is equal to 's
  5. is shorter than 3 characters
  6. is a duplicate
  7. is punctuation

I wrote the following code and it does the job. However, I don't think it's very clean, especially if I add a few more conditions to it.

import string

unique_words = []
[unique_words.append(word) for word in doc_new if word not in unique_words
 and word not in string.punctuation and not any([token.isdigit() for token in word])
 and word != "'s" and len(word) > 2 and 'http' not in word and 'https' not in word
 and '.' not in word]

test example:

['http:', 'edition.cnn.com', '2017', '10', '25', 'asia', 'xi', 'jinping', 'china', 'trump', 'index.html']

output:

['asia', 'jinping', 'china', 'trump']

Is there a cleaner way to do this?

Note: Python 3.x

utengr

3 Answers

You can move the filter criteria into a separate function:

def acceptance_function(word):
    # Reject words shorter than three characters, words containing a digit
    # or a dot, and words containing 'http' (which also covers 'https').
    if len(word) < 3 or any(character in "0123456789." for character in word) or 'http' in word:
        return False
    return True


items = ['http:', 'edition.cnn.com', '2017', '10', '25', 'asia', 'xi', 'jinping', 'china', 'trump', 'index.html']

filtered_items = filter(acceptance_function, items)

# set() removes duplicates but does not preserve the original order
unique_items = list(set(filtered_items))

print(unique_items)
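
One caveat with `set` is that it does not preserve the order of the input, so the result may come out in a different order than the one shown in the question. If order matters, a minimal sketch could look like the following. It assumes Python 3.7+ (where `dict` keeps insertion order), and `keep` is a hypothetical stand-in for the acceptance function, not part of the original answer.

```python
def keep(word):
    # Hypothetical stand-in for the acceptance function: at least three
    # characters, no digits, no dot, and no 'http' substring.
    return (len(word) >= 3
            and not any(ch in "0123456789." for ch in word)
            and "http" not in word)


items = ['http:', 'edition.cnn.com', '2017', '10', '25', 'asia', 'xi',
         'jinping', 'china', 'trump', 'asia', 'index.html']

# dict.fromkeys keeps the first occurrence of each key, in input order.
unique_items = list(dict.fromkeys(filter(keep, items)))
print(unique_items)  # ['asia', 'jinping', 'china', 'trump']
```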

I'd avoid chaining boolean operators and instead do something like this:

ILLEGAL_CHARACTERS = "0123456789."
ILLEGAL_KEYWORDS = {"http", "https"}


# Named filter_words so that it does not shadow the built-in filter()
def filter_words(iterable_of_words):
    filtered_words = []
    for word in iterable_of_words:
        if len(word) < 3 or any(char in ILLEGAL_CHARACTERS for char in word):
            # Word is less than three characters long,
            # or contains illegal characters (numbers, period).
            continue
        elif any(keyword in word for keyword in ILLEGAL_KEYWORDS):
            # Word contains an illegal keyword.
            continue
        filtered_words.append(word)
    return filtered_words

We can skip the explicit checks for "'s" and for single punctuation characters, because any word shorter than three characters is already rejected. Note that this function does not remove duplicates.
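
If more conditions are likely to be added later, one option is to keep each rejection rule as a small function in a list, so that a new criterion becomes one new entry. This is only a sketch under that assumption: the names `REJECTION_RULES` and `clean_words` are made up for illustration, and duplicates are handled with a `seen` set so that the input order is kept.

```python
import string

ILLEGAL_KEYWORDS = ("http", "https")

# Each rule returns True when the word should be dropped.
REJECTION_RULES = [
    lambda word: len(word) < 3,
    lambda word: "." in word,
    lambda word: any(char.isdigit() for char in word),
    lambda word: any(keyword in word for keyword in ILLEGAL_KEYWORDS),
    lambda word: all(char in string.punctuation for char in word),
]


def clean_words(iterable_of_words):
    seen = set()
    cleaned = []
    for word in iterable_of_words:
        # Skip duplicates and anything rejected by at least one rule.
        if word in seen or any(rule(word) for rule in REJECTION_RULES):
            continue
        seen.add(word)
        cleaned.append(word)
    return cleaned


words = ['http:', 'edition.cnn.com', '2017', '10', '25', 'asia', 'xi',
         'jinping', 'china', 'trump', 'china', '---', "'s", 'index.html']
print(clean_words(words))  # ['asia', 'jinping', 'china', 'trump']
```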

Daniel

You can use a regex to simplify this. Points 1, 2, 3, 4 and 7 can be covered with the regex below; add as many further alternatives, separated by |, as you want:

r"[0-9.:;?!]|http|https|'s"

Point 5 also takes care of point 4 automatically, and collecting the results in a set takes care of point 6.

import re

noise = re.compile(r"[0-9.:;?!]|http|https|'s")
finallist = set()
for w in ['http:', 'edition.cnn.com','10', '2017', 'k10', '25', 'europe', '\'s', 'xi', 'jinping', 'china', 'trump', 'index.html']:
    # Keep the word only if it is long enough and matches no noisy pattern
    if not (len(w) < 3 or noise.search(w)):
        finallist.add(w)
print(finallist)
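
One thing to watch when extending the pattern: the regex is matched anywhere inside the word, so a short alternative such as `in` or `as` would also reject words like 'china' or 'asia'. The sketch below illustrates the difference, assuming some keywords should only be rejected when they make up the whole word. The `STOP_WORDS` set and the `is_noise` helper are hypothetical names, not part of the answer above.

```python
import re

# Substring alternatives: matched anywhere inside the word.
substring_noise = re.compile(r"[0-9.:;?!]|http|'s")
# Hypothetical whole-word keywords: compared against the entire word.
STOP_WORDS = {"in", "as", "you"}


def is_noise(word):
    return (len(word) < 3
            or word in STOP_WORDS
            or substring_noise.search(word) is not None)


print(bool(re.search(r"in|as", "china")))  # True: 'in' occurs inside 'china'
print(is_noise("china"))                   # False: 'china' is not a stop word
print(is_noise("you"))                     # True: exact match with a stop word
```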

Hope this helps