Delete item from list if it contains a substring from a "blacklist"

Question

In Python, I'd like to remove from a list any string that contains a substring found in a so-called "blacklist".

For example, assume list A is the following:

A = [ 'cat', 'doXXXg', 'monkey', 'hoBBBrse', 'fish', 'snake']

and list B is:

B = ['XXX', 'BBB']

how could I get list C:

C = [ 'cat', 'monkey', 'fish', 'snake']

I've played around with various combinations of regular expressions and list comprehensions, but I can't seem to get it to work.

Why use regexes? See [this](http://stackoverflow.com/questions/3437059/does-python-have-a-string-contains-method). — ThaMe90, Nov 14 '14 at 14:53
I'm quite curious about the people who downvoted this question! +1 – vks, Nov 14 '14 at 15:50

score 14 · Answer 1 · answered Nov 14 '14 at 14:53


>>> A = [ 'cat', 'doXXXg', 'monkey', 'hoBBBrse', 'fish', 'snake']
>>> B = ['XXX', 'BBB']

The following list comprehension will work:

>>> [word for word in A if not any(bad in word for bad in B)]
['cat', 'monkey', 'fish', 'snake']
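
As a minimal sketch (not part of the original answer), the comprehension above can be wrapped in a small helper. The name `remove_blacklisted` is made up for illustration. `any()` short-circuits, so it stops checking a word as soon as one blacklisted substring is found in it.

```python
def remove_blacklisted(words, blacklist):
    """Return the words that contain none of the blacklist substrings."""
    # any() stops at the first blacklisted substring found in the word.
    return [word for word in words if not any(bad in word for bad in blacklist)]


A = ['cat', 'doXXXg', 'monkey', 'hoBBBrse', 'fish', 'snake']
B = ['XXX', 'BBB']
C = remove_blacklisted(A, B)
print(C)  # ['cat', 'monkey', 'fish', 'snake']
```

The helper builds a new list and leaves `A` unchanged.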

Cory Kramer

This answer should be the accepted answer because it is a shorter, more Pythonic way of solving OP's problem and it doesn't require additional modules. – Haddock-san May 12 '20 at 14:29

Martijn Pieters · Accepted Answer · 2014-11-14T15:04:38.847

You could join the blacklist into one expression:

import re

blacklist = re.compile('|'.join([re.escape(word) for word in B]))

then filter words out if they match:

C = [word for word in A if not blacklist.search(word)]

Words in the pattern are escaped (so that . and other metacharacters are treated as literal characters rather than as regex syntax) and joined into a series of | alternatives:

>>> '|'.join([re.escape(word) for word in B])
'XXX|BBB'
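
To illustrate why the escaping matters, here is a small sketch with a blacklist entry that contains a metacharacter. The entry `'a.c'` and the word list are made-up examples, not data from the question.

```python
import re

# Hypothetical blacklist entry containing the regex metacharacter "."
blacklist_words = ['a.c']
words = ['abc', 'xa.cx', 'dog']

unescaped = re.compile('|'.join(blacklist_words))
escaped = re.compile('|'.join(re.escape(w) for w in blacklist_words))

# Without escaping, "." matches any character, so 'abc' is wrongly removed.
kept_unescaped = [w for w in words if not unescaped.search(w)]
# With escaping, only the literal text "a.c" is matched.
kept_escaped = [w for w in words if not escaped.search(w)]

print(kept_unescaped)  # ['dog']
print(kept_escaped)    # ['abc', 'dog']
```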

Demo:

>>> import re
>>> A = [ 'cat', 'doXXXg', 'monkey', 'hoBBBrse', 'fish', 'snake']
>>> B = ['XXX', 'BBB']
>>> blacklist = re.compile('|'.join([re.escape(word) for word in B]))
>>> [word for word in A if not blacklist.search(word)]
['cat', 'monkey', 'fish', 'snake']
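
One edge case worth noting, as an aside rather than part of the original answer: if the blacklist is empty, the joined pattern is the empty string, which matches every word, so every word would be removed. A minimal sketch of a guard follows. The helper name `filter_with_regex` is made up for illustration.

```python
import re


def filter_with_regex(words, blacklist_words):
    """Keep the words that contain none of the blacklisted substrings."""
    # An empty blacklist would compile to the empty pattern, which matches
    # every string, so return the words unchanged in that case.
    if not blacklist_words:
        return list(words)
    pattern = re.compile('|'.join(re.escape(w) for w in blacklist_words))
    return [w for w in words if not pattern.search(w)]


A = ['cat', 'doXXXg', 'monkey', 'hoBBBrse', 'fish', 'snake']
print(filter_with_regex(A, ['XXX', 'BBB']))  # ['cat', 'monkey', 'fish', 'snake']
print(filter_with_regex(A, []) == A)         # True
```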

This should generally outperform the explicit any() substring test, especially as the number of words in your blacklist grows:

>>> import random, re, string, timeit
>>> def regex_filter(words, blacklist):
...     # blacklist is a compiled pattern here
...     return [word for word in words if not blacklist.search(word)]
... 
>>> def any_filter(words, blacklist):
...     # blacklist is a list of substrings here
...     return [word for word in words if not any(bad in word for bad in blacklist)]
... 
>>> # string.ascii_letters replaces the Python 2-only string.letters
>>> words = [''.join([random.choice(string.ascii_letters) for _ in range(random.randint(3, 20))])
...          for _ in range(1000)]
>>> blacklist = [''.join([random.choice(string.ascii_letters) for _ in range(random.randint(2, 5))])
...              for _ in range(10)]
>>> timeit.timeit('any_filter(words, blacklist)', 'from __main__ import any_filter, words, blacklist', number=100000)
0.36232495307922363
>>> timeit.timeit('regex_filter(words, blacklist)', "from __main__ import re, regex_filter, words, blacklist; blacklist = re.compile('|'.join([re.escape(word) for word in blacklist]))", number=100000)
0.2499098777770996

The above tests 10 random short blacklisted words (2 to 5 characters) against a list of 1000 random words (3 to 20 characters long). The timings shown are the figures reported in the original answer (about 0.36 s for any() against 0.25 s for the regex, so the regex is roughly 1.45 times as fast). Treat them as indicative only: absolute timings depend on the machine, the Python version, and the input.

Well, the `any()` test *could* be faster if the likelihood of a match early on in the blacklist is high (or the blacklist is very small). Always measure on a reasonable modelling of your actual circumstances! — Martijn Pieters, Nov 14 '14 at 15:09
In my case, the blacklist only contains 10 or fewer words, but that being said, the solution you propose is very elegant. — precicely, Nov 14 '14 at 15:13
@user1182556: with 10 words my solution already is faster. :-) — Martijn Pieters, Nov 14 '14 at 15:14
