Python: How to remove/discard a string from a set of strings using case-insensitive match?

Question

I have a case from Wikidata where the string "Articles containing video clips" shows up in a set of categories and needs to be removed. Trouble is, it also shows up in other sets as "articles containing video clips" (lowercase "a").

The simple/safe way to remove it seems to be

   setA.discard("Articles containing video clips").discard("articles containing video clips")

Perfectly adequate, but doesn't scale in complex cases. Is there any way to do this differently, other than the obvious loop or list/set comprehension using, say, casefold for the comparison?

  unwantedString = 'Articles containing video clip'
  setA = {'tsunami', 'articles containing video clip'}

  reducedSetA = {nonmatch for nonmatch in setA
                 if nonmatch.casefold() != unwantedString.casefold()}

  print(reducedSetA)
  {'tsunami'}

Note that this is not a string replacement situation - it is removal of a string from a set of strings.

That "simple/safe" way will not work at all, since discard operates in place and returns `None`. — wim, Mar 06 '23 at 22:22
What if you used the built-in `filter` function, and pass something like `lambda x: x in unwantedStrings` as the key? Or am I just missing something? (`set(filter(lambda x: x in unwantedStrings, setA))`) — Jason Grace, Mar 06 '23 at 22:27
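
To make the two comments above concrete, here is a minimal sketch. The names `setB`, `setC`, `unwantedFolded`, and `chained_ok` are hypothetical and used only for illustration. It shows that the chained `discard` fails, that separate calls work, and that a `filter`-based version needs a negated, casefolded test to drop (rather than keep) the unwanted strings.

```python
setA = {"tsunami", "Articles containing video clips"}

# set.discard() mutates the set and returns None, so a chained call
# ends up calling .discard on None and raises AttributeError.
try:
    setA.discard("Articles containing video clips").discard("articles containing video clips")
    chained_ok = True
except AttributeError:
    chained_ok = False

# Separate calls (or a loop over the known spellings) work as intended.
setB = {"tsunami", "articles containing video clips"}
for spelling in ("Articles containing video clips", "articles containing video clips"):
    setB.discard(spelling)
print(setB)  # {'tsunami'}

# filter() keeps items for which the predicate is true, so the test is negated;
# unwantedFolded (a hypothetical name) holds already-casefolded strings.
unwantedFolded = {"articles containing video clips"}
setC = {"tsunami", "Articles containing video clips"}
kept = set(filter(lambda x: x.casefold() not in unwantedFolded, setC))
print(kept)  # {'tsunami'}
```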

Pranav Hosangadi · Answer 1 · 2023-03-07T01:29:42.737

The problem with implementing this using a set comprehension as you do is that it turns an O(1) operation into an O(N) operation, since you need to check item.casefold() != unwantedString.casefold() for each item in the set.

One option to work around this would be to keep a dictionary that maps each casefolded key to the set of original strings sharing that key. When you want to delete an element, look up all elements that have the same casefolded value, and delete those too. You could write a class to handle this that would look like so:

class EasyRemoveSet(set):
    """A set that removes every element sharing a key with the given one."""

    def __init__(self, *args, key_func=str.casefold, **kwargs):
        super().__init__(*args, **kwargs)
        self.__key_func = key_func
        # Maps key_func(elem) -> set of original elements with that key.
        self.__lookup = {}
        self.__add_to_lookup(self)

    def __add_to_lookup(self, elems):
        for elem in elems:
            self.__lookup.setdefault(self.__key_func(elem), set()).add(elem)

    def add(self, elem):
        super().add(elem)
        self.__add_to_lookup([elem])

    def remove(self, elem):
        # Raises KeyError if no element shares elem's key.
        elems_to_remove = self.__lookup.pop(self.__key_func(elem))
        for e in elems_to_remove:
            super().remove(e)

    def discard(self, elem):
        # Like remove(), but silently does nothing when the key is absent.
        elems_to_remove = self.__lookup.pop(self.__key_func(elem), [])
        for e in elems_to_remove:
            super().discard(e)

    def clear(self):
        super().clear()
        self.__lookup = {}

Then, you can do:

setA = EasyRemoveSet(["abc", "Abc", "def", "DeF", "ABC", "abC", "DEF", "abc"])
print(setA) # EasyRemoveSet({'abc', 'DEF', 'DeF', 'ABC', 'abC', 'def', 'Abc'})

setA.remove("Abc")
print(setA) # EasyRemoveSet({'DEF', 'DeF', 'def'})

The keyword-only argument key_func allows you to specify a callable whose return value will be used as the key to identify duplicates. For example, if you wanted to use this class for integers, and remove negative and positive integers in one go:

num_set = EasyRemoveSet([1, 2, 3, 4, 5, -1, -2, -3, -4, -5], key_func=abs)
print(num_set)
# EasyRemoveSet({1, 2, 3, 4, 5, -2, -5, -4, -3, -1})

num_set.discard(-5)
print(num_set)
# EasyRemoveSet({1, 2, 3, 4, -2, -4, -3, -1})
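
One caveat worth sketching: the class above only overrides `add`, `remove`, `discard`, and `clear`. Inherited bulk methods such as `update` do not go through `add`, so elements added that way would be missing from the lookup. The condensed variant below is a hypothetical illustration, not part of the original answer (the name `CasefoldIndexedSet` and the fixed `casefold` key are assumptions). It routes `update` through `add` so the index stays in sync.

```python
class CasefoldIndexedSet(set):
    """Condensed, hypothetical variant of the class above (casefold key only)."""

    def __init__(self, iterable=()):
        super().__init__(iterable)
        self._lookup = {}  # casefolded key -> set of original strings
        for elem in self:
            self._index(elem)

    def _index(self, elem):
        self._lookup.setdefault(elem.casefold(), set()).add(elem)

    def add(self, elem):
        super().add(elem)
        self._index(elem)

    def update(self, *iterables):
        # Route bulk additions through add() so the lookup stays in sync.
        for iterable in iterables:
            for elem in iterable:
                self.add(elem)

    def discard(self, elem):
        for e in self._lookup.pop(elem.casefold(), ()):
            super().discard(e)


s = CasefoldIndexedSet(["abc"])
s.update(["ABC", "def"])
s.discard("Abc")
print(s)  # CasefoldIndexedSet({'def'})
```

In-place operators such as `|=` bypass `add` in the same way and would need similar treatment if they are used.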

score 0 · Accepted Answer · answered Mar 06 '23 at 22:57

You can also use regex.

import re

unwantedStrings = {"Articles containing video clip", "asdf"}
setA = {"tsunami", "articles containing video clip", "asdf", "asdfasdf", "asdfasddf"}

# remove the unwanted strings from the set
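# re.escape keeps characters such as "." or "(" in the strings from being read as regex syntax
regex = re.compile("|".join(map(lambda s: "^" + re.escape(s) + "$", unwantedStrings)), re.IGNORECASE)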
reducedSetA = set(filter(lambda x: not regex.search(x), setA))

print(reducedSetA)
# {'tsunami', 'asdfasddf', 'asdfasdf'}

The above code removes only exact matches. If you also want to remove "asdfasdf" because "asdf" is among the unwanted strings, you can change the regex line to this:

...
regex = re.compile("|".join(map(re.escape, unwantedStrings)), re.IGNORECASE)
...
# {'tsunami'}
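
For comparison, the exact-match case can also be sketched without regex by casefolding both sides and testing membership in a set. This is a hedged illustration, not the answerer's code; the "Straße"/"STRASSE" pair and the names `unwanted_folded` and `regex_reduced` are added here as assumptions. It also shows that `re.IGNORECASE` and `str.casefold` are not equivalent: casefolding maps "ß" to "ss", while the case-insensitive regex does not.

```python
import re

unwanted = {"Articles containing video clip", "Straße"}
setA = {"tsunami", "articles containing video clip", "STRASSE"}

# Casefold the unwanted strings once, then test each item with an O(1) lookup.
unwanted_folded = {s.casefold() for s in unwanted}
reduced = {s for s in setA if s.casefold() not in unwanted_folded}
print(reduced)  # {'tsunami'}

# The regex version with IGNORECASE does not treat "STRASSE" as equal to "Straße".
regex = re.compile("|".join(map(re.escape, unwanted)), re.IGNORECASE)
regex_reduced = {s for s in setA if not regex.fullmatch(s)}
print(sorted(regex_reduced))  # ['STRASSE', 'tsunami']
```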
