In the past, you were expected to pass a set as the stop_words parameter of sklearn.feature_extraction.text.CountVectorizer(). For example, see the answers to this SO question: "adding words to stop_words list in TfidfVectorizer in sklearn". However, if you pass a set as stop_words now, it raises an error: The 'stop_words' parameter of CountVectorizer must be a str among {'english'}, an instance of 'list' or None.
For example, see the answer to this question: "how can I solve the error: The 'stop_words' parameter of TfidfVectorizer must be a str among {'english'}, an instance of 'list' or None?"
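To make the behavior concrete, here is a minimal sketch of what I am seeing. The documents and stop words are made up for illustration, and `fit_vocabulary` is just a hypothetical helper. I am assuming the error is a subclass of ValueError, so the sketch falls back to a list when the set is rejected; on an older release the set is simply accepted.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus and custom stop words (illustrative values only).
docs = ["the cat sat on the mat", "the dog ate the bone"]
custom_stop_words = {"the", "on"}  # a set, as the older answers suggest


def fit_vocabulary(stop_words):
    """Fit a CountVectorizer and return the learned vocabulary as a set."""
    vectorizer = CountVectorizer(stop_words=stop_words)
    vectorizer.fit(docs)
    return set(vectorizer.vocabulary_)


try:
    # Accepted by older releases; rejected by the release I am using.
    vocabulary = fit_vocabulary(custom_stop_words)
    set_was_accepted = True
except ValueError as exc:
    # Assumption: the parameter error derives from ValueError.
    print(f"set rejected: {exc}")
    set_was_accepted = False
    # Workaround: pass the same words as a list instead.
    vocabulary = fit_vocabulary(sorted(custom_stop_words))

print(set_was_accepted, sorted(vocabulary))
```

Either way the resulting vocabulary is the same, so converting the set to a list works as a stopgap. It does not explain when or why the set stopped being accepted.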
Can you point me to the specific commit(s) in sklearn where this change was introduced?
What I have tried
I looked into the scikit-learn repo. I want to understand why these vectorizers no longer accept a set for stop_words, because this function is still there and seems to accept any collection:
def _check_stop_list(stop):
    if stop == "english":
        return ENGLISH_STOP_WORDS
    elif isinstance(stop, str):
        raise ValueError("not a built-in stop list: %s" % stop)
    elif stop is None:
        return None
    else:  # assume it's a collection
        return frozenset(stop)