In which commit(s) did sklearn change the expected type of stop_words in CountVectorizer from set to list?

Question

In the past, you were expected to pass a set as the stop_words parameter of sklearn.feature_extraction.text.CountVectorizer(). For example, see the answers to this SO question: adding words to stop_words list in TfidfVectorizer in sklearn.

However, if you pass a set as stop_words now, it raises an error: The 'stop_words' parameter of CountVectorizer must be a str among {'english'}, an instance of 'list' or None. For example, see the answer to this question: how can I solve the error: The 'stop_words' parameter of TfidfVectorizer must be a str among {'english'}, an instance of 'list' or None?

Can you point me to the specific commit(s) in sklearn where this change was introduced?
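To make the behavior concrete, here is a minimal sketch of what I mean. The documents and the custom stop words are made up for illustration, and whether the set is rejected depends on the installed scikit-learn version, so the snippet only records the outcome instead of assuming it.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy data, invented for this example.
docs = ["the cat sat on the mat", "the dog sat on the log"]
custom_stop_words = {"the", "on"}

# Passing a list is accepted; sorted() turns the set into a list.
vectorizer = CountVectorizer(stop_words=sorted(custom_stop_words))
vectorizer.fit(docs)
vocab = sorted(vectorizer.vocabulary_)
print(vocab)

# Passing the set directly: record whether this version accepts it.
try:
    CountVectorizer(stop_words=custom_stop_words).fit(docs)
    set_accepted = True
    message = ""
except (TypeError, ValueError) as exc:
    set_accepted = False
    message = str(exc)

print("set accepted:", set_accepted)
print(message)
```

On a version that rejects the set, the printed message should be the error quoted above.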

What I have tried
I looked into the scikit-learn repo. The function below is still there, and it seems to accept any collection, so I want to understand why these vectorizers no longer accept a set for stop_words.

def _check_stop_list(stop):
    if stop == "english":
        return ENGLISH_STOP_WORDS
    elif isinstance(stop, str):
        raise ValueError("not a built-in stop list: %s" % stop)
    elif stop is None:
        return None
    else:  # assume it's a collection
        return frozenset(stop)

Someone has voted to close this as seeking recommendations; I don't get it... — desertnaut, Aug 14 '23 at 01:12

0 Answers