How to set custom stop words for sklearn CountVectorizer?

Question

I'm trying to run LDA (Latent Dirichlet Allocation) on a non-English text dataset.

From sklearn's tutorial, there's this part where you count term frequency of the words to feed into the LDA:

tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2,
                            max_features=n_features,
                            stop_words='english')

This uses the built-in stop words feature, which I think is only available for English. How can I use my own stop word list instead?

oh my, yeah it worked! should've read the documentation better next time. — troll, Oct 19 '16 at 07:18

score 23 · Accepted Answer · edited Dec 09 '22 at 10:23


You can simply pass a list of your own words to the stop_words parameter, e.g.:

tf_vectorizer = CountVectorizer(stop_words=["word1", "word2", "word3"])
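
As a minimal sketch of how such a list plugs into the vectorizer from the question: the Spanish stop words and the three sample documents below are made up for illustration, and the max_df/min_df filters from the question are left out so that the tiny corpus is not emptied.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stop word list for a non-English corpus (illustrative only).
custom_stop_words = ["el", "la", "de", "en"]

# Tiny made-up corpus, just to show the effect of the stop word list.
docs = [
    "el gato duerme en la casa",
    "la casa de el perro",
    "el perro duerme",
]

# Pass the list directly instead of the built-in 'english' option.
tf_vectorizer = CountVectorizer(stop_words=custom_stop_words)
tf = tf_vectorizer.fit_transform(docs)

# The learned vocabulary no longer contains any of the custom stop words.
vocabulary = sorted(tf_vectorizer.vocabulary_)
print(vocabulary)
```

The resulting term-count matrix `tf` can then be fed to LDA in the same way as in the tutorial.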

answered Oct 19 '16 at 07:20 by Wiktor Stribiżew, edited Dec 09 '22 at 10:23 by nivalderramas

Why a frozenset and not just a list? According to [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) a list is enough – nivalderramas Dec 08 '22 at 21:08

@nivalderramas Yeah, my link no longer works; previously, it showed the source code where `frozenset` was used. – Wiktor Stribiżew Dec 08 '22 at 21:16
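
On the list-versus-frozenset point, here is a small sketch, assuming a recent scikit-learn release whose documentation describes stop_words as a list. If the words are held in a set or frozenset (for example, to merge several sources and drop duplicates), converting to a list before passing it in is the conservative choice. The German word lists and sample sentences are invented for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two hypothetical stop word sources, merged as a frozenset to drop duplicates.
base_words = frozenset(["und", "der", "die"])
extra_words = frozenset(["die", "das"])
merged = base_words | extra_words

# Convert to a (sorted) list before handing it to the vectorizer.
stop_word_list = sorted(merged)

vectorizer = CountVectorizer(stop_words=stop_word_list)
vectorizer.fit(["der hund und die katze", "das haus"])

# Only the non-stop-word tokens remain in the vocabulary.
remaining = sorted(vectorizer.vocabulary_)
print(remaining)
```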
