I am new to nltk and scikit-learn. I am trying to understand the order in which CountVectorizer executes its various processing steps:
- Tokenizing - say, custom tokenizing that removes words shorter than 3 characters. Also, does the CountVectorizer by default keep hyphenated and underscored words intact, e.g. Aug-2015, GPA 3.9, etc.? (See the tokenizer sketch after this list.)
- Handling upper and lower case (the lowercase parameter)
- Removal of stop words
- Removing words based on document frequency - max_df and min_df
- Finding bigrams (ngram_range)
- Stemming - if added either as part of a custom tokenizer definition or through the analyzer, as described in this post: https://stackoverflow.com/questions/36182502/add-stemming-support-to-countvectorizer-sklearn (see the stemming sketch below)
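For reference, here is a minimal sketch I put together to probe the default tokenization (the documents and `my_tokenizer` are made-up examples; the default `token_pattern` is `r"(?u)\b\w\w+\b"`):

```python
from sklearn.feature_extraction.text import CountVectorizer
import re

# Made-up documents just to probe the behaviour.
docs = ["Graduated Aug-2015 with GPA 3.9", "graduated in aug 2015"]

# Default token_pattern r"(?u)\b\w\w+\b": \w keeps underscores inside
# tokens, but hyphens and periods split, so "Aug-2015" -> "aug", "2015",
# and single-character tokens like "9" are dropped.
default_vec = CountVectorizer()
default_vec.fit(docs)
print(default_vec.get_feature_names_out())  # get_feature_names() on older scikit-learn
# ['2015', 'aug', 'gpa', 'graduated', 'in', 'with']

# A custom tokenizer that keeps hyphenated words together and drops
# tokens shorter than 3 characters; token_pattern is then ignored.
def my_tokenizer(text):
    return [t for t in re.findall(r"[\w.-]+", text) if len(t) >= 3]

custom_vec = CountVectorizer(tokenizer=my_tokenizer)
custom_vec.fit(docs)
print(custom_vec.get_feature_names_out())
# ['3.9', 'aug-2015', 'gpa', 'graduated', 'with']
```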
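And this is the stemming variant I am experimenting with, adapted from the linked post (a sketch assuming NLTK's SnowballStemmer; the class name is my own):

```python
from sklearn.feature_extraction.text import CountVectorizer
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")

class StemmedCountVectorizer(CountVectorizer):
    def build_analyzer(self):
        # Wrap the stock analyzer so stemming runs after lowercasing,
        # tokenization and stop-word removal; min_df/max_df pruning then
        # happens on the stemmed vocabulary during fit().
        analyzer = super().build_analyzer()
        return lambda doc: [stemmer.stem(token) for token in analyzer(doc)]

vec = StemmedCountVectorizer(stop_words="english", min_df=1)
print(vec.build_analyzer()("Graduates graduated, graduating in Aug-2015"))
# ['graduat', 'graduat', 'graduat', 'aug', '2015']
```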