I am new to nltk and scikit-learn. I am trying to understand the order in which CountVectorizer executes its various processing steps:
- Tokenizing - say, custom tokenizing that removes words shorter than 3 characters. Also, does the CountVectorizer by default keep hyphenated and underscored words intact, e.g. Aug-2015, GPA 3.9, etc.? (See the tokenizer sketch after this list.)
- Handling upper and lower case (the lowercase parameter)
- Removal of stop words
- Removing words based on document frequency - max_df and min_df
- Finding bigrams (ngram_range)
- Stemming - if added either as part of a custom tokenizer definition or through the analyzer, as described in this post: https://stackoverflow.com/questions/36182502/add-stemming-support-to-countvectorizer-sklearn (see the stemming sketch below)
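For reference, here is a minimal sketch I put together to probe the default tokenization (the documents and `my_tokenizer` are made-up examples; the default `token_pattern` is `r"(?u)\b\w\w+\b"`):

```python
from sklearn.feature_extraction.text import CountVectorizer
import re

# Made-up documents just to probe the behaviour.
docs = ["Graduated Aug-2015 with GPA 3.9", "graduated in aug 2015"]

# Default token_pattern r"(?u)\b\w\w+\b": \w keeps underscores inside
# tokens, but hyphens and periods split, so "Aug-2015" -> "aug", "2015",
# and single-character tokens like "9" are dropped.
default_vec = CountVectorizer()
default_vec.fit(docs)
print(default_vec.get_feature_names_out())  # get_feature_names() on older scikit-learn
# ['2015', 'aug', 'gpa', 'graduated', 'in', 'with']

# A custom tokenizer that keeps hyphenated words together and drops
# tokens shorter than 3 characters; token_pattern is then ignored.
def my_tokenizer(text):
    return [t for t in re.findall(r"[\w.-]+", text) if len(t) >= 3]

custom_vec = CountVectorizer(tokenizer=my_tokenizer)
custom_vec.fit(docs)
print(custom_vec.get_feature_names_out())
# ['3.9', 'aug-2015', 'gpa', 'graduated', 'with']
```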
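And this is the stemming variant I am experimenting with, adapted from the linked post (a sketch assuming NLTK's SnowballStemmer; the class name is my own):

```python
from sklearn.feature_extraction.text import CountVectorizer
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")

class StemmedCountVectorizer(CountVectorizer):
    def build_analyzer(self):
        # Wrap the stock analyzer so stemming runs after lowercasing,
        # tokenization and stop-word removal; min_df/max_df pruning then
        # happens on the stemmed vocabulary during fit().
        analyzer = super().build_analyzer()
        return lambda doc: [stemmer.stem(token) for token in analyzer(doc)]

vec = StemmedCountVectorizer(stop_words="english", min_df=1)
print(vec.build_analyzer()("Graduates graduated, graduating in Aug-2015"))
# ['graduat', 'graduat', 'graduat', 'aug', '2015']
```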