
I am new to nltk and scikit-learn. I am trying to understand the order in which the CountVectorizer applies its various parameters:

  1. Tokenizing - say, a custom tokenizer that removes words shorter than 3 characters. By default, does the CountVectorizer allow hyphenated and underscored words, e.g. Aug-2015, GPA 3.9, etc.?
  2. Handling upper and lower case
  3. Removal of stop words
  4. Removing words based on document frequency - max_df and min_df
  5. Finding bigrams
  6. Stemming - if added either as part of a custom tokenizer definition or through the analyzer, as shown in this post: https://stackoverflow.com/questions/36182502/add-stemming-support-to-countvectorizer-sklearn
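To make the question concrete, here is a minimal pure-Python sketch of the order I understand the analyzer to apply these steps: preprocessing (lowercasing) first, then tokenization (where a custom tokenizer's short-word filter or stemmer would run), then stop-word removal, then n-gram generation, with min_df/max_df applied last on the counted vocabulary. The documents, stop list, and length filter below are assumptions for illustration, not scikit-learn internals:

```python
import re
from collections import Counter

# Two tiny example documents (assumed for illustration).
docs = ["Aug-2015 GPA 3.9 scores", "GPA scores improving in Aug-2015"]

# CountVectorizer's default token_pattern keeps tokens of 2+ word characters.
# \w covers letters, digits and "_", so "Aug-2015" splits on the hyphen into
# "Aug" and "2015", while an underscored word would stay whole.
default_pattern = r"(?u)\b\w\w+\b"
print(re.findall(default_pattern, "Aug-2015 GPA 3.9"))  # ['Aug', '2015', 'GPA']

stop_words = {"in"}  # assumed tiny stop list for the sketch

def analyze(doc, ngram_range=(1, 2)):
    """Mimic the analyzer pipeline order used by CountVectorizer."""
    doc = doc.lower()                                    # 1. preprocessing (lowercase=True)
    tokens = re.findall(default_pattern, doc)            # 2. tokenization
    tokens = [t for t in tokens if len(t) >= 3]          #    (a custom tokenizer's <3-char filter runs here)
    tokens = [t for t in tokens if t not in stop_words]  # 3. stop-word removal
    grams = []                                           # 4. n-gram generation
    for n in range(ngram_range[0], ngram_range[1] + 1):
        grams += [" ".join(tokens[i:i + n])
                  for i in range(len(tokens) - n + 1)]
    return grams

# 5. min_df / max_df are applied last, on document frequencies counted over
#    the analyzed output (here: min_df=2 keeps only terms found in both docs).
df = Counter()
for d in docs:
    df.update(set(analyze(d)))
vocab = sorted(t for t, c in df.items() if c >= 2)
print(vocab)
```

A stemmer (step 6) would plug in alongside the tokenization step above, which is why the linked post wires it into the tokenizer or analyzer.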
