I'm new to scikit-learn and to working with text data in general. I've been using scikit-learn's CountVectorizer as a starting point to get used to basic text features (n-grams), but I want to extend it to extract other features as well.
I would prefer to adapt CountVectorizer rather than write my own vectorizer, because then I wouldn't have to reimplement scikit-learn's TF-IDF transformer and classifiers.
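To make the question concrete, here is a rough sketch of the kind of adaptation I have in mind: subclassing CountVectorizer and overriding `build_tokenizer`, so that n-gram generation and the downstream TF-IDF step keep working unchanged. The class name `StemmedCountVectorizer` and the helper `crude_stem` are made up for illustration; `crude_stem` is only a stand-in for a real stemmer or lemmatizer, and I'm not sure this is the recommended way to extend the class.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import make_pipeline


def crude_stem(token):
    """Hypothetical placeholder: strip a few common English suffixes."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


class StemmedCountVectorizer(CountVectorizer):
    """CountVectorizer whose tokens are normalized before counting."""

    def build_tokenizer(self):
        # Reuse the default tokenizer, then post-process each token.
        tokenize = super().build_tokenizer()
        return lambda doc: [crude_stem(tok) for tok in tokenize(doc)]


docs = ["The cats were running", "A cat runs"]

# N-grams are built from the stemmed tokens; TF-IDF is applied as usual.
pipeline = make_pipeline(
    StemmedCountVectorizer(ngram_range=(1, 2)),
    TfidfTransformer(),
)
tfidf = pipeline.fit_transform(docs)
```

With this, "cats" and "cat" end up in the same column, and the rest of the pipeline does not need to know that the tokenizer changed.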
EDIT:
To be honest, I'm still deciding on the specific features, but for my project I want to do style classification between documents. I know that lemmatizing and stemming are popular preprocessing steps for text classification, so those may be among them. Other features that I am thinking of analyzing include:
- Length of sentences per document within each style
- Distinct words per style. A more formal style may have a more eloquent and varied vocabulary
- An offshoot of the previous point, but counts of adjectives in particular
- Lengths of particular words. Again, slang might use much shorter words and phrases than a formal style
- Punctuation, especially marks that signal pauses between statements, and the lengths of statements
These are a few ideas so far, and I'm still coming up with more features to test!
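As a minimal sketch of what I mean by some of the features listed above, the function below computes a few per-document statistics using only the standard library. The function name `style_features`, the feature names, and the regular expressions are my own assumptions, and the sentence splitting is deliberately naive. Adjective counts are left out because they would need a part-of-speech tagger.

```python
import re
import string


def style_features(doc):
    """Return a few hand-crafted style statistics for one document."""
    # Naive sentence split on runs of ., ! or ? (an assumption, not robust).
    sentences = [s for s in re.split(r"[.!?]+", doc) if s.strip()]
    words = re.findall(r"[A-Za-z']+", doc.lower())
    n_words = max(len(words), 1)  # avoid division by zero on empty input
    return {
        "avg_sentence_len": len(words) / max(len(sentences), 1),
        "type_token_ratio": len(set(words)) / n_words,  # vocabulary variety
        "avg_word_len": sum(len(w) for w in words) / n_words,
        "punct_per_word": sum(ch in string.punctuation for ch in doc) / n_words,
    }


features = style_features("Hi there. How are you?")
```

Dictionaries like this could presumably be turned into a matrix with scikit-learn's `DictVectorizer` and combined with the n-gram counts, for example through `FeatureUnion`, but I haven't worked out whether that is the right approach.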