Is it possible to adapt the scikit-learn CountVectorizer for other features (not just n-grams)?

Question

I'm new to scikit-learn and to working with text data in general. I've been using scikit-learn's CountVectorizer as a starting point to get used to basic text features (n-grams), but I want to extend this to analyze other features.

I would prefer to adapt CountVectorizer rather than write my own, because then I wouldn't have to reimplement scikit-learn's tf-idf transformer and classifiers.

EDIT:

To be honest, I'm still deciding on specific features, but my project is style classification between documents. I know that lemmatizing and stemming are popular for feature extraction in text classification, so that may be one. Other features that I am thinking of analyzing include:

- Length of sentences per document within each style
- Distinct words per style. A more formal style may have a more eloquent and varied vocabulary
  - An offshoot of the previous point: counts of adjectives in particular
- Lengths of particular words. Again, slang might use much shorter phrases than a formal style
- Punctuation, especially marked pauses between statements, and lengths of statements

These are a few ideas I was thinking of, but I'm thinking of more features to test!

What kind of feature extraction do you want to do? – Andreas Mueller Apr 23 '15 at 02:45
@AndreasMueller I added some details! – Nice-kun Apr 23 '15 at 04:10

score 1 · Answer 1 · answered Apr 22 '15 at 18:05

You can easily extend the class (you can see the source of it here) and implement what you need. However, the right approach depends on what you want to do, which is not very clear from your question.
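As a rough illustration of extending the class, one option is to subclass CountVectorizer and override `build_analyzer`, which returns the callable that turns a document into tokens. The sketch below is only one way to do it: the class name `SuffixStrippingVectorizer` and the toy `strip_plural` helper are made up for this example, and the helper is a stand-in for a real stemmer or lemmatizer.

```python
from sklearn.feature_extraction.text import CountVectorizer


class SuffixStrippingVectorizer(CountVectorizer):
    """Hypothetical subclass: normalizes tokens before they are counted."""

    def build_analyzer(self):
        # Reuse the parent's preprocessing, tokenizing, and n-gram logic.
        base_analyzer = super().build_analyzer()

        def strip_plural(token):
            # Toy stand-in for a real stemmer or lemmatizer (an assumption).
            if token.endswith("s") and len(token) > 3:
                return token[:-1]
            return token

        # The returned callable maps one document to a list of features.
        return lambda doc: [strip_plural(tok) for tok in base_analyzer(doc)]


docs = ["Cats chase dogs", "A dog chases a cat"]
vectorizer = SuffixStrippingVectorizer()
counts = vectorizer.fit_transform(docs)  # scipy sparse matrix, as usual
vocabulary = sorted(vectorizer.vocabulary_)
```

Because the output is still a sparse count matrix, it can be passed to the tf-idf transformer and the classifiers unchanged. Note that with n-grams enabled the helper would see whole n-gram strings, so a real implementation would probably normalize at the tokenizer level instead.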

Tarantula

19,031
12
54
71

Hmm, yeah, I think my question is loaded. I need to give more thought to the specific types of features. I might need to change the tokenizer too, depending on what I'm looking for. Thank you! – Nice-kun Apr 22 '15 at 18:27

score 1 · Answer 2 · answered Apr 25 '15 at 21:00

Are you asking how to implement the features you listed as a scikit-learn compatible transformer? If so, have a look at the developer docs, in particular the section on rolling your own estimator.

You can just inherit from BaseEstimator and implement a fit and a transform method. That is only necessary if you want to use pipelining, though. To use sklearn classifiers and the tf-idf transformer, your feature extraction only needs to produce numpy arrays or scipy sparse matrices.
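A minimal sketch of such a transformer, assuming a few of the style features from the question (sentence length, distinct-word ratio, word length, punctuation count). The class name `StyleStats` and the regex-based sentence splitting are illustrative assumptions, not part of scikit-learn:

```python
import re

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class StyleStats(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: a few per-document style statistics."""

    def fit(self, X, y=None):
        # Nothing to learn; fit exists to satisfy the estimator interface.
        return self

    def transform(self, X):
        rows = []
        for doc in X:
            words = re.findall(r"\w+", doc.lower())
            # Naive split on ., !, ? (an assumption, not a real sentence tokenizer).
            sentences = [s for s in re.split(r"[.!?]+", doc) if s.strip()]
            n_words = max(len(words), 1)  # guard against empty documents
            rows.append([
                len(words) / max(len(sentences), 1),   # mean sentence length in words
                len(set(words)) / n_words,             # distinct-word ratio
                sum(len(w) for w in words) / n_words,  # mean word length
                len(re.findall(r"[,;:.!?]", doc)),     # punctuation count
            ])
        # One row per document, one column per feature.
        return np.array(rows, dtype=float)


docs = ["Hi there. How are you?", "Indeed, it is so."]
features = StyleStats().fit_transform(docs)
```

The result is a plain numpy array, so it can go straight into a classifier. To use it alongside CountVectorizer output, something like `sklearn.pipeline.FeatureUnion` or `scipy.sparse.hstack` should work, though these statistics would probably need scaling rather than tf-idf weighting.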

Andreas Mueller

27,470
8
62
74

Yes, that's exactly what I want to do! I'll look into those, thank you. I don't want to reinvent the wheel, so I'll be very happy if I can just use these to implement my own features. – Nice-kun Apr 25 '15 at 22:19