0

I'm new to scikit-learn and to working with text data in general. I've been using scikit-learn's CountVectorizer as a starting point to get used to basic text features (n-grams), but I want to extend it to analyze other features.

I would prefer to adapt CountVectorizer rather than write my own vectorizer, because then I wouldn't have to reimplement scikit-learn's tf-idf transformer and classifiers.

EDIT:

To be honest, I'm still thinking about the specific features, but for my project I want to do style classification between documents. I know that lemmatizing and stemming are popular preprocessing steps for text classification, so that may be one. Other features that I am thinking of analyzing include

  • Length of sentences per document within each style
  • Distinct words per style. A more formal style may have a more eloquent and varied vocabulary
    • An offshoot of the previous point: counts of adjectives in particular
  • Lengths of individual words. Again, slang might use much shorter words than a formal style
  • Punctuation, especially marked pauses between statements, lengths of statements

These are a few ideas I was thinking of, but I'm thinking of more features to test!

Nice-kun

2 Answers

1

You can easily extend the class (you can see its source here) and implement what you need. However, how to do that depends on what you want to achieve, which is not very clear from your question.
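As a rough illustration of extending the class, here is a minimal sketch that subclasses CountVectorizer and overrides `build_tokenizer` so that tokens are stemmed before counting. The `crude_stem` helper and the `StemmedCountVectorizer` name are hypothetical and exist only for this example; a real project would more likely plug in a proper stemmer or lemmatizer.

```python
from sklearn.feature_extraction.text import CountVectorizer


def crude_stem(token):
    """Hypothetical toy stemmer: strips a few common English suffixes."""
    for suffix in ("ing", "ed", "s"):
        # Only strip when at least three characters would remain.
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


class StemmedCountVectorizer(CountVectorizer):
    """CountVectorizer variant that stems each token before n-grams are built."""

    def build_tokenizer(self):
        # Reuse the default tokenizer, then post-process its output.
        tokenize = super().build_tokenizer()
        return lambda doc: [crude_stem(tok) for tok in tokenize(doc)]


docs = ["The cats walked home", "A cat is walking"]
vectorizer = StemmedCountVectorizer()
X = vectorizer.fit_transform(docs)  # scipy sparse matrix of counts
vocab = sorted(vectorizer.vocabulary_)
```

Because the result is still a CountVectorizer, its output can be passed to TfidfTransformer and the classifiers unchanged.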

Tarantula
  • Hmm, yeah, I think my question is loaded. I need to give more thought to the specific types of features. I might need to change the tokenizer too, depending on what I'm looking for. Thank you! – Nice-kun Apr 22 '15 at 18:27
1

Are you asking how to implement the features you listed as a scikit-learn compatible transformer? If so, have a look at the developer docs, in particular the section on rolling your own estimator.

You can just inherit from BaseEstimator and implement a fit and a transform method. That is only necessary if you want to use pipelining, though. To use sklearn classifiers and the TfidfTransformer, your feature extraction only needs to produce NumPy arrays or SciPy sparse matrices.
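A minimal sketch of that approach, assuming a few of the style features from the question (sentence length, vocabulary variety, word length, punctuation rate). The `StyleStats` class name and the regular expressions are illustrative choices, not part of scikit-learn; FeatureUnion is used here only to show how such a transformer can sit next to CountVectorizer.

```python
import re

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import FeatureUnion


class StyleStats(BaseEstimator, TransformerMixin):
    """Hypothetical transformer producing four simple style statistics per document."""

    def fit(self, X, y=None):
        # Nothing to learn; the statistics are computed per document.
        return self

    def transform(self, X):
        rows = []
        for doc in X:
            # Naive sentence split on ., ! and ? (an assumption, not robust).
            sentences = [s for s in re.split(r"[.!?]+", doc) if s.strip()]
            words = re.findall(r"\w+", doc.lower())
            n_words = max(len(words), 1)  # avoid division by zero
            rows.append([
                len(words) / max(len(sentences), 1),        # words per sentence
                len(set(words)) / n_words,                  # distinct-word ratio
                sum(len(w) for w in words) / n_words,       # mean word length
                len(re.findall(r"[,;:]", doc)) / n_words,   # pause marks per word
            ])
        return np.array(rows, dtype=float)


docs = ["Hi. Yo, sup?", "Nevertheless, the committee deliberated extensively."]
stats = StyleStats().fit_transform(docs)

# Combine word counts and style statistics into one feature matrix.
features = FeatureUnion([("counts", CountVectorizer()), ("style", StyleStats())])
X = features.fit_transform(docs)
```

The combined matrix can then be fed to any sklearn classifier, or the FeatureUnion can be placed as a step inside a Pipeline.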

Andreas Mueller
  • Yes, that's exactly what I want to do! I'll look into those, thank you. I don't want to reinvent the wheel, so I'll be very happy if I can just use these to implement my own features. – Nice-kun Apr 25 '15 at 22:19