
I have a multi-class text classification/categorization problem. I have a set of ground-truth data with K mutually exclusive classes. The problem is unbalanced in two respects. First, some classes are far more frequent than others. Second, some classes are of more interest to us than others (interest generally correlates positively with relative frequency, although a few classes of interest are fairly rare).

My goal is to develop a single classifier, or a collection of them, that can classify the k << K classes of interest with high precision (at least 80%) while maintaining reasonable recall (what counts as "reasonable" is admittedly a bit vague).

The features I use are mostly typical unigram/bigram features, plus some binary features derived from the metadata of the incoming documents (e.g. whether they were submitted via email or through a webform). For concreteness, the feature setup is roughly like the sketch below.
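A simplified sketch of what I mean (the column names "text", "via_email", "via_webform" and the toy rows are made up for illustration, not my actual schema):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-in for the real documents plus their binary metadata flags.
docs = pd.DataFrame({
    "text": ["please reset my password", "invoice attached", "cancel my order"],
    "via_email": [1, 0, 1],
    "via_webform": [0, 1, 0],
})

features = ColumnTransformer([
    ("ngrams", TfidfVectorizer(ngram_range=(1, 2)), "text"),   # unigram/bigram features
    ("meta", "passthrough", ["via_email", "via_webform"]),     # binary metadata features
])
X = features.fit_transform(docs)
print(X.shape)
```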

Because of the unbalanced data, I am leaning toward training a separate binary classifier for each of the important classes, rather than a single multi-class model such as a multi-class SVM.

Which learning algorithms (binary or not) implemented in scikit-learn allow training to be tuned for precision (as opposed to, say, recall or F1), and what options do I need to set for that?

Which tools in scikit-learn can be used for feature selection, to narrow down the features that are most relevant to precision-oriented classification of a particular class?

This is not really a "big data" problem: K is about 100, k is about 15, the total number of samples available to me for training and testing is about 100,000.

Thanks.


1 Answer


Given that k is small, I would just do this manually. For each class of interest, train an individual one-vs-the-rest classifier, look at its precision-recall curve, and then choose the threshold that gives the desired precision.
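For example, a minimal sketch along these lines (the toy data, TF-IDF features, and logistic regression are just placeholders; any classifier exposing predict_proba or decision_function works the same way):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Toy stand-in data; replace with the real documents, labels, and class of interest.
texts = ["refund request", "billing issue with invoice",
         "password reset help", "limited time offer"] * 200
labels = np.array([0, 1, 2, 3] * 200)
target_class = 1

# One vs. the rest: the class of interest against everything else.
y_binary = (labels == target_class).astype(int)
X_train, X_val, y_train, y_val = train_test_split(
    texts, y_binary, test_size=0.2, stratify=y_binary, random_state=0)

clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(class_weight="balanced", max_iter=1000))
clf.fit(X_train, y_train)

# Precision-recall curve on held-out data; each point corresponds to a threshold.
scores = clf.predict_proba(X_val)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_val, scores)

# Pick the lowest threshold whose precision meets the target; since recall only
# decreases as the threshold rises, this keeps recall as high as possible.
ok = precision[:-1] >= 0.80   # precision has one more entry than thresholds
chosen = thresholds[ok][0] if ok.any() else None
print("chosen threshold:", chosen)
```

At prediction time you would then apply the chosen threshold to predict_proba output yourself, rather than using the classifier's default predict.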

James Atwood