
I have a classification problem where my labels are ratings, 0-100, in increments of 1 (e.g. 1, 2, 3, 4).

I have a data set where each row has a name, text corpus, and a rating (0 - 100).

From the text corpus I am trying to extract features that I can feed into my classifier, which will output a corresponding rating per row (0 - 100).

For feature selection, I am thinking of starting with a basic bag of words. My question is about the classification algorithm, however. Is there a classification algorithm in scikit-learn that supports this kind of problem?

I was reading http://scikit-learn.org/stable/modules/multiclass.html, but the algorithms described seem to support labels that are completely discrete, whereas I have a set of continuous labels.

EDIT: What about the case where I bin my ratings? For example, I could have 10 labels, 1-10, with each label covering a block of 10 ratings.
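(A minimal sketch of one way such binning could be done with NumPy; the 10 equal-width bins are just an illustration.)

```python
import numpy as np

ratings = np.array([3, 27, 55, 90, 100])   # original 0-100 labels

# 10 equal-width bins: bin 0 covers 0-9, bin 1 covers 10-19, ..., bin 9 covers 90-100
bin_edges = np.arange(10, 100, 10)         # [10, 20, ..., 90]
binned = np.digitize(ratings, bin_edges)
print(binned)                              # [0 2 5 9 9]
```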

jeffrey
  • If you acknowledge your classes are continuous, why not use regression instead? – Artem Sobolev Nov 04 '14 at 08:22
  • Ah, I am not familiar with regression, but it seems like the natural solution to this problem? – jeffrey Nov 04 '14 at 17:22
  • Yes, when your target variable is some sort of continuous value where small deviations don't matter (it's okay to predict 36 instead of 37, but it's not okay to predict 90 instead of 11). What you really want is not to minimize the probability of predicting the wrong value, but the probability of predicting a distant one. And this is what regression algorithms are for. Any algorithm whose name ends with Regressor will work. – Artem Sobolev Nov 04 '14 at 18:49

2 Answers


You can preprocess your data with OneHotEncoder to convert your single 1-to-100 rating into 100 binary indicators, one for each value in the interval [1..100]. Then you have 100 labels and can learn a multiclass classifier.
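A minimal sketch of the multiclass route, assuming a bag-of-words pipeline in scikit-learn (the vectorizer and classifier are just example choices; note that scikit-learn classifiers accept the integer ratings directly as class labels, so the target itself does not need to be one-hot encoded for this to run):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["loved it", "hated it", "it was fine"]   # toy corpus for illustration
ratings = [98, 4, 55]                             # each distinct rating is treated as a class

# Bag-of-words features feeding a multiclass classifier
clf = make_pipeline(CountVectorizer(), LogisticRegression())
clf.fit(texts, ratings)

print(clf.predict(["loved the whole thing"]))     # predicts one of the seen classes
```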

Though, I suggest using regression instead.

Artem Sobolev

You can use multivariate regression instead of classification. You can cluster the n-gram features from the text corpus to form a dictionary and use it to build a feature set. With this feature set, train a regression model whose output can be a continuous value. You can then round the output real number to get a discrete label in 1-100.
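A rough sketch of this idea with scikit-learn (it skips the clustering step and uses an n-gram vectorizer directly; the vectorizer, regressor, and toy data are assumptions, not prescriptions):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

texts = ["excellent service", "average at best", "never again"]   # toy corpus
ratings = [95, 50, 5]                                             # 1-100 targets

# Unigram + bigram features feeding a regularised linear regressor
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), Ridge())
model.fit(texts, ratings)

raw = model.predict(["good service overall"])
labels = np.clip(np.rint(raw), 1, 100).astype(int)   # round and clamp back to 1-100
print(labels)
```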

  • Ah, it seems that multivariate regression is indeed a more natural solution. Would scikit-learn's regression take care of this? I am assuming logistic regression is a classification algorithm and not what you are referring to. – jeffrey Nov 04 '14 at 17:24
  • Yes, logistic regression is a classification algorithm. You can try linear regression, ridge regression, or random forest regression. – Andreas Mueller Nov 04 '14 at 23:59
  • Look at scikit-learn.org/stable/modules/linear_model.html for linear and polynomial regression. You might have to try different polynomial models to find the one that suits you best. I think you should start with a linear model first and then try other polynomial variants later. Another suggestion would be to look at regression forests if this doesn't work for your needs (a quick comparison sketch follows below). – Mujtaba Hasan Nov 05 '14 at 08:27
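A hedged sketch of comparing a few of the regressors mentioned above by cross-validation (the synthetic data, model settings, and metric are assumptions for illustration; in practice X would be the bag-of-words / n-gram matrix built from the text corpus):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the real feature matrix and 0-100 ratings
X, y = make_regression(n_samples=200, n_features=50, noise=10.0, random_state=0)

models = [LinearRegression(), Ridge(alpha=1.0),
          RandomForestRegressor(n_estimators=100, random_state=0)]

for model in models:
    scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_absolute_error")
    print(f"{type(model).__name__}: MAE {-scores.mean():.2f}")
```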