6

The target variable that I need to predict is a probability (as opposed to a label). The corresponding column in my training data is also in this form. I do not want to lose information by thresholding the targets to turn this into a classification problem.

If I train the logistic regression classifier with binary labels, the scikit-learn logistic regression API allows obtaining the probabilities at prediction time. However, I need to train it with probabilities. Is there a way to do this in scikit-learn, or a suitable Python package that scales to 100K data points of 1K dimensions?

san
  • 4,144
  • 6
  • 32
  • 50
  • Just use any regressor (working on continuous targets). Logreg is not one of those (despite the name)! – sascha Dec 05 '17 at 22:27
  • @Sascha I want the regressor to use the structure of the problem. One such structure is that the targets are probabilities. – san Dec 05 '17 at 22:31
  • And what is the information-theoretic difference (except for bounds)? – sascha Dec 05 '17 at 22:35
  • Maybe I understand this incorrectly, but what stops you from adding `proba_Label1`, `proba_Label2` etc. as `pd.Series` concatenated to your training dataframe? You can use them like any other numerical label in `LogisticRegressionClassifier.fit()`. Afterwards you can `predict` or `predict_proba`, whatever your needs are. – jo9k Dec 05 '17 at 22:36
  • @sascha if you had to ask that, you would need a lot of background. Unfortunately SO is not the right venue for me to answer that. – san Dec 05 '17 at 22:41
  • You are asking to have multiple regression outputs per sample. That's called multi-output regression. At present scikit-learn doesn't have any built-in algorithm to handle them. You can use [MultiOutputRegressor](http://scikit-learn.org/stable/modules/multiclass.html#multioutput-regression), but that will use a separate regressor for each target, so I don't know if that can handle the relationships between all the outputs. – Vivek Kumar Dec 06 '17 at 02:09
  • This question should not be down-voted. It is a very legitimate question with obvious applications. For example, imagine the relatively obvious application of estimating a rating function from 0 to 1 where your training data may have decimal values. I'm really disappointed that this question was down-voted by the community. – David R Jan 13 '18 at 18:37

3 Answers

2

I want the regressor to use the structure of the problem. One such structure is that the targets are probabilities.

You can't use cross-entropy loss with non-indicator probabilities in scikit-learn; this is not implemented and not supported by the API. It is a limitation of scikit-learn.

In general, according to scikit-learn's docs a loss function is of the form Loss(prediction, target), where prediction is the model's output, and target is the ground-truth value.

In the case of logistic regression, prediction is a value on (0,1) (i.e., a "soft label"), while target is 0 or 1 (i.e., a "hard label").


For logistic regression you can approximate probabilities as targets by oversampling instances according to the probabilities of their labels. E.g. if for a given sample `class_1` has probability 0.2 and `class_2` has probability 0.8, then generate 10 training instances (copies of the sample): 8 with `class_2` as the "ground truth target label" and 2 with `class_1`.

Obviously it is a workaround and is not extremely efficient, but it should work properly.
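As an illustration (my own sketch, not from the original answer), assuming `X` is an (n_samples, n_features) array and `p` holds the class-1 probabilities:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_with_probabilities(X, p, n_copies=10):
    """Oversample each row according to its probability, then fit a plain LogisticRegression."""
    X_rep, y_rep = [], []
    for x_i, p_i in zip(X, p):
        n_pos = int(round(n_copies * p_i))          # copies labelled 1
        X_rep.extend([x_i] * n_copies)
        y_rep.extend([1] * n_pos + [0] * (n_copies - n_pos))
    clf = LogisticRegression()
    clf.fit(np.asarray(X_rep), np.asarray(y_rep))
    return clf
```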

If you're OK with the upsampling approach, you can `pip install eli5` and use `eli5.lime.utils.fit_proba` with a Logistic Regression classifier from scikit-learn.


An alternative solution is to implement (or find an implementation of) LogisticRegression in TensorFlow, where you can define the loss function however you like.
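For example (a sketch of my own, not the answer author's code), a single sigmoid unit trained with binary cross-entropy is exactly logistic regression, and Keras's `binary_crossentropy` accepts soft (probabilistic) targets directly:

```python
import tensorflow as tf

n_features = 1000  # dimensionality assumed from the question

model = tf.keras.Sequential([
    tf.keras.layers.Dense(1, activation="sigmoid", input_shape=(n_features,))
])
model.compile(optimizer="adam", loss="binary_crossentropy")

# X: (n_samples, n_features) float array, p: (n_samples,) probabilities in [0, 1]
# model.fit(X, p, epochs=10, batch_size=256)
```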


In compiling this solution I worked from the answers to scikit-learn - multinomial logistic regression with probabilities as a target variable and scikit-learn classification on soft labels. I advise reading those for more insight.

jo9k
  • 690
  • 6
  • 19
  • 1
    Thanks, that was very helpful. I would be loath to go the up-sampling route though. – san Dec 06 '17 at 09:59
  • I strongly advise using TensorFlow then. Implementing logistic regression with a custom cross-entropy (working on non-indicator probabilities) as the loss function should take fewer than a hundred lines of code. – jo9k Dec 06 '17 at 10:36
  • 1
    One annoying issue with over-sampling is that it requires an extra step in X-validation. You cannot just generate the data and do standard cross-validation because the points are no longer independent. You have to manually break up your base training data into train/test sets and over-sample after the separation. That being said, this is (sadly) the most common simple solution if you are working in a platform that does not allow custom losses. – David R Jan 13 '18 at 18:40
  • @DavidR That is indeed a common bug. Have encountered it often. – san Jan 22 '18 at 12:39
2

This is an excellent question because (contrary to what people might believe) there are many legitimate uses of logistic regression as.... regression!

There are three basic approaches you can use if you insist on true logistic regression, and two additional options that should give similar results. They all assume your target output is between 0 and 1. Most of the time you will have to generate training/test sets "manually," unless you are lucky enough to be using a platform that supports SGD-R with custom kernels and X-validation support out-of-the-box.

Note that given your particular use case, the "not quite true logistic regression" options may be necessary. The downside of these approaches is that it takes more work to see the weight/importance of each feature in case you want to reduce your feature space by removing weak features.

Direct Approach using Optimization

If you don't mind doing a bit of coding, you can just use scipy's optimize functions. This is dead simple:

  1. Create a function of the following type: y_o = inverse-logit(a_0 + a_1*x_1 + a_2*x_2 + ...)

where inverse-logit(z) = exp(z) / (1 + exp(z))

  2. Use scipy minimize to minimize the sum of -[y_t*log(y_o) + (1-y_t)*log(1 - y_o)], summed over all data points. To do this you have to set up a function that takes (a_0, a_1, ...) as parameters, builds the prediction, and calculates the loss (a sketch is below).
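A minimal sketch of these two steps (variable names are illustrative, not from the original answer), assuming `X` holds the features and `y_t` the probability targets:

```python
import numpy as np
from scipy.optimize import minimize

def inverse_logit(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(a, X, y_t, eps=1e-12):
    # a[0] is the intercept a_0, a[1:] are the coefficients a_1, a_2, ...
    y_o = np.clip(inverse_logit(a[0] + X.dot(a[1:])), eps, 1 - eps)  # avoid log(0)
    return -np.sum(y_t * np.log(y_o) + (1 - y_t) * np.log(1 - y_o))

def fit(X, y_t):
    a0 = np.zeros(X.shape[1] + 1)                          # initial parameters
    res = minimize(cross_entropy, a0, args=(X, y_t), method="L-BFGS-B")
    return res.x                                           # fitted intercept and weights
```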

Stochastic Gradient Descent with Custom Loss

If you happen to be using a platform that has SGD regression with a custom loss then you can just use that, specifying a loss of -[y_t*log(y_o) + (1-y_t)*log(1 - y_o)].

One way to do this is just to fork scikit-learn and add log loss to the regression SGD solver.

Convert to Classification Problem

You can convert your problem to a classification problem by oversampling, as described by @jo9k. But note that even in this case you should not use standard X-validation because the data are not independent anymore. You will need to break up your data manually into train/test sets and oversample only after you have broken them apart.

Convert to SVM

(Edit: I did some testing and found that on my test sets sigmoid kernels were not behaving well. I think they require some special pre-processing to work as expected. An SVM with a sigmoid kernel is equivalent to a 2-layer tanh neural network, which should be amenable to a regression task where the training data outputs are probabilities. I might come back to this after further review.)

You should get similar results to logistic regression using an SVM with a sigmoid kernel. You can use scikit-learn's SVR function and specify the kernel as sigmoid. You may run into performance difficulties with 100,000s of data points across 1000 features.... which leads me to my final suggestion:

Convert to SVM using Approximated Kernels

This method will give results a bit further away from true logistic regression, but it is extremely performant. The process is the following:

  1. Use scikit-learn's RBFSampler to explicitly construct an approximate RBF-kernel feature map for your dataset.

  2. Process your data through that feature map and then use scikit-learn's SGDRegressor with an SVM-style (epsilon-insensitive) loss to realize a super-performant SVM on the transformed data.

The above is laid out with code here
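In case that link is unavailable, here is a minimal sketch of the same idea (my own illustration, with assumed hyperparameters), again with `X` as the features and `p` as the probability targets:

```python
from sklearn.kernel_approximation import RBFSampler
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import make_pipeline

model = make_pipeline(
    RBFSampler(gamma=1.0, n_components=300, random_state=0),   # approximate RBF feature map
    SGDRegressor(loss="epsilon_insensitive", max_iter=1000),   # linear SVR-style loss
)
# model.fit(X, p)
# preds = model.predict(X)   # clip to [0, 1] if strict probabilities are required
```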

David R
  • 994
  • 1
  • 11
  • 27
  • 1
    I realized after writing this up that there is a far simpler solution. If you want to do logistic regression using a standard learning platform, just create a neural network that has no hidden layers and specify a logit/sigmoid activation function on the output layer. – David R Feb 11 '18 at 18:33
-1

Instead of using predict in the scikit-learn library, use the predict_proba function.

refer here: http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression.predict_proba

usernamenotfound
  • 1,540
  • 2
  • 11
  • 18
  • That is for prediction. My question is regarding training. I want to fit / train with probabilities. BTW I didn't downvote. – san Dec 05 '17 at 22:23
  • @san my bad, jumped the gun with my reply. Why don't you try and use beta regression to solve this problem? – usernamenotfound Dec 05 '17 at 22:26
  • Could you point me to a scalable implementation that can be called from Python – san Dec 05 '17 at 22:34