
I have the following code to test some of the most popular ML algorithms from the sklearn Python library:

import numpy as np
from sklearn                        import metrics, svm
from sklearn.linear_model           import LinearRegression
from sklearn.linear_model           import LogisticRegression
from sklearn.tree                   import DecisionTreeClassifier
from sklearn.neighbors              import KNeighborsClassifier
from sklearn.discriminant_analysis  import LinearDiscriminantAnalysis
from sklearn.naive_bayes            import GaussianNB
from sklearn.svm                    import SVC

trainingData    = np.array([ [2.3, 4.3, 2.5],  [1.3, 5.2, 5.2],  [3.3, 2.9, 0.8],  [3.1, 4.3, 4.0]  ])
trainingScores  = np.array( [3.4, 7.5, 4.5, 1.6] )
predictionData  = np.array([ [2.5, 2.4, 2.7],  [2.7, 3.2, 1.2] ])

clf = LinearRegression()
clf.fit(trainingData, trainingScores)
print("LinearRegression")
print(clf.predict(predictionData))

clf = svm.SVR()
clf.fit(trainingData, trainingScores)
print("SVR")
print(clf.predict(predictionData))

clf = LogisticRegression()
clf.fit(trainingData, trainingScores)
print("LogisticRegression")
print(clf.predict(predictionData))

clf = DecisionTreeClassifier()
clf.fit(trainingData, trainingScores)
print("DecisionTreeClassifier")
print(clf.predict(predictionData))

clf = KNeighborsClassifier()
clf.fit(trainingData, trainingScores)
print("KNeighborsClassifier")
print(clf.predict(predictionData))

clf = LinearDiscriminantAnalysis()
clf.fit(trainingData, trainingScores)
print("LinearDiscriminantAnalysis")
print(clf.predict(predictionData))

clf = GaussianNB()
clf.fit(trainingData, trainingScores)
print("GaussianNB")
print(clf.predict(predictionData))

clf = SVC()
clf.fit(trainingData, trainingScores)
print("SVC")
print(clf.predict(predictionData))

The first two work OK, but I get the following error on the LogisticRegression call:

root@ubupc1:/home/ouhma# python stack.py 
LinearRegression
[ 15.72023529   6.46666667]
SVR
[ 3.95570063  4.23426243]
Traceback (most recent call last):
  File "stack.py", line 28, in <module>
    clf.fit(trainingData, trainingScores)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/linear_model/logistic.py", line 1174, in fit
    check_classification_targets(y)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/utils/multiclass.py", line 172, in check_classification_targets
    raise ValueError("Unknown label type: %r" % y_type)
ValueError: Unknown label type: 'continuous'

The input data is the same as in the previous calls, so what is going on here?

And by the way, why is there such a huge difference between the first predictions of the LinearRegression() and SVR() algorithms (15.72 vs. 3.95)?

mllamazares

3 Answers


You are passing floats to a classifier, which expects categorical values as the target vector. If you convert the scores to int they will be accepted as input (although it is questionable whether that's the right way to do it).

It would be better to convert your training scores using scikit-learn's LabelEncoder.

The same is true for your DecisionTreeClassifier and KNeighborsClassifier.

from sklearn import preprocessing
from sklearn import utils

lab_enc = preprocessing.LabelEncoder()
encoded = lab_enc.fit_transform(trainingScores)
# encoded is now array([1, 3, 2, 0], dtype=int64)

print(utils.multiclass.type_of_target(trainingScores))
# continuous

print(utils.multiclass.type_of_target(trainingScores.astype('int')))
# multiclass

print(utils.multiclass.type_of_target(encoded))
# multiclass
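
If you go the encoding route, here is a rough sketch of the full round trip, reusing lab_enc and encoded from above (the DecisionTreeClassifier is just an illustrative choice):

from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier()
clf.fit(trainingData, encoded)          # targets are now class indices, so this fits
pred = clf.predict(predictionData)      # predicted class indices
print(lab_enc.inverse_transform(pred))  # map them back to the original float scores
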
Maximilian Peters
  • Thank you! So I've got to convert `2.3` to `23` and so on, right? Is there an elegant way to make this conversion using numpy or pandas? – mllamazares Jan 29 '17 at 22:02
  • But in this example the input data has float numbers and it uses the LogisticRegression function: http://machinelearningmastery.com/compare-machine-learning-algorithms-python-scikit-learn/ ... and it works OK. Why? – mllamazares Jan 29 '17 at 22:06
  • The input can be floats, but the output needs to be categorical, i.e. int. Column 8 is only 0 or 1 in that example. Usually it is the other way round: you have categorical labels, e.g. ['red', 'big', 'sick'], and you need to convert them to numerical values (a sketch of that conversion follows these comments). Try http://scikit-learn.org/stable/modules/preprocessing.html#encoding-categorical-features or http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html – Maximilian Peters Jan 29 '17 at 22:12
  • Are `2.3` and `23` the same? – Ajay Kulkarni Sep 25 '18 at 11:32
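
A quick sketch of that usual direction, turning string labels into integer codes with LabelEncoder (the labels here are invented for illustration):

from sklearn import preprocessing

labels = ['red', 'big', 'sick']          # hypothetical categorical labels
enc = preprocessing.LabelEncoder()
print(enc.fit_transform(labels))         # [1 0 2] -- classes are sorted: big, red, sick
print(enc.inverse_transform([0, 1, 2]))  # ['big' 'red' 'sick']
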

LogisticRegression is not for regression but for classification!

The y variable must be a classification class (for example 0 or 1), not a continuous variable; that would be a regression problem.
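
A minimal sketch of what that means in practice, using the question's data with made-up 0/1 labels in place of the continuous scores (the labels are purely illustrative):

import numpy as np
from sklearn.linear_model import LogisticRegression

trainingData   = np.array([ [2.3, 4.3, 2.5],  [1.3, 5.2, 5.2],  [3.3, 2.9, 0.8],  [3.1, 4.3, 4.0] ])
trainingLabels = np.array([0, 1, 1, 0])   # hypothetical class labels, not the original scores
predictionData = np.array([ [2.5, 2.4, 2.7],  [2.7, 3.2, 1.2] ])

clf = LogisticRegression()
clf.fit(trainingData, trainingLabels)     # fits now, because the targets are discrete classes
print(clf.predict(predictionData))        # predicted classes
print(clf.predict_proba(predictionData))  # per-class probabilities
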

Tomas G.
  • I hope this is not spam, but I ended up here many times and the error prompt is not very intuitive. – Tomas G. Nov 25 '19 at 17:39
  • This should be the correct answer. Indeed, LogisticRegression is a classifier, hence the error. – navneeth Sep 09 '20 at 02:13
  • That's generally true, but sometimes you want to benefit from Sigmoid mapping the output to [0, 1] during optimization. If you use least squares on a given output range while training, your model will be penalized for extrapolating: e.g., if it predicts `1.2` for some sample, it would be penalized the same way as for predicting `0.8`. This constraint might distract the optimization from the objective. Sigmoid allows you to extrapolate with no penalty, so the model could say either `10` or `100` and Sigmoid will anyway turn it to almost 1. So there is a point to continuous logistic regression. – SomethingSomething Jul 06 '22 at 12:45
  • Plus, it's a linear transformation (scale + bias) from any given range to [0, 1] and vice versa, so you can always "normalize" your labels to [0, 1] while training and remap them to the given range at inference; see the sketch after this comment. Note that if you use an iterative optimization of least squares with your own loss function (i.e., rather than the pseudo-inverse algorithm), you may be able to clip the model output prior to computing the cost and thus address the extrapolation penalty without logistic regression. – SomethingSomething Jul 06 '22 at 12:51
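
A minimal sketch of that remapping idea, assuming the question's trainingData, trainingScores and predictionData (any regressor could stand in for LinearRegression):

from sklearn.linear_model import LinearRegression

lo, hi = trainingScores.min(), trainingScores.max()
y01 = (trainingScores - lo) / (hi - lo)   # rescale the targets into [0, 1]

model = LinearRegression().fit(trainingData, y01)
pred01 = model.predict(predictionData)
print(pred01 * (hi - lo) + lo)            # remap predictions back to the original range
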

I struggled with the same issue when trying to feed floats to the classifiers. I wanted to keep the floats, not integers, for accuracy. Try regressor algorithms instead. For example:

import numpy as np
from sklearn import linear_model
from sklearn import svm

regressors = [
    svm.SVR(),
    linear_model.SGDRegressor(),
    linear_model.BayesianRidge(),
    linear_model.LassoLars(),
    linear_model.ARDRegression(),
    linear_model.PassiveAggressiveRegressor(),
    linear_model.TheilSenRegressor(),
    linear_model.LinearRegression()]

trainingData    = np.array([ [2.3, 4.3, 2.5],  [1.3, 5.2, 5.2],  [3.3, 2.9, 0.8],  [3.1, 4.3, 4.0]  ])
trainingScores  = np.array( [3.4, 7.5, 4.5, 1.6] )
predictionData  = np.array([ [2.5, 2.4, 2.7],  [2.7, 3.2, 1.2] ])

for model in regressors:
    print(model)
    model.fit(trainingData, trainingScores)
    print(model.predict(predictionData), '\n')
Sam Perry