10

Can anyone tell me what the problem with my code is? Why can I predict probabilities for the iris dataset using LogisticRegression, while KNeighborsClassifier gives me only 0 or 1 instead of a result like the one LogisticRegression yields?

from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold
from sklearn import metrics

iris = load_iris()
X = iris.data
y = iris.target

skf = StratifiedKFold(n_splits=10)
for train_index, test_index in skf.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

from sklearn.linear_model import LogisticRegression
ln = LogisticRegression()
ln.fit(X_train,y_train)

ln.predict_proba(X_test)[:,1]

array([ 0.18075722, 0.08906078, 0.14693156, 0.10467766, 0.14823032, 0.70361962, 0.65733216, 0.77864636, 0.67203114, 0.68655163, 0.25219798, 0.3863194 , 0.30735105, 0.13963637, 0.28017798])

from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=5, algorithm='ball_tree', metric='euclidean')
knn.fit(X_train, y_train)

knn.predict_proba(X_test)[0:10,1]

array([ 0., 0., 0., 0., 0., 1., 1., 1., 1., 1.])

Kasra Babaei
  • 300
  • 1
  • 3
  • 12

2 Answers

13

Because KNN has a very limited concept of probability. Its estimate is simply the fraction of votes among the nearest neighbours. Increase the number of neighbours to 15 or 100, or query a point near the decision boundary, and you will see more diverse results. Currently each of your test points simply has all 5 nearest neighbours with the same label (thus probability 0 or 1).
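To see the vote-fraction behaviour directly, here is a minimal self-contained sketch (using a plain train/test split rather than the OP's cross-validation setup):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0)

# With k=5, most iris test points have all 5 nearest neighbours from a single
# class, so the vote fraction is 0 or 1. With a larger k the neighbourhoods
# mix classes and the vote fractions become genuinely fractional.
knn = KNeighborsClassifier(n_neighbors=50)
knn.fit(X_train, y_train)
proba = knn.predict_proba(X_test)  # every entry is a multiple of 1/50
```

Each row of `proba` sums to 1, and with `n_neighbors=50` many entries fall strictly between 0 and 1 because the versicolor and virginica neighbourhoods overlap.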

lejlot
  • 64,777
  • 8
  • 131
  • 164
  • But then my accuracy decreases because I'll be far from the optimal K. How come in Weka, with the same K, we can get a much smoother ROC curve, while here (scikit) the ROC is very sharp? – Kasra Babaei May 07 '16 at 13:46
  • KNN is a heuristic and has a lot of parameters. It is very probable that your results will differ. You have to look up the default values of the metrics and algorithms used. And maybe even the ROC-curve evaluation is done differently! There is also randomness involved (in KNN)! – sascha May 10 '16 at 17:03
  • 1
    Probabilities output would be more precise if use of the option "weighted = distances" – agenis Apr 09 '19 at 16:21
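Following up on that comment, a short self-contained sketch of the `weights='distance'` option (same illustrative train/test split as an assumption, not the OP's folds):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0)

# weights='distance' scales each neighbour's vote by 1/distance, so mixed
# neighbourhoods produce graded probabilities instead of multiples of 1/k.
# (A unanimous 5-neighbour vote still yields exactly 0 or 1.)
knn = KNeighborsClassifier(n_neighbors=5, weights='distance')
knn.fit(X_train, y_train)
proba = knn.predict_proba(X_test)
```

Note that distance weighting only changes the estimates for test points whose neighbourhoods mix classes; it does not by itself fix the 0/1 outputs for points deep inside a class region.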
-1

Here I have a KNN model, model_knn, built with sklearn:

import numpy as np

result = {}
model_classes = model_knn.classes_
# word_average: the feature matrix of the samples to score
predicted = model_knn.predict(word_average)
score = model_knn.predict_proba(word_average)
# find the predict_proba column that corresponds to the predicted class
index = np.where(model_classes == predicted[0])[0][0]
result["predicted"] = predicted[0]
result["score"] = score[0][index]
Abhijith M
  • 743
  • 5
  • 5