KNN prediction scoring 100% on y_test

Question

I'm trying to implement K-nearest neighbors on the Iris dataset, but after making the predictions, yhat matches y_test 100% with no errors. There must be something wrong, and I have no idea what it is...

I created a column named class_id, where I mapped the species names to numbers:

setosa = 1.0
versicolor = 2.0
virginica = 3.0

That column is of type float.

Getting X and Y


    x = df[['sepal length', 'sepal width', 'petal length', 'petal width']].values

type(x) shows numpy.ndarray


    y = df['class_id'].values

type(y) shows numpy.ndarray

Normalizing data


    x = preprocessing.StandardScaler().fit(x).transform(x.astype(float))

Creating train and test


    x_train, x_test, y_train, y_test = train_test_split(x,y,test_size = 0.2, random_state = 42)

Checking best K value:


    Ks = 12
    for k in range(1, Ks):
        # Fit on the training split, then score on the held-out test split
        neigh = KNeighborsClassifier(n_neighbors=k).fit(x_train, y_train)
        yhat = neigh.predict(x_test)
        score = metrics.accuracy_score(y_test, yhat)
        print('K: ', k, ' score: ', score, '\n')

Result:

K: 1 score: 0.9666666666666667

K: 2 score: 1.0

K: 3 score: 1.0

K: 4 score: 1.0

K: 5 score: 1.0

K: 6 score: 1.0

K: 7 score: 1.0

K: 8 score: 1.0

K: 9 score: 1.0

K: 10 score: 1.0

K: 11 score: 1.0

Printing y_test and yhat WITH K = 5


    print(yhat)
    print(y_test)

Result:

yhat: [2. 1. 3. 2. 2. 1. 2. 3. 2. 2. 3. 1. 1. 1. 1. 2. 3. 2. 2. 3. 1. 3. 1. 3. 3. 3. 3. 3. 1. 1.]

y_test: [2. 1. 3. 2. 2. 1. 2. 3. 2. 2. 3. 1. 1. 1. 1. 2. 3. 2. 2. 3. 1. 3. 1. 3. 3. 3. 3. 3. 1. 1.]

They shouldn't all be 100% correct; there must be something wrong.

You are using the `iris` dataset. It is a well-cleaned, model dataset. The features correlate strongly with the class, so the `kNN` model fits the data really well. To test this, you can reduce the size of the training set, which should result in a drop in accuracy. – skillsmuggler, May 31 '19 at 09:12
So, basically, there's nothing wrong with the model reaching 100% accuracy? Hmm, I'll try that, you're probably right. – Ruben Acevedo, May 31 '19 at 12:25

Freddy Daniel · Answer 1 · 2019-05-30T02:56:40.130

0

Try building a confusion matrix. Run every example in your test data through the model, and check the specificity, sensitivity, accuracy and precision metrics.

where:

TN = True Negative
FN = False Negative
FP = False Positive
TP = True Positive
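As a rough sketch of how those four counts turn into the metrics mentioned above, assuming the standard definitions (sensitivity = TP / (TP + FN), specificity = TN / (TN + FP), precision = TP / (TP + FP), accuracy = (TP + TN) / total) and a one-vs-rest view of each class. The helper names `ratio` and `one_vs_rest_metrics` are made up for illustration; the labels are copied from the output printed in the question.

```python
# True labels copied from the question's printed y_test.
y_test = [2, 1, 3, 2, 2, 1, 2, 3, 2, 2, 3, 1, 1, 1, 1,
          2, 3, 2, 2, 3, 1, 3, 1, 3, 3, 3, 3, 3, 1, 1]
# The question shows yhat identical to y_test, so a copy is used here.
yhat = list(y_test)


def ratio(num, den):
    """Divide safely; return NaN when the denominator is zero."""
    return num / den if den else float("nan")


def one_vs_rest_metrics(y_true, y_pred, positive):
    """Treat `positive` as the positive class and every other class as negative."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "precision": ratio(tp, tp + fp),
        "accuracy": ratio(tp + tn, len(pairs)),
    }


# One set of metrics per class (1 = setosa, 2 = versicolor, 3 = virginica).
per_class = {c: one_vs_rest_metrics(y_test, yhat, c) for c in (1, 2, 3)}
for c, m in per_class.items():
    print(c, m)
```

Since the predictions equal the true labels, every metric comes out as 1.0 for all three classes, which matches what the asker observed.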

Here you can read about the difference between specificity and sensitivity: https://dzone.com/articles/ml-metrics-sensitivity-vs-specificity-difference

You can build a confusion matrix in Python with sklearn, using `sklearn.metrics.confusion_matrix`.
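A minimal sketch of what that could look like, assuming the float labels 1.0, 2.0 and 3.0 from the question and reusing the arrays it printed. This is only one way to do it, not necessarily the example originally linked here.

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

# True labels copied from the question's printed y_test.
y_test = np.array([2, 1, 3, 2, 2, 1, 2, 3, 2, 2, 3, 1, 1, 1, 1,
                   2, 3, 2, 2, 3, 1, 3, 1, 3, 3, 3, 3, 3, 1, 1], dtype=float)
# The question shows yhat identical to y_test, so a copy is used here.
yhat = y_test.copy()

labels = [1.0, 2.0, 3.0]  # setosa, versicolor, virginica (mapping from the question)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, yhat, labels=labels)
print(cm)

# Per-class precision and recall (recall is the same thing as sensitivity).
print(classification_report(
    y_test, yhat, labels=labels,
    target_names=["setosa", "versicolor", "virginica"],
))
```

All counts land on the diagonal (10, 9 and 11), so the matrix confirms the 100% accuracy rather than revealing a hidden error.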

Also try to make a ROC Curve (optional) https://en.wikipedia.org/wiki/Receiver_operating_characteristic

edited May 30 '19 at 02:56

answered May 30 '19 at 02:47

The thing is that even with a confusion matrix it shows a 100% accuracy rate, and I'm not sure how or why... – Ruben Acevedo May 30 '19 at 21:45
Please read this: https://www.researchgate.net/post/Multiclass_Confusion_Matrix_Explanation – Freddy Daniel May 31 '19 at 00:01

score 0 · Accepted Answer · answered Jun 01 '19 at 00:16

I found the answer thanks to the explanation from the user skillsmuggler:

You are using the iris dataset. It is a well-cleaned, model dataset. The features correlate strongly with the class, so the kNN model fits the data really well. To test this, you can reduce the size of the training set, which should result in a drop in accuracy.

The prediction model was correct.
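For anyone who wants to try the suggested check, here is a rough sketch that shrinks the training set step by step. It assumes sklearn's bundled copy of Iris (`load_iris`) in place of my DataFrame, and it wraps the scaler and classifier in a pipeline so the scaler is fitted on the training rows only, which differs slightly from the code in the question. The `test_size` values are arbitrary picks.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: sklearn's built-in Iris data stands in for the asker's DataFrame.
X, y = load_iris(return_X_y=True)

scores = {}
for test_size in (0.2, 0.5, 0.8, 0.9):
    # A larger test_size leaves fewer rows for training.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=42, stratify=y
    )
    # Scaling happens inside the pipeline, so it only sees the training rows.
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
    model.fit(X_train, y_train)
    scores[test_size] = model.score(X_test, y_test)
    print(f"test_size={test_size:.1f}  train rows={len(X_train):3d}  "
          f"accuracy={scores[test_size]:.3f}")
```

With fewer training rows the accuracy will typically fall below 1.0, which suggests the perfect score on the 80/20 split comes from the dataset being easy, not from a bug.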
