I have implemented my own kNN algorithm on the iris dataset in Python. Now I would like to report the training and test error for different values of k. I have calculated the accuracy of my predictions, but I don't know how to get the training and test error from it. Any ideas?

Thank you in advance

EDIT: Here's the code

import pandas as pd
import math
import operator
from sklearn.model_selection import train_test_split


def euclideanDistance(instance1, instance2, length):
    distance = 0
    for x in range(length):
        distance += pow((instance1[x] - instance2[x]), 2)
    return math.sqrt(distance)


def getNeighbors(trainingSet, testInstance, k):
    distances = []
    length = len(testInstance) - 1
    for x in range(len(trainingSet)):
        dist = euclideanDistance(testInstance, trainingSet.iloc[x], length)
        distances.append((trainingSet.iloc[x], dist))
    distances.sort(key=operator.itemgetter(1))
    neighbors = []
    for x in range(k):
        neighbors.append(distances[x][0])
    return neighbors


def getResponse(neighbors):
    classVotes = {}
    for x in range(len(neighbors)):
        response = neighbors[x][-1]
        if response in classVotes:
            classVotes[response] += 1
        else:
            classVotes[response] = 1
    sortedVotes = sorted(classVotes.items(), key=operator.itemgetter(1), reverse=True)

    return sortedVotes[0][0]


def getAccuracy(testSet, predictions):
    correct = 0
    for x in range(len(testSet)):
        if testSet.iloc[x][-1] == predictions[x]:
            correct += 1
    return (correct / float(len(testSet))) * 100.0


def main():
    dataset = pd.read_csv('DataScience/iris.data.txt',
                          names=["Atr1", "Atr2", "Atr3", "Atr4", "Class"])

    x = dataset.drop(['Class'], axis=1)
    y = dataset.drop(["Atr1", "Atr2", "Atr3", "Atr4"], axis=1)

    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.5, random_state=65, stratify=y)

    # prepare data
    trainingSet = pd.concat([x_train, y_train], axis=1)
    testSet = pd.concat([x_test, y_test], axis=1)

    # generate predictions
    predictions = []
    k = 5
    for x in range(len(testSet)):
        neighbors = getNeighbors(trainingSet, testSet.iloc[x], k)
        result = getResponse(neighbors)
        predictions.append(result)
        print('> predicted=' + repr(result) + ', actual=' + repr(testSet.iloc[x][-1]))
    accuracy = getAccuracy(testSet, predictions)
    print('Accuracy: ' + repr(accuracy) + '%')

main()

user10411263

2 Answers

You can think of training and testing error as the flip side of your accuracy. For instance, if your test accuracy is 60%, your test error is 40%. Usually you can plot the accuracy against different values of k to get a feel for how your algorithm performs as k changes.

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier 
import matplotlib.pyplot as plt

# create a training and testing set (use your X and y)    
X_train,X_test, y_train, y_test= train_test_split(X,y,random_state=42, test_size=.3)
# create a set of k values and an empty list for training and testing accuracy scores
k_values=[1,2,3,4,5,6,7,8,9,10]
train_scores=[]
test_scores=[]
# instantiate the model 
k_nn=KNeighborsClassifier()
# create a for loop of models with different k's 

for k in k_values: 
  k_nn.n_neighbors=k 
  k_nn.fit(X_train,y_train)
  train_score=k_nn.score(X_train,y_train)
  test_score=k_nn.score(X_test,y_test)
  train_scores.append(train_score)
  test_scores.append(test_score)

plt.plot(k_values,train_scores, color='red',label='Training Accuracy')
plt.plot(k_values,test_scores, color='blue',label='Testing Accuracy')
plt.xlabel('K values')
plt.ylabel('Accuracy Score')
plt.title('Performance Under Varying K Values')
plt.legend()  # show the labels set in the plot calls above
plt.show()
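
To report error rather than accuracy, the two score lists can simply be flipped. The sketch below uses hypothetical placeholder numbers in place of the real train_scores and test_scores (they are not results from the iris data), and assumes the scores are fractions between 0 and 1, which is what k_nn.score returns. If your accuracy is a percentage, as in the question's getAccuracy, subtract from 100 instead of 1.

```python
# Hypothetical accuracy scores (fractions in [0, 1]) standing in for the
# train_scores and test_scores lists built in the loop above.
k_values = [1, 3, 5]
train_scores = [1.00, 0.97, 0.96]
test_scores = [0.93, 0.96, 0.95]

# Error is the complement of accuracy.
train_errors = [1.0 - s for s in train_scores]
test_errors = [1.0 - s for s in test_scores]

for k, tr, te in zip(k_values, train_errors, test_errors):
    print(f"k={k}: training error={tr:.2f}, test error={te:.2f}")
```

The resulting train_errors and test_errors lists can be passed to plt.plot in place of the accuracy lists.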
DZtron
  • I see, I think I get the concept. However, this is for a data science class where we're not allowed to use the KNN libraries. What does k_nn.score actually do? Right now I have a function that finds the k nearest neighbors and sorts them. Then I check how well my results match the real labels, and that's how I get the accuracy. Can I use this in any way to get the test and training error? – user10411263 Oct 08 '18 at 06:20
  • Can you post your code? I am not really sure what you mean. k_nn.score computes the accuracy of the particular KNeighborsClassifier(). – DZtron Oct 08 '18 at 22:49
  • I posted the code in the question, can I use that one (without involving the kNN library) to get the training and test error? – user10411263 Oct 10 '18 at 08:05
  • In the second-to-last line, pass in 'trainingSet' to get the training accuracy (the training error will be the flip side of that). Do the same for the testing error by passing 'testSet' (which you already did). – DZtron Oct 11 '18 at 02:42
  • Okay, so the test error is 100-accuracy (as I called it in my code), and for the training error I take train_accuracy = getAccuracy(trainingSet, predictions) and get the error as 100-train_accuracy? And would it be weird if the train error is practically the same for k=1 all the way to k=20 (around 60%)? And the test error gets larger (but is still around 5-10%). – user10411263 Oct 11 '18 at 06:23
  • Take a look at the graph at this link: https://www.google.com/search?q=knn+test+error+graph&client=firefox-b-1-ab&source=lnms&tbm=isch&sa=X&ved=0ahUKEwiIqaDh2_7dAhVECKwKHc_DDagQ_AUIDigB&biw=1680&bih=873#imgrc=E1iHL6P7VIliBM: The training error can dip down and then come back up, depending on which k you are looking at. – DZtron Oct 11 '18 at 15:40
  • Okay, thank you so much for your help, you're the best! – user10411263 Oct 12 '18 at 01:47

The training error and test error are simply the errors you get when making predictions on the training set and the test set, respectively.

All you need to do is generate predictions for your training set and for your test set, and measure the error of each.
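
As a rough illustration, here is a minimal sketch that uses no kNN library. The function names (euclidean_distance, predict, error_rate) and the toy data are made up for this example and are not taken from the question's code. The key point is that predictions are generated separately for each set being scored: the training error compares training labels with predictions made for the training rows, never with predictions made for the test rows.

```python
import math
from collections import Counter


def euclidean_distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def predict(train_rows, features, k):
    """Majority vote among the k training rows closest to `features`.

    Each training row is a (features, label) pair.
    """
    nearest = sorted(
        train_rows, key=lambda row: euclidean_distance(row[0], features)
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def error_rate(train_rows, eval_rows, k):
    """Percentage of rows in `eval_rows` that are misclassified."""
    wrong = sum(
        1 for features, label in eval_rows
        if predict(train_rows, features, k) != label
    )
    return 100.0 * wrong / len(eval_rows)


# Hypothetical toy data, only here to make the sketch runnable.
train = [((1.0, 1.0), "a"), ((1.2, 0.8), "a"), ((0.9, 1.3), "a"),
         ((3.0, 3.0), "b"), ((3.2, 2.9), "b"), ((2.8, 3.1), "b")]
test = [((1.1, 1.0), "a"), ((3.1, 3.0), "b"), ((2.1, 2.0), "a")]

results = {}
for k in (1, 3, 5):
    # Training error: score the training rows; test error: score the test rows.
    results[k] = (error_rate(train, train, k), error_rate(train, test, k))
    print(f"k={k}: training error={results[k][0]:.1f}%, "
          f"test error={results[k][1]:.1f}%")
```

With k=1 the training error is 0 whenever the training points are distinct, because every training point is its own nearest neighbor. That makes k=1 a handy sanity check for a hand-written implementation.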

henrikstroem