I am trying to learn KNN by working on the Breast Cancer dataset from the UCI repository. The dataset has 699 rows, with 9 continuous variables and 1 class variable.

I measured accuracy on a validation set. For K = 21 and K = 19, the accuracy is 95.7%.

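from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Fit KNN with K = 21 on the training split and score it on the validation split
neigh = KNeighborsClassifier(n_neighbors=21)
neigh.fit(X_train, y_train)
y_pred_val = neigh.predict(X_val)
print(accuracy_score(y_val, y_pred_val))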

But for K = 1 I get an accuracy of 97.85%, and for K = 3 an accuracy of 97.14%.
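Differences of one or two percentage points on a single validation split can come down to which rows happened to land in that split. A common way to compare values of K more robustly is k-fold cross-validation. Below is a minimal sketch of that idea in plain Python. The data is a synthetic two-class stand-in (two Gaussian blobs), not the UCI breast cancer data, and the helper names `knn_predict` and `cv_accuracy` are made up for illustration.

```python
import random
from collections import Counter

def knn_predict(train_X, train_y, x, k):
    """Majority vote among the k training points closest to x (squared Euclidean distance)."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(row, x)), label)
        for row, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

def cv_accuracy(X, y, k, n_folds=5):
    """Mean k-NN accuracy over n_folds contiguous folds (data assumed already shuffled)."""
    n = len(X)
    scores = []
    for fold in range(n_folds):
        start, stop = fold * n // n_folds, (fold + 1) * n // n_folds
        # Train on everything outside the current fold, test on the fold itself
        train_X = X[:start] + X[stop:]
        train_y = y[:start] + y[stop:]
        hits = sum(
            knn_predict(train_X, train_y, X[i], k) == y[i]
            for i in range(start, stop)
        )
        scores.append(hits / (stop - start))
    return sum(scores) / n_folds

# Synthetic stand-in data: two Gaussian blobs in 2-D, NOT the UCI breast cancer data
rng = random.Random(0)
data = [([rng.gauss(0, 1), rng.gauss(0, 1)], 0) for _ in range(100)]
data += [([rng.gauss(2, 1), rng.gauss(2, 1)], 1) for _ in range(100)]
rng.shuffle(data)
X = [features for features, _ in data]
y = [label for _, label in data]

# Compare several odd values of K by their mean cross-validated accuracy
results = {k: cv_accuracy(X, y, k) for k in (1, 3, 5, 11, 21)}
for k, acc in results.items():
    print("K=%d  mean CV accuracy=%.3f" % (k, acc))
best_k = max(results, key=results.get)
print("best K on this synthetic data:", best_k)
```

With scikit-learn, the same comparison could be done by passing a `KNeighborsClassifier` to `sklearn.model_selection.cross_val_score` for each candidate K.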

I read the following:

Choice of k is very critical – a small value of k means that noise will have a higher influence on the result. A large value makes it computationally expensive and kinda defeats the basic philosophy behind KNN (that points that are near might have similar densities or classes). A simple approach to select k is to set k = n^(1/2).
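To make the k = n^(1/2) rule concrete, here is a small sketch. The helper name `sqrt_rule_k` is hypothetical, and bumping an even result up to the next odd number (to avoid tied votes between two classes) is an added assumption that is not part of the quoted text. Note also that n = 699 is the size of the whole dataset; the training split that the rule would apply to is smaller.

```python
import math

def sqrt_rule_k(n_samples):
    """Round sqrt(n_samples) to an integer; if the result is even, add 1 to make it odd."""
    k = int(round(math.sqrt(n_samples)))
    if k % 2 == 0:
        k += 1  # assumption: prefer odd k so a two-class vote cannot tie
    return max(k, 1)

# sqrt(699) is about 26.44, which rounds to 26 and is then bumped to 27
print(sqrt_rule_k(699))
```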

Which value of K should I use for my model? Can you explain the logic behind the choice?

Thanks in advance!

Rahul Saxena
  • Accuracy alone is not a sufficient criterion. You also have to consider Recall. –  Dec 22 '16 at 08:49
  • Hi @YvesDaoust, Thanks for the suggestion. Will calculate Precision- recall and will update the post. – Rahul Saxena Dec 22 '16 at 08:51
  • Voting to close. This question is off-topic for Stack Overflow (it is not about programming), and should be moved to [Cross Validated](http://stats.stackexchange.com/help/on-topic). Or rather more likely, it has probably already been asked on CV and you should do a search before posting a new question. – juanpa.arrivillaga Dec 22 '16 at 08:58
  • But to be brief: there is no "correct" answer, that is, in all but the most simple cases, you will not know ahead of time which values for K will give you better performance (of course, higher K will always degrade computational performance). – juanpa.arrivillaga Dec 22 '16 at 09:02
  • Hi @juanpa.arrivillaga, I studied the topic but it's still not clear. I think Stack Overflow is about helping programmers. – Rahul Saxena Dec 22 '16 at 09:03
  • @RahulSaxena Yes, well, it is not a simple topic with neat answers. And yes, Stack Overflow is for programming questions. Your question has nothing to do with programming, and is solely related to statistics/machine-learning. It is better suited for the Stack Exchange site specific to those topics. – juanpa.arrivillaga Dec 22 '16 at 09:07
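
The precision and recall raised in the comments can be computed from the same validation predictions used for accuracy. Below is a minimal sketch in plain Python. The label lists are made-up examples, not results from this dataset, and the helper name `precision_recall` is hypothetical. Treating the malignant class as the positive class is an assumption; in the original UCI file the class column uses 2 for benign and 4 for malignant.

```python
def precision_recall(y_true, y_pred, positive):
    """Return (precision, recall) for the class labelled `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical labels for illustration only (2 = benign, 4 = malignant)
y_true = [4, 4, 4, 2, 2, 2, 2, 4]
y_pred = [4, 2, 4, 2, 2, 4, 2, 4]

precision, recall = precision_recall(y_true, y_pred, positive=4)
print("precision=%.2f recall=%.2f" % (precision, recall))
```

With scikit-learn, `sklearn.metrics.precision_score` and `sklearn.metrics.recall_score` compute the same quantities, with the positive class chosen through their `pos_label` parameter.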
