11

I have 7 classes that need to be classified and I have 10 features. Is there an optimal value of k that I should use in this case, or do I have to run KNN for values of k between 1 and 10 (around 10) and determine the best value with the help of the algorithm itself?

Gilad Green
user574183
  • Might want to look at [this article](http://www.kyb.tuebingen.mpg.de/fileadmin/user_upload/files/publications/Final_version_maier_5681[0].pdf) – NominSim Jul 19 '12 at 20:44
  • Oh no, unfortunately I am not knowledgeable enough to read and understand that paper. Could someone please help me out? – user574183 Jul 19 '12 at 20:48

5 Answers

17

In addition to the article I posted in the comments, there is this one as well, which suggests:

> Choice of k is very critical – a small value of k means that noise will have a higher influence on the result. A large value makes it computationally expensive and somewhat defeats the basic philosophy behind KNN (that points that are near are likely to have similar densities or classes). A simple approach to selecting k is to set k = n^(1/2).

It's going to depend a lot on your individual case; sometimes it is best to run through each possible value of k and decide for yourself.
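As a concrete starting point, the k = n^(1/2) heuristic from the quote above can be computed directly (a sketch; the training-set size of 1000 is made up for illustration, and the nudge to an odd k is a common extra precaution against voting ties):

```python
import math

n_train = 1000                 # hypothetical training-set size
k = round(math.sqrt(n_train))  # heuristic starting point: k = n^(1/2)
if k % 2 == 0:
    k += 1                     # make k odd to avoid voting ties
print(k)  # → 33
```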

NominSim
12

An important thing to note about the k-NN algorithm is that neither the number of features nor the number of classes plays a part in determining the value of k. k-NN is an instance-based classifier that labels test data using a distance metric: a test sample is classified as Class 1 if more Class 1 training samples lie close to it than training samples of any other class. For example, if k = 5, the 5 closest training samples are selected according to the distance metric, and a vote is taken over their class labels. If 3 of those samples belong to Class 1 and 2 belong to Class 5, the test sample is classified as Class 1. So the value of k indicates how many training samples are used to classify each test sample.
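The voting described above can be sketched in a few lines (a toy illustration, not a production implementation; the data and labels are made up to mirror the 3-vs-2 example):

```python
from collections import Counter
import math

def knn_predict(train_X, train_y, test_point, k):
    """Classify test_point by majority vote among its k nearest training samples."""
    # Euclidean distance from test_point to every training sample
    dists = sorted((math.dist(x, test_point), label)
                   for x, label in zip(train_X, train_y))
    # Majority vote over the k closest labels
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

# Toy data: 3 samples of Class 1, 2 samples of Class 5
train_X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6)]
train_y = [1, 1, 1, 5, 5]
print(knn_predict(train_X, train_y, (1, 1), k=5))  # → 1 (3 votes vs 2)
```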

Coming to your question: k-NN is non-parametric, and a general rule of thumb for choosing k is k = sqrt(N)/2, where N is the number of samples in your training dataset. Another tip is to keep the value of k odd, so that there are no ties when voting for a class. If ties still occur frequently, that points to training data that is highly correlated between classes, and a simple classification algorithm such as k-NN will give poor classification performance.
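That rule of thumb in code (a sketch; N = 400 is an invented training-set size for illustration):

```python
import math

N = 400                              # hypothetical number of training samples
k = max(1, round(math.sqrt(N) / 2))  # sqrt(N)/2 heuristic from above
if k % 2 == 0:
    k += 1                           # keep k odd, per the tip above
print(k)  # → 11
```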

Charan P
5

In KNN, finding the value of k is not easy. A small value of k means that noise will have a higher influence on the result, and a large value makes it computationally expensive.

Data scientists usually choose:

  1. An odd number if the number of classes is 2.

  2. Another simple approach: set k = sqrt(n), where n is the number of data points in the training data.

desertnaut
Ashok Lathwal
  • The computational expense of a large `k` is not normally the most important issue. Large `k` will over-smooth, ignoring local structure. – Epimetheus May 24 '21 at 12:24
3

You may want to try this approach: run through different k values and visualize the error rate to help your decision making. I have used this quite a number of times and it gave me the result I wanted:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier

error_rate = []

# Fit a model for each candidate k and record its test error
for i in range(1, 50):
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(X_train, y_train)
    pred = knn.predict(X_test)
    error_rate.append(np.mean(pred != y_test))

plt.figure(figsize=(15, 10))
plt.plot(range(1, 50), error_rate, marker='o', markersize=9)
plt.xlabel('k')
plt.ylabel('error rate')
plt.show()
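Once the loop has filled `error_rate`, the best k is simply the index of the smallest error plus one (since the loop starts at k = 1). A sketch with made-up error values standing in for the loop's output:

```python
import numpy as np

# Hypothetical error rates, one entry per k = 1, 2, 3, ... (truncated for illustration)
error_rate = [0.20, 0.15, 0.12, 0.14, 0.13]
best_k = int(np.argmin(error_rate)) + 1  # +1 because the range starts at 1
print(best_k)  # → 3
```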
1

There are no pre-defined statistical methods to find the most favourable value of K. Choosing a very small value of K leads to unstable decision boundaries. K can be selected as k = sqrt(n), where n is the number of data points in the training data; an odd number is preferred as the K value.

Most of the time, the approach below is followed in industry:

  1. Initialize a random K value and start computing.

  2. Derive a plot between error rate and K over a defined range, then choose the K value with the minimum error rate.

  3. Derive a plot between accuracy and K over a defined range, then choose the K value with the maximum accuracy.

  4. Try to find a trade-off value of K between the error curve and the accuracy curve.
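This scan over a range of K values can also be automated with cross-validation. A sketch using scikit-learn's GridSearchCV, with the built-in iris data as a stand-in for your own dataset (odd K values only, per the tip above):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)  # stand-in dataset for illustration

# 5-fold cross-validation over odd K from 1 to 29
grid = GridSearchCV(KNeighborsClassifier(),
                    param_grid={'n_neighbors': range(1, 30, 2)},
                    cv=5)
grid.fit(X, y)
print(grid.best_params_['n_neighbors'])
```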