
I'm using scikit-learn's KNN with Levenshtein distance to do some work on strings, much like the example at the bottom of this page: http://scikit-learn.org/stable/faq.html . The difference is that my data is split into training and test sets and lives in a DataFrame.

The split is listed here:

train_feature, test_feature, train_class, test_class = train_test_split(
    features, classes, test_size=TEST_SET_SIZE, train_size=TRAINING_SET_SIZE,
    random_state=42)

I have the following:

>>> model = KNeighborsClassifier(metric='pyfunc',func=machine_learning.custom_distance)
>>> model.fit(train_feature['id'], train_class.as_matrix(['gender']))
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='pyfunc',
       metric_params={'func': <function custom_distance at 0x7fd0236267b8>},
       n_neighbors=5, p=2, weights='uniform')

Here train_feature has a single column, id ([24000 rows x 1 columns]), and train_class (Name: gender, dtype: object) is a Series holding "gender", which is 'M' or 'F'. The id corresponds to a key in a dict elsewhere.

The custom distance function is:

def custom_distance(x, y):
    i, j = int(x[0]), int(y[0])
    return damerau_levenshtein_distance(lookup_dict[i], lookup_dict[j])

When I try to get the accuracy of the model:

 accuracy = model.score(test_feature, test_class)

I receive this error:

 ValueError: Expected n_neighbors <= 1. Got 5

I'm honestly really confused. I've checked the length of each of my datasets and they are fine. Why would it be telling me I only have one data point to plot from? Any help would be greatly appreciated.

user2757902
  • As a slight reframing of your last point: The error is telling you you have 5 neighbors, but the problem is that it is expecting one – Ryan May 02 '15 at 07:00
  • Maybe try working up from a simple example which mimics your current set up to replicate the problem and find out where the issue is. Maybe also try other ways of generating training/test sets and scoring the accuracy of the model besides scikit's built-in functions – Ryan May 02 '15 at 07:07
  • I received the same error when using the example and NearestNeighbor. – user2757902 May 02 '15 at 07:51

4 Answers


The classifier thinks your dataset has only a single entry. It probably interprets the vector of ids as a row vector (one sample with many features) instead of a column vector (many samples with one feature).

Try

model.fit(train_feature.as_matrix(['id']), train_class.as_matrix(['gender']))

and see if it helps.
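
For illustration, a minimal sketch of the shape difference (the DataFrame here is a hypothetical stand-in for your train_feature; as_matrix(['id']) and .values.reshape(-1, 1) both produce the (N, 1) column vector that fit expects):

import numpy as np
import pandas as pd

# Hypothetical stand-in for the question's train_feature DataFrame
train_feature = pd.DataFrame({'id': np.arange(24000)})

X_1d = train_feature['id'].values                   # shape (24000,): older scikit-learn
                                                    # versions read this as a single sample
X_col = train_feature['id'].values.reshape(-1, 1)   # shape (24000, 1): 24000 samples, 1 feature

print(X_1d.shape, X_col.shape)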

cfh
  • I got this: ValueError: Found arrays with inconsistent numbers of samples: [ 1 245386]. But I checked the length of each training set and both are the same as their corresponding class set – user2757902 May 03 '15 at 04:09
  • Also, when I try to pass in the datasets directly and see what is going on inside the function, I have the same issue as this question: http://stackoverflow.com/questions/23420605/clustering-using-a-custom-distance-metric-for-lat-long-pairs – user2757902 May 03 '15 at 05:36
  • @user2757902: Check that not only the lengths, but also the shapes of the arrays match. I suspect that one is a row vector and the other a column vector. Both should be column vectors (shape (N,1)) when you use them as input to `fit`. – cfh May 03 '15 at 09:47
  • I tried this with some simple arrays and ran into the same issue. Here is the code: http://pastie.org/private/9afcezwmgkla1xkywh9qzq and a screenshot of the parameters of the distance function compared to the input array: http://i.imgur.com/eXy4vaA.png – user2757902 May 03 '15 at 21:35
  • @user2757902: The code you posted on pastie works for me when I replace the `custom_distance` function with something that doesn't have extra dependencies. (I claimed something wrong before; the Y array actually has to be a 1D array, but you already fixed that with the `ravel()`) – cfh May 03 '15 at 21:47

I faced the same error. I have a huge DB that I draw the train and test data from, but for code-testing purposes I use a much smaller one (~0.5% of the original). In the training procedure, I test a number of different neighbor counts, for example:

for neighbor in range(5, 20): ...

The ValueError was raised for n_neighbors=19, and only when I used the small DB. The reason is that the small dataset simply did not contain enough samples to find 19 neighbors. When I tested with the full DB, no such exception was raised.

Setting algorithm='brute' might appear to work, but it will not solve the underlying problem. What you should do is check the number of observations, both training and testing, and cap the value of n_neighbors accordingly.
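
As a rough sketch of that check (the variable names are placeholders borrowed from the question, and 19 mirrors the upper bound in the loop above):

from sklearn.neighbors import KNeighborsClassifier

# Never request more neighbors than there are training samples
max_neighbors = min(19, len(train_feature))

for n in range(5, max_neighbors + 1):
    model = KNeighborsClassifier(n_neighbors=n)
    model.fit(train_feature, train_class)
    print(n, model.score(test_feature, test_class))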

viajero cósmico

Just set the n_neighbors value:

knn = KNeighborsClassifier(n_neighbors=1)
abhimanyu

I figured it out. I needed to set the algorithm to brute force and pass the distance function directly as the metric:

model = KNeighborsClassifier(metric=machine_learning.custom_distance, algorithm='brute', n_neighbors=50)
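
For completeness, a sketch of the whole flow with that fix (it assumes machine_learning.custom_distance, lookup_dict, and the DataFrames/Series from the question; X still has to be an (N, 1) column vector of ids):

from sklearn.neighbors import KNeighborsClassifier

# Brute force computes pairwise distances directly, so a Python callable works as the metric;
# each row handed to custom_distance is a one-element array holding an id, which the
# function resolves to a string via lookup_dict before computing the edit distance.
model = KNeighborsClassifier(metric=machine_learning.custom_distance,
                             algorithm='brute', n_neighbors=50)
model.fit(train_feature[['id']].values, train_class.values)
accuracy = model.score(test_feature[['id']].values, test_class.values)
print(accuracy)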
user2757902