
I'm using scikit-learn's KNN with Levenshtein distance to do some work on strings, much like the example at the bottom of this page: http://scikit-learn.org/stable/faq.html . The difference is that my data is split into training and test sets and lives in a DataFrame.

The split is listed here:

train_feature, test_feature, train_class, test_class = train_test_split(
    features, classes, test_size=TEST_SET_SIZE, train_size=TRAINING_SET_SIZE,
    random_state=42)

I have the following:

>>> model = KNeighborsClassifier(metric='pyfunc',func=machine_learning.custom_distance)
>>> model.fit(train_feature['id'], train_class.as_matrix(['gender']))
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='pyfunc',
       metric_params={'func': <function custom_distance at 0x7fd0236267b8>},
       n_neighbors=5, p=2, weights='uniform')

Here train_feature has a single column, id ([24000 rows x 1 columns]), and train_class (Name: gender, dtype: object) is a Series holding "gender", which is 'M' or 'F'. The id corresponds to a key in a dict elsewhere.

The custom distance function is:

def custom_distance(x, y):
    i, j = int(x[0]), int(y[0])
    return damerau_levenshtein_distance(lookup_dict[i], lookup_dict[j])

When I try to get the accuracy of the model:

 accuracy = model.score(test_feature, test_class)

I receive this error:

 ValueError: Expected n_neighbors <= 1. Got 5

I'm honestly really confused. I've checked the length of each of my datasets and they are fine. Why would it be telling me I only have one data point to plot from? Any help would be greatly appreciated.

user2757902
  • As a slight reframing of your last point: The error is telling you you have 5 neighbors, but the problem is that it is expecting one – Ryan May 02 '15 at 07:00
  • Maybe try working up from a simple example which mimics your current set up to replicate the problem and find out where the issue is. Maybe also try other ways of generating training/test sets and scoring the accuracy of the model besides scikit's built-in functions – Ryan May 02 '15 at 07:07
  • I received the same error when using the example and NearestNeighbor. – user2757902 May 02 '15 at 07:51

4 Answers


The classifier thinks your dataset has only a single entry. It probably interprets the vector of ids as a row vector (one sample with many features) instead of a column vector (many samples with one feature).

Try

model.fit(train_feature.as_matrix(['id']), train_class.as_matrix(['gender']))

and see if it helps.
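
For illustration, a minimal sketch of the shape difference (the DataFrame here is a hypothetical stand-in for your train_feature; as_matrix(['id']) and .values.reshape(-1, 1) both produce the (N, 1) column vector that fit expects):

import numpy as np
import pandas as pd

# Hypothetical stand-in for the question's train_feature DataFrame
train_feature = pd.DataFrame({'id': np.arange(24000)})

X_1d = train_feature['id'].values                   # shape (24000,): older scikit-learn
                                                    # versions read this as a single sample
X_col = train_feature['id'].values.reshape(-1, 1)   # shape (24000, 1): 24000 samples, 1 feature

print(X_1d.shape, X_col.shape)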

cfh
  • I got this: ValueError: Found arrays with inconsistent numbers of samples: [ 1 245386]. But I checked the length of each training set and both are the same as their corresponding class set – user2757902 May 03 '15 at 04:09
  • Also, when I try to pass in the datasets directly and see what is going on inside the function, I have the same issue as this question: http://stackoverflow.com/questions/23420605/clustering-using-a-custom-distance-metric-for-lat-long-pairs – user2757902 May 03 '15 at 05:36
  • @user2757902: Check that not only the lengths, but also the shapes of the arrays match. I suspect that one is a row vector and the other a column vector. Both should be column vectors (shape (N,1)) when you use them as input to `fit`. – cfh May 03 '15 at 09:47
  • I tried this with some simple arrays and ran into the same issue. Here is the code: http://pastie.org/private/9afcezwmgkla1xkywh9qzq and a screenshot of the parameters of the distance function compared to the input array: http://i.imgur.com/eXy4vaA.png – user2757902 May 03 '15 at 21:35
  • @user2757902: The code you posted on pastie works for me when I replace the `custom_distance` function with something that doesn't have extra dependencies. (I claimed something wrong before; the Y array actually has to be a 1D array, but you already fixed that with the `ravel()`) – cfh May 03 '15 at 21:47

I faced the same error. I have a huge DB that I draw the train and test data from, but for code-testing purposes I use a much smaller one (~0.5% of the original). In the training procedure, I test a number of different neighbor counts, for example:

for neighbor in range(5, 20): ...

The ValueError was raised for n_neighbors=19, and only when I used the small DB. The reason is that the small dataset simply did not contain enough samples to find 19 neighbors. When I tested with the full DB, no such exception was raised.

Setting algorithm='brute' might appear to work, but it will not solve the underlying problem. What you should do is check the number of observations, both training and testing, and cap the value of n_neighbors accordingly.
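
As a rough sketch of that check (the variable names are placeholders borrowed from the question, and 19 mirrors the upper bound in the loop above):

from sklearn.neighbors import KNeighborsClassifier

# Never request more neighbors than there are training samples
max_neighbors = min(19, len(train_feature))

for n in range(5, max_neighbors + 1):
    model = KNeighborsClassifier(n_neighbors=n)
    model.fit(train_feature, train_class)
    print(n, model.score(test_feature, test_class))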

viajero cósmico

Just set the n_neighbors value:

knn = KNeighborsClassifier(n_neighbors=1)
abhimanyu

I figured it out. I needed to set the algorithm to brute force and pass the distance function directly as the metric:

model = KNeighborsClassifier(metric=machine_learning.custom_distance, algorithm='brute', n_neighbors=50)
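
For completeness, a sketch of the whole flow with that fix (it assumes machine_learning.custom_distance, lookup_dict, and the DataFrames/Series from the question; X still has to be an (N, 1) column vector of ids):

from sklearn.neighbors import KNeighborsClassifier

# Brute force computes pairwise distances directly, so a Python callable works as the metric;
# each row handed to custom_distance is a one-element array holding an id, which the
# function resolves to a string via lookup_dict before computing the edit distance.
model = KNeighborsClassifier(metric=machine_learning.custom_distance,
                             algorithm='brute', n_neighbors=50)
model.fit(train_feature[['id']].values, train_class.values)
accuracy = model.score(test_feature[['id']].values, test_class.values)
print(accuracy)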
user2757902