I'm using scikit-learn's KNN with Levenshtein distance to do some work on strings, much like the example at the bottom of this page: http://scikit-learn.org/stable/faq.html . The difference is that my data is split into training and test sets and is held in a dataframe.
The split is listed here:
train_feature, test_feature, train_class, test_class = train_test_split(features, classes,
    test_size=TEST_SET_SIZE, train_size=TRAINING_SET_SIZE,
    random_state=42)
I have the following:
>>> model = KNeighborsClassifier(metric='pyfunc',func=machine_learning.custom_distance)
>>> model.fit(train_feature['id'], train_class.as_matrix(['gender']))
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='pyfunc',
metric_params={'func': <function custom_distance at 0x7fd0236267b8>},
n_neighbors=5, p=2, weights='uniform')
Here train_feature has a single column, id ([24000 rows x 1 columns]), and train_class (Name: gender, dtype: object) is a series of "gender" values, each either 'M' or 'F'. Each id corresponds to a key in a dict defined elsewhere.
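For reference, here is a minimal sketch of how I have been checking the shapes of what goes into fit. It uses a small made-up dataframe in place of my real one (the six ids are hypothetical). Single brackets give a 1-D Series, while double brackets keep a 2-D one-column frame, which may matter for how many samples the model thinks it has:

```python
import pandas as pd

# Hypothetical stand-in for train_feature: one 'id' column of integer keys.
train_feature = pd.DataFrame({"id": [0, 1, 2, 3, 4, 5]})

as_series = train_feature["id"]    # single brackets: 1-D Series
as_frame = train_feature[["id"]]   # double brackets: 2-D, one-column DataFrame

print(as_series.shape)  # (6,)
print(as_frame.shape)   # (6, 1)
```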
The custom distance function is:
def custom_distance(x, y):
    i, j = int(x[0]), int(y[0])
    return damerau_levenshtein_distance(lookup_dict[i], lookup_dict[j])
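Since lookup_dict and damerau_levenshtein_distance are not shown above, here is a self-contained sketch of the same idea that can be run on its own. The dict contents are made up, and the plain Levenshtein function is only a stand-in (it does not handle transpositions the way Damerau-Levenshtein does):

```python
# Hypothetical lookup table: integer id -> string (stand-in for my real dict).
lookup_dict = {0: "anna", 1: "hanna", 2: "john"}


def levenshtein(a, b):
    """Plain Levenshtein edit distance (no transpositions); used here only
    as a stand-in for damerau_levenshtein_distance."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def custom_distance(x, y):
    # x and y are one-element rows holding an id; map each id to its string.
    i, j = int(x[0]), int(y[0])
    return levenshtein(lookup_dict[i], lookup_dict[j])


print(custom_distance([0], [1]))  # 1 ("anna" -> "hanna" is one insertion)
```

The point of the indirection is that the model only ever sees numeric ids, and the strings are looked up inside the metric.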
When I try to get the accuracy of the model:
accuracy = model.score(test_feature, test_class)
I receive this error:
ValueError: Expected n_neighbors <= 1. Got 5
I'm honestly really confused. I've checked the length of each of my datasets and they are fine. Why would it be telling me I only have one data point to plot from? Any help would be greatly appreciated.