
I am using the scikit-learn KNeighborsClassifier for classification on a dataset with 4 output classes. The following is the code that I am using:

from sklearn import neighbors

knn = neighbors.KNeighborsClassifier(n_neighbors=7, weights='distance', algorithm='auto', leaf_size=30, p=1, metric='minkowski')

The model works correctly. However, I would like to provide user-defined weights for each sample point. The code currently scales each neighbour's vote by the inverse of its distance via the weights='distance' parameter.

I would like to keep the inverse distance scaling, but each sample point also has a probability weight that I would like to fold into the distance calculation. For example, if x is the test point and y, z are two nearest neighbours with weights wy and wz, then I would like the distances to be computed as (sum|x-y|)*wy and (sum|x-z|)*wz respectively.

I tried to define a function to pass into the weights argument, but I want it to apply the inverse distance scaling in addition to my user-defined weights, and I do not know what function sklearn uses internally for the inverse distance scaling. I could not find an answer in the documentation.
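For reference, here is my understanding of the weights callable (toy data made up for illustration): it receives the array of neighbour distances and must return an array of the same shape, and passing 1/d reproduces weights='distance'. The callable never sees which training samples the distances belong to, which is why I cannot fold my per-sample weights into it:

```python
import numpy as np
from sklearn import neighbors

# Toy data just to illustrate (my real data has 4 classes)
X = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# A callable passed to weights= gets the neighbour-distance array and
# returns same-shaped weights; 1/d reproduces weights='distance'
def inverse_distance(dist):
    return 1.0 / dist

knn_builtin = neighbors.KNeighborsClassifier(n_neighbors=3, weights='distance').fit(X, y)
knn_custom = neighbors.KNeighborsClassifier(n_neighbors=3, weights=inverse_distance).fit(X, y)

# Both give identical predictions on query points away from the training data
Xq = np.array([[1.5], [9.5]])
```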

Any suggestions?

3 Answers


KNN in sklearn doesn't support sample weights, unlike other estimators such as DecisionTreeClassifier. Personally, I think that is a disappointment: it would not be hard to make KNN support sample weights, since the predicted label is just the majority vote of the neighbours. A crude workaround is to replicate samples yourself according to their weights, e.g. if a sample has weight 2, make it appear twice in the training set.
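For integer weights, the duplication trick is a one-liner with np.repeat (toy data below is made up for illustration):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[0.0], [1.0], [2.0], [10.0], [11.0]])
y = np.array([0, 0, 0, 1, 1])
w = np.array([1, 1, 1, 2, 2])  # integer sample weights

# A sample with weight k simply appears k times in the training set
X_rep = np.repeat(X, w, axis=0)
y_rep = np.repeat(y, w)

knn = KNeighborsClassifier(n_neighbors=3).fit(X_rep, y_rep)
```

Non-integer weights would first need rescaling and rounding, or the probabilistic resampling shown in the other answer.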

Kai Wang

You can use resampling to adapt your sample weights with K-neighbors since the sklearn implementation does not include sample weights. Here is how you could do this:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Get training and testing data; get_train_data()/get_test_data() are
# placeholders for however you load your own dataset
Xtrain, ytrain, sample_weight_train = get_train_data()
Xtest, ytest, sample_weight_test = get_test_data()

# Derive probability values from your sample weights
prob_train = np.asarray(sample_weight_train) / np.sum(sample_weight_train)
upsample_size = int(np.max(prob_train) / np.min(prob_train) * len(ytrain))
newids = np.random.choice(range(len(ytrain)), size=upsample_size, p=prob_train, replace=True)

# Upsample training data using sample weights as probabilities
# so that the data distribution is upsampled to fit the corresponding sample weights
Xtrain, ytrain = Xtrain[newids,:], ytrain[newids]

# Fit your model
model = KNeighborsClassifier()
model = model.fit(Xtrain, ytrain)
ypred = model.predict(Xtest)
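To make the idea concrete, here is the same procedure end-to-end on made-up toy data, with the data-loading placeholders replaced by hard-coded arrays and a seeded generator for reproducibility:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# Toy data: class 1 samples carry double the weight of class 0 samples
Xtrain = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
ytrain = np.array([0, 0, 0, 1, 1, 1])
sample_weight_train = np.array([1.0, 1.0, 1.0, 2.0, 2.0, 2.0])

# Turn sample weights into resampling probabilities
prob_train = sample_weight_train / sample_weight_train.sum()
upsample_size = int(prob_train.max() / prob_train.min() * len(ytrain))
newids = rng.choice(len(ytrain), size=upsample_size, p=prob_train, replace=True)

# Heavier samples appear more often in the resampled training set
model = KNeighborsClassifier(n_neighbors=3).fit(Xtrain[newids, :], ytrain[newids])
```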

ProteinGuy

sklearn.neighbors.KNeighborsClassifier.score() has a sample_weight parameter. Is that what you're looking for?
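Worth noting: sample_weight in score() only reweights the accuracy computation; it does not affect fitting or prediction. A quick sketch on made-up data:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

Xtrain = np.array([[0.0], [1.0], [10.0], [11.0]])
ytrain = np.array([0, 0, 1, 1])
knn = KNeighborsClassifier(n_neighbors=1).fit(Xtrain, ytrain)

Xtest = np.array([[0.2], [9.8], [4.0]])  # 4.0 is nearest to a class 0 sample ...
ytest = np.array([0, 1, 1])              # ... but labelled 1, so it is misclassified

# sample_weight only reweights the reported accuracy;
# the fitted model and its predictions are unchanged
acc_unweighted = knn.score(Xtest, ytest)                                       # 2/3
acc_weighted = knn.score(Xtest, ytest, sample_weight=np.array([1.0, 1.0, 2.0]))  # 2/4 = 0.5
```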

ItM