
I work with the knn algorithm in R.

The algorithm selects the k "closest" points in feature space and calculates predictions/probabilities based on these k closest points.

My problem/question is: can I specify a maximum distance? For some points, the "k nearest neighbors" may be so far away that it would not make sense to use them. So I need an extended version of the algorithm that gives me an "NA" if all of the k closest points are "too far" away. I would also like to be able to specify this threshold as a hyperparameter and tune it later on.

Does such a variant exist? And is it already implemented in R?
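
To make clearer what I am after, here is a rough, unoptimized base-R sketch of the behaviour I have in mind (`knn_maxdist` and `max_dist` are just placeholder names, not an existing function):

```r
# Naive k-NN majority vote that returns NA when all k nearest neighbours
# are farther away than `max_dist` (Euclidean distance).
knn_maxdist <- function(train_x, train_y, test_x, k = 5, max_dist = Inf) {
  train_x <- as.matrix(train_x)
  test_x  <- as.matrix(test_x)
  apply(test_x, 1, function(x) {
    d  <- sqrt(colSums((t(train_x) - x)^2))  # distances to all training points
    nn <- order(d)[seq_len(k)]               # indices of the k nearest neighbours
    if (min(d[nn]) > max_dist) return(NA)    # even the closest neighbour is too far
    names(which.max(table(train_y[nn])))     # otherwise: majority vote
  })
}

# The second test point is far from all training data, so it gets NA:
knn_maxdist(iris[, 1:4], iris$Species,
            rbind(c(5, 3, 1.5, 0.2), c(50, 30, 15, 2)),
            k = 5, max_dist = 2)
```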

  • You can check out [kknn](https://www.rdocumentation.org/packages/kknn/versions/1.3.1/topics/kknn); it uses kernel functions to weight the neighbors according to their distances. I trust you can define your own function that returns 0 weight if the neighbors are more distant than you desire, but I don't think you will improve accuracy over the optimal kernel implemented in the package. Interestingly, this package also allows setting the parameter of the Minkowski distance; from my modest experience with it, the optimal Minkowski parameter is rarely close to 2 (Euclidean distance). A basic kknn call is sketched below the comments. – missuse Mar 23 '21 at 09:03
  • Thank you, @missuse. My idea is that the algorithm should not give any vote if the neighbors are too far away, so I would have a prediction that applies only to a subset of the test set (and accuracy would then only be calculated on this subset). – ds_col Mar 23 '21 at 09:28
  • Here is a blog post about [KernelKnn](https://www.r-bloggers.com/2016/07/kernel-k-nearest-neighbors/), in which you can define a weights function; the returned weights can be 0 for neighboring points that do not satisfy a criterion. Btw, why did you tag this with mlr3? It has nothing to do with mlr3. – missuse Mar 23 '21 at 09:43
  • I tagged mlr3 because I need to include the (adapted) algorithm in a larger automated framework, where I could let a hyperparameter tuning algorithm determine the distance threshold (given that I could adapt the performance measure so that only predictions without NA are used). I think it is then more a question about the framework than about the algorithm per se. – ds_col Mar 23 '21 at 10:35
  • Something like DBSCAN might be more suitable here: points that are too far from anything else can be assigned to a "noise" cluster (see the DBSCAN sketch below the comments). – Lars Kotthoff Mar 23 '21 at 15:47
  • If it doesn't need to be incredibly fast and memory-efficient, then you can just write it in R yourself, as knn is pretty simple. See the [chapter on how to implement a Learner](https://mlr3book.mlr-org.com/extending-learners.html). – mb706 Mar 23 '21 at 17:23
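
Regarding the kknn suggestion above, here is a minimal example of the standard interface (plain kknn usage, not the max-distance variant; `kernel` weights the neighbours by distance and `distance` is the Minkowski parameter mentioned in the comment):

```r
library(kknn)

set.seed(1)
idx   <- sample(nrow(iris), 100)
train <- iris[idx, ]
test  <- iris[-idx, ]

# distance = 1 is Manhattan, distance = 2 is Euclidean
fit <- kknn(Species ~ ., train = train, test = test,
            k = 7, distance = 1, kernel = "triangular")
head(fitted(fit))  # predicted classes for the test set
head(fit$prob)     # class membership probabilities
```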

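Regarding the DBSCAN suggestion, a small sketch of how noise points are flagged (using the `dbscan` package; the `eps` and `minPts` values here are arbitrary, not tuned):

```r
library(dbscan)

x  <- scale(iris[, 1:4])
cl <- dbscan(x, eps = 0.5, minPts = 5)
table(cl$cluster)  # cluster 0 contains the "noise" points far from everything else
```
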
0 Answers