
I have a situation where I am trying to find the 3 nearest neighbours for a given ID in my dataframe. I am using an NN algorithm (not KNN) to achieve this. The code below gives me three nearest neighbours; for the top nodes the results are fine, but for the middle and bottom ones only 1 out of 3 neighbours is correct, whereas I am aiming to have at least 2 out of 3 neighbours correct for every ID. My dataset has 47 features and 5000 points.

from sklearn.neighbors import KDTree
import numpy as np

# Build the KD-tree once over the full feature matrix X
kdt = KDTree(X, leaf_size=40, metric='euclidean')

def findsuccess(sso_id):
    # k=4: the query point itself plus its 3 nearest neighbours
    neighbors_f_sso_id = kdt.query([X[sso_id]], k=4, return_distance=False)[0]
    print('Neighbors of id', sso_id, ':', neighbors_f_sso_id)

The above code will return the ID itself and its 3 nearest neighbours, hence k=4.

I have read that, due to the curse of dimensionality, this NN algorithm might not work well since my dataset has about 47 features, but this is the only option I think I have for a data frame without a target variable. There is one article available here that says the KD tree is not the best algorithm to use.
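(For context, scikit-learn's `NearestNeighbors` wrapper can pick between a KD tree, a ball tree, and brute force automatically, which may be worth trying at this dimensionality. A minimal sketch, using synthetic data as a stand-in for the real 5000 x 47 matrix:)

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 47))  # stand-in for the real feature matrix

# algorithm='auto' lets scikit-learn choose KD tree, ball tree, or brute force
nn = NearestNeighbors(n_neighbors=4, algorithm='auto').fit(X)
distances, indices = nn.kneighbors(X)
# indices[i, 0] is point i itself; indices[i, 1:4] are its 3 nearest neighbours
```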

What would be the best way to achieve maximum accuracy, i.e. minimum distance? Do I need to perform scaling before passing the data into the KD tree? Is there anything else I need to take care of?
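(For reference, scaling before building the tree would look something like this; `X` here is synthetic data standing in for the real feature matrix:)

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KDTree

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 47))  # stand-in for the real feature matrix

# Zero mean, unit variance per feature, so no single feature dominates distances
X_scaled = StandardScaler().fit_transform(X)

kdt = KDTree(X_scaled, leaf_size=40, metric='euclidean')
neighbours = kdt.query(X_scaled[:1], k=4, return_distance=False)
```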

Django0602
  • As you mention, you can try [standardizing](https://scikit-learn.org/stable/modules/preprocessing.html) the data before computing the nearest neighbors. – hilberts_drinking_problem Feb 27 '20 at 19:55
  • Unsupervised algos do not have such a measure as accuracy. There are about 30 different metrics of judging if clustering might be good or not (summarized in R's `NbClust` package). It's you, analyst, who with help of those metrics and business logic might decide if clustering helps in achieving business goals. Concerning standardization and distance measures, sometimes it helps sometimes it hurts. Depends on task at hand. – Sergey Bushmanov Feb 27 '20 at 19:59
  • @SergeyBushmanov: So in that case it's more trial and error? If I need to perform standardisation, for such a problem statement, which would be the best method, A minmax or a standard scalar? – Django0602 Feb 27 '20 at 20:22
  • Nobody knows in advance. Try and see which one is better. Beware, standardization removes some information from your analysis, making all features equal, which might be undesirable in some situations, e.g. with price or age. But in the end it's you again, to say that some method of clustering, and data preprocessing, is better than other. – Sergey Bushmanov Feb 27 '20 at 20:26
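(A small sketch of the "try and see" suggestion from the comments: build the tree under both scalers and compare the neighbour sets they produce. The data here is synthetic, with deliberately mismatched feature scales:)

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.neighbors import KDTree

rng = np.random.default_rng(42)
# 5 features on wildly different scales, standing in for real data
X = rng.normal(size=(200, 5)) * np.array([1, 10, 100, 1000, 10000])

results = {}
for name, scaler in [('standard', StandardScaler()), ('minmax', MinMaxScaler())]:
    Xs = scaler.fit_transform(X)
    # Neighbours of the first point (itself plus 3 nearest)
    idx = KDTree(Xs).query(Xs[:1], k=4, return_distance=False)[0]
    results[name] = idx
    print(name, idx)
```

The two neighbour lists can then be compared against domain knowledge to judge which scaling preserves the similarities that matter.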
