I am trying to find the 3 nearest neighbours of a given ID in my dataframe. I am using an NN algorithm (not KNN) to do this. The code below gives me the three nearest neighbours. For the top node the results are fine, but for the middle and bottom ones only 1 of the 3 neighbours is correct, whereas I am aiming for at least 2 of 3 correct for every ID. My dataset has 47 features and 5000 points.
from sklearn.neighbors import KDTree

# X is the (5000, 47) feature matrix; row i holds the features of one ID
kdt = KDTree(X, leaf_size=40, metric='euclidean')

def findsuccess(i):
    # k=4 because the nearest hit is row i itself, followed by its 3 neighbours
    neighbors_f_sso_id = kdt.query([X[i]], k=4, return_distance=False)[0]
    print('Neighbors of id', neighbors_f_sso_id)
The above code returns the ID itself plus its 3 nearest neighbours, hence k=4.
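A minimal self-contained sketch of that query, using random data as a stand-in for the real 5000 x 47 matrix (the random data and the names `rng`, `dist`, `ind` and `neighbours` are assumptions for illustration only). It shows that the first column returned is each row's own index, so the three actual neighbours sit in the remaining columns:

```python
import numpy as np
from sklearn.neighbors import KDTree

# Stand-in data: 5000 points with 47 features (assumption, not the real dataset)
rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 47))

kdt = KDTree(X, leaf_size=40, metric='euclidean')

# k=4: each row's nearest hit is itself (distance 0), then its 3 neighbours
dist, ind = kdt.query(X, k=4)

# Drop the self-match in column 0 to keep only the 3 neighbours per row
neighbours = ind[:, 1:]
print(neighbours[:5])
```

Note that dropping column 0 assumes there are no exact duplicate rows; with duplicates, the self-match is not guaranteed to come first.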
I have read that, because of the curse of dimensionality, this NN algorithm might not work well with the roughly 47 features in my dataset, but I think it is the only option I have for a data frame without a target variable. There is one article available here that says the KD tree is not the best algorithm to use.
What would be the best way to achieve maximum accuracy, meaning the minimum distance to the neighbours? Do I need to scale the features before passing them to the KD tree algorithm? Is there anything else I need to take care of?
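To make the scaling question concrete, here is a rough sketch of what scaling before the KD tree could look like. It assumes `StandardScaler` is the kind of scaling meant and uses synthetic data with deliberately mismatched feature scales (both are assumptions, as are all variable names). Euclidean distance is dominated by the features with the largest numeric range, so the neighbours found with and without scaling can differ:

```python
import numpy as np
from sklearn.neighbors import KDTree
from sklearn.preprocessing import StandardScaler

# Stand-in data whose features span very different scales (assumption)
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 47)) * np.linspace(1, 1000, 47)

# Standardise each feature to zero mean and unit variance
X_scaled = StandardScaler().fit_transform(X)

kdt_raw = KDTree(X, leaf_size=40, metric='euclidean')
kdt_scaled = KDTree(X_scaled, leaf_size=40, metric='euclidean')

# k=4 and drop column 0 (the self-match) to keep 3 neighbours per row
ind_raw = kdt_raw.query(X, k=4, return_distance=False)[:, 1:]
ind_scaled = kdt_scaled.query(X_scaled, k=4, return_distance=False)[:, 1:]

# Fraction of rows whose 3 neighbours are identical with and without scaling
same = np.mean([set(a) == set(b) for a, b in zip(ind_raw, ind_scaled)])
print(f'rows with identical neighbours: {same:.2%}')
```

Whether the scaled or unscaled neighbours count as "correct" depends on which features should matter for similarity, which this sketch cannot decide.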