Nearest Neighborhood using a confidence region

Question

I have more than 1M data points, and 32 of them (orange in the picture) belong to my true class.
I would like to find the blue points that are similar to the orange ones.
The feature vectors are just embeddings.

The approach I took is to build a pseudo 95% confidence region and then flag the points within that area as my true label. I think I cannot use a KNN algorithm for the following reasons:

I only know beforehand which points belong to the positive class.
KNN would be highly overfitted, as I only have 32 positive data points out of more than 1M data points.
Is there any other algorithm or approach that better suits this problem?
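
For reference, the confidence-region idea described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the exact procedure used in the question: it fits an ellipsoid to the positive points using the Mahalanobis distance, and takes the "pseudo 95%" threshold as the empirical 95th percentile of the positives' own distances. The function names, the `ridge` value, and the synthetic 2-D data are all hypothetical.

```python
import numpy as np


def mahalanobis_sq(points, mean, precision):
    """Squared Mahalanobis distance of each row in `points` from `mean`."""
    diff = points - mean
    return np.einsum("ij,jk,ik->i", diff, precision, diff)


def fit_region(positives, ridge=1e-3, coverage=0.95):
    """Fit a pseudo confidence region (an ellipsoid) around the positive points."""
    mean = positives.mean(axis=0)
    cov = np.atleast_2d(np.cov(positives, rowvar=False))
    # With only a few positives the covariance can be singular, so a small
    # ridge term is added to the diagonal (assumed value; tune as needed).
    cov = cov + ridge * np.eye(cov.shape[0])
    precision = np.linalg.inv(cov)
    # Empirical threshold: the squared distance covering `coverage` of the positives.
    threshold = np.quantile(mahalanobis_sq(positives, mean, precision), coverage)
    return mean, precision, threshold


def flag_candidates(points, mean, precision, threshold):
    """Boolean mask of the points that fall inside the fitted region."""
    return mahalanobis_sq(points, mean, precision) <= threshold


# Synthetic stand-in data (assumption): 32 positives and 10,000 unlabelled points.
rng = np.random.default_rng(0)
positives = rng.normal(loc=3.0, scale=0.5, size=(32, 2))
background = rng.normal(loc=0.0, scale=2.0, size=(10_000, 2))

mean, precision, threshold = fit_region(positives)
flags = flag_candidates(background, mean, precision, threshold)
print(flags.sum(), "of", len(background), "background points flagged")
```

Because every unlabelled point is scored independently, the same mask can be computed in chunks over the full 1M points. For high-dimensional embeddings, the ridge term (or some other covariance shrinkage) matters much more than in this 2-D toy case.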

score -1 · Answer 1 · answered Dec 06 '21 at 04:59

Clustering very large data sets tends to grind to a halt. Here's a crazy idea: can you take a random sample of the data set and work with that? If the selection is truly random, the sample is just a subset of your full data set, and the smaller piece should be representative of the whole. It should be as simple as this.

subset = df.sample(frac=0.5)

See this link for more info.

https://towardsdatascience.com/how-to-sample-a-dataframe-in-python-pandas-d18a3187139b

answered Dec 06 '21 at 04:59

ASH

Thanks for your answer, but I am not sure I understood it well. I already sampled the dataset from a 50M-point dataset. Any suggestions about the approach? I am not sure it is really clustering, because I already know beforehand that I have 32 labelled points. – 3nomis Dec 06 '21 at 08:59
That doesn't sound like a clustering experiment. Clustering is unsupervised because you don't know what the outcome will be. – ASH Dec 08 '21 at 19:08
