
I have more than 1M data points, and 32 of them (orange in the picture) belong to my true class.
I would like to find the blue points that are similar to the orange ones.
The feature vectors are just embeddings.
The approach I took is to build a pseudo 95% confidence region around the orange points and then flag every point that falls inside that region as belonging to my true class. I think I cannot use a KNN algorithm for the following reasons:

  • I only know beforehand which points belong to the positive class.
  • KNN would be highly overfitted, as I only have 32 positive data points out of more than 1M data points.

Is there any other algorithm or approach that suits this problem better?
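For reference, the confidence-region approach described above can be sketched as follows. The question does not say how the region is built, so this is only one possible reading: it assumes a Mahalanobis-distance region around the mean of the 32 positives, with the threshold set to the empirical 95% quantile of the positives' own distances. The function name `flag_similar`, the `shrinkage` value, and the synthetic data are all hypothetical. The shrinkage step is there because a covariance matrix estimated from 32 points is singular once the embedding has more than 31 dimensions.

```python
import numpy as np

def flag_similar(X, pos_idx, quantile=0.95, shrinkage=0.1):
    """Flag rows of X inside a pseudo confidence region fitted on the positives."""
    pos = X[pos_idx]
    mu = pos.mean(axis=0)
    d = X.shape[1]
    # Sample covariance of the positives (rows are observations)
    cov = np.cov(pos, rowvar=False)
    # Shrink toward a scaled identity so the matrix stays invertible;
    # 0.1 is an assumed value that would need tuning on real embeddings
    cov = (1 - shrinkage) * cov + shrinkage * (np.trace(cov) / d) * np.eye(d)
    prec = np.linalg.inv(cov)

    def sq_mahalanobis(A):
        diff = A - mu
        return ((diff @ prec) * diff).sum(axis=1)

    # Threshold: squared distance that contains `quantile` of the known positives
    threshold = np.quantile(sq_mahalanobis(pos), quantile)
    return sq_mahalanobis(X) <= threshold

# Synthetic stand-in for the embeddings (not the real data)
rng = np.random.default_rng(0)
background = rng.normal(0.0, 5.0, size=(10_000, 8))
positives = rng.normal(3.0, 0.3, size=(32, 8))
X = np.vstack([background, positives])
pos_idx = np.arange(10_000, 10_032)

mask = flag_similar(X, pos_idx)
print(mask.sum(), "points flagged as similar to the positives")
```

Because the threshold comes from the positives themselves, roughly 5% of the known positives fall outside the region by construction. Whether the flagged blue points are really "similar" still has to be checked by hand.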
3nomis

1 Answer


Clustering a very large data set tends to grind to a halt. Here's a crazy idea: can you take a random sample of the data set and work with that? If the selection is truly random, the sample is just a smaller subset of your full data set, and it should be fairly representative of the whole thing. It should be as simple as this.

# Assumes df is an existing pandas DataFrame; keeps a random 50% of its rows
subset = df.sample(frac=0.5)

See this link for more info.

https://towardsdatascience.com/how-to-sample-a-dataframe-in-python-pandas-d18a3187139b
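One caveat with a plain random sample: when only 32 rows are labelled, `df.sample(frac=0.5)` would on average drop about half of them too. A minimal sketch of one way around that is to sample only the unlabelled rows and keep every labelled one. The column name `is_positive` and the toy frame are assumptions for illustration, not part of the original data.

```python
import pandas as pd

# Toy frame standing in for the real data; `is_positive` is a hypothetical
# column marking the 32 known points
df = pd.DataFrame({
    "x": range(1000),
    "is_positive": [i < 32 for i in range(1000)],
})

# Keep every labelled row
positives = df[df["is_positive"]]
# Sample only the unlabelled rows; random_state makes the draw repeatable
unlabelled = df[~df["is_positive"]].sample(frac=0.5, random_state=0)

subset = pd.concat([positives, unlabelled])
print(len(subset), "rows kept,", int(subset["is_positive"].sum()), "of them labelled")
```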

ASH
  • Thanks for your answer, but I am not sure I understood it correctly. I already sampled this dataset from a 50M-point dataset. Any suggestions about the approach? I am not sure it is really clustering, because I already know beforehand that I have 32 labelled points. – 3nomis Dec 06 '21 at 08:59
  • That doesn't sound like a clustering experiment. Clustering is unsupervised because you don't know what the outcome will be. – ASH Dec 08 '21 at 19:08