I have more than 1M data points, and 32 of them (orange in the picture) belong to my true class.
I would like to find the blue points that are similar to the orange ones.
Feature vectors are just embeddings.
The approach I took is to build a pseudo 95% confidence region around the orange points and then flag every point inside that region as belonging to my true class.
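To make the idea concrete, here is one way such a region could be sketched. This is an illustration under assumptions, not necessarily the exact construction used: it fits an ellipsoid to the positive points using the Mahalanobis distance and takes the threshold as the empirical 95% quantile of the positives' own distances. The function name `flag_similar`, the `ridge` term, and the synthetic data are all hypothetical. The ridge term is there because, with only 32 positives, the sample covariance is singular whenever the embedding dimension is 32 or more.

```python
import numpy as np

def flag_similar(X, pos_idx, coverage=0.95, ridge=1e-3):
    """Flag rows of X that fall inside an ellipsoid fitted to the positive rows.

    Hypothetical helper: the region is defined by the squared Mahalanobis
    distance to the positives' mean, thresholded at the empirical `coverage`
    quantile of the positives' own distances.
    """
    pos = X[pos_idx]
    mean = pos.mean(axis=0)
    # Sample covariance of the positives; the ridge keeps it invertible
    # when the dimension is close to or above the number of positives.
    cov = np.atleast_2d(np.cov(pos, rowvar=False)) + ridge * np.eye(X.shape[1])
    precision = np.linalg.inv(cov)

    def sq_mahalanobis(A):
        diff = A - mean
        return np.sum((diff @ precision) * diff, axis=1)

    threshold = np.quantile(sq_mahalanobis(pos), coverage)
    return sq_mahalanobis(X) <= threshold

# Synthetic data for illustration only (not the real embeddings):
# 10,000 background points plus 32 shifted "positive" points in 8 dimensions.
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 8))
pos_idx = np.arange(32)
X[pos_idx] += 5.0

mask = flag_similar(X, pos_idx)
print("flagged points:", int(mask.sum()))
```

Because the threshold comes from the positives themselves, roughly 5% of the 32 known positives fall outside the region by construction, and the region's shape is only as reliable as a covariance estimated from 32 samples.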
I think I cannot use a KNN algorithm for the following reasons:
- I only know beforehand which points belong to the positive class.
- KNN would overfit heavily, as I only have 32 positive data points out of more than 1M data points.
Is there any other algorithm or approach that would suit this problem better?