
I'm working on a recognition problem (faces) and trying to reduce the problem size. I started with training data in a feature-wise coordinate system of 120 dimensions, but through PCA I found a better PC-wise coordinate system that needs only 20 dimensions while still retaining 95% of the variance.
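
A minimal sketch of that reduction step, assuming scikit-learn and a 100,000 × 120 feature matrix named `X` (the name and the random placeholder data are hypothetical):

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(100_000, 120)      # placeholder for the real 100,000 x 120 feature matrix

# Keep as many principal components as needed to explain 95% of the variance;
# on the real data this should land near the 20 components mentioned above.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape, pca.explained_variance_ratio_.sum())
```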

I began thinking that recognition is, by definition, a classification problem. Points in n-space belonging to the same object/face/whatever should cluster together. For example, if 5 instances of the same individual are in the training data, they would form a cluster, and the mid-point (centroid) of that cluster could be computed numerically, e.g. with k-means.
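
Since the identities are known in the training set, the mid-point of one person's cluster is simply the mean of their points in the 20-dimensional PC space; k-means with k = 1 run on those points converges to the same centroid. A tiny sketch (the slicing is hypothetical):

```python
import numpy as np

person_points = X_reduced[:5]            # hypothetical: the 5 headshots of one individual
centroid = person_points.mean(axis=0)    # 20-dim centroid; k-means with k=1 gives the same point
```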

I have 100,000 observations, and each person is represented by 5-10 headshots. This means that instead of comparing a novel input to 100,000 points in my 20-dimensional space, I could compare it to roughly 10,000-20,000 centroids. Can k-means be used like this, or have I misinterpreted it? k is obviously not known in advance, but I've been reading up on ways to find an optimal k.
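
A sketch of what I have in mind, assuming scikit-learn's MiniBatchKMeans; the value of k below is only a placeholder guess, not something I've settled on:

```python
from sklearn.cluster import MiniBatchKMeans

k = 15_000                                   # placeholder: roughly one cluster per person
km = MiniBatchKMeans(n_clusters=k, batch_size=10_000, random_state=0)
km.fit(X_reduced)                            # X_reduced: the 100,000 x 20 PCA-reduced data
centroids = km.cluster_centers_              # shape (k, 20); novel inputs get compared to these
```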

My specific recognition problem doesn't use neural nets, just simple Euclidean distances between points.
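
Concretely, the comparison step is just nearest centroid by Euclidean distance, something like this sketch (using the `centroids` array from the fit above):

```python
import numpy as np

def recognize(x, centroids):
    """Return the index of the centroid closest to the 20-dim query point x."""
    distances = np.linalg.norm(centroids - x, axis=1)   # Euclidean distance to every centroid
    return int(np.argmin(distances))
```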

gator
  • There might be some merit to doing that (in Euclidean space), but there are two big downsides: (A) k-means is a heuristic; it does not guarantee a global optimum (the problem is NP-hard). (B) k-means takes the parameter k a priori; it does not choose k by itself. You did not specify many requirements for your task, but did you try a metric tree, e.g. a kd-tree / ball-tree? – sascha Dec 11 '18 at 19:56
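
To illustrate the metric-tree suggestion from the comment: a minimal sketch using scikit-learn's BallTree (a kd-tree works the same way via `KDTree`), queried against the raw reduced points rather than centroids:

```python
from sklearn.neighbors import BallTree

tree = BallTree(X_reduced)                     # build once over the 100,000 x 20 training points
dist, idx = tree.query(X_reduced[:1], k=5)     # 5 nearest neighbours of one query point
```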
