Keypoint recognition as classification?

Question

At the end of the introduction to this instructive Kaggle competition, they state that the methods used in "Viola and Jones' seminal paper works quite well". However, that paper describes a system for binary facial recognition, and the problem being addressed here is the classification of keypoints, not entire images. I am having a hard time figuring out how, exactly, I would go about adapting the Viola/Jones system for keypoint recognition.

I assume I should train a separate classifier for each keypoint, and some ideas I have are:

1. Iterate over sub-images of a fixed size and classify each one, where a sub-image with a keypoint as its center pixel is a positive example. In this case I'm not sure what I would do with pixels close to the edge of the image.
2. Instead of training binary classifiers, train classifiers with l*w possible classes (one for each pixel). The big problem with this is that I suspect it will be prohibitively slow, as every weak classifier suddenly has to do l*w*original operations.
3. The third idea isn't totally hashed out in my mind, but since the keypoints are each parts of a larger part of a face (the left, right, and center of an eye, for example), maybe I could try to classify sub-images as just an eye, and then use the left, right, and center pixels (centered in the y coordinate) of the best-fit sub-image for each face part.

Is there any merit to these ideas, and are there methods I haven't thought of?
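As a rough illustration of the first idea, here is a minimal sketch of how the training patches could be generated, with the border handled by padding. The function name, the `half` and `pad_value` parameters, and the choice of constant padding are all assumptions made for illustration; mirroring the border or skipping edge pixels would be equally valid choices.

```python
def extract_patches(image, keypoint, half=2, pad_value=0):
    """Yield (patch, label) for every pixel of a 2-D list `image`.

    `keypoint` is a (row, col) tuple. The label is 1 only for the patch
    whose center pixel is the keypoint, and 0 otherwise.
    """
    height, width = len(image), len(image[0])
    size = 2 * half + 1

    # Pad the border so that patches centered on edge pixels are full size.
    blank_row = [pad_value] * (width + 2 * half)
    padded = [list(blank_row) for _ in range(half)]
    padded += [[pad_value] * half + list(row) + [pad_value] * half for row in image]
    padded += [list(blank_row) for _ in range(half)]

    for r in range(height):
        for c in range(width):
            # Pixel (r, c) sits at (r + half, c + half) in the padded image,
            # so its patch starts at row r and column c.
            patch = [padded[r + dr][c:c + size] for dr in range(size)]
            label = 1 if (r, c) == keypoint else 0
            yield patch, label
```

Note that this produces exactly one positive example per image and keypoint, so the classes are heavily imbalanced; in practice one would probably subsample the negatives.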

score 3 · Answer 1 · answered Oct 23 '13 at 04:43

however, that paper describes a system for binary facial recognition

No, read the paper carefully. What they describe is not face-specific; face detection was only the motivating problem. The Viola-Jones paper introduced a new strategy for binary object recognition.

You could train a Viola-Jones-style cascade for eyes, another for the nose, and so on, one for each keypoint you are interested in.

Then, when you run the detectors, you should (hopefully) get 2 eyes, 1 nose, etc., for each face.

Provided you get the number of items you expected, you can then say "here are the key points!" What takes more work is getting enough data to build a good detector for each thing you want to detect, and gracefully handling false positives / negatives.
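A minimal sketch of that last step, assuming each detector returns a list of `(x, y, w, h)` boxes (the layout OpenCV's cascade detector uses). The function names and the `expected_counts` dictionary are hypothetical, and the detectors themselves are not shown. Each box is reduced to its center pixel, and any part whose detection count is off is set aside for separate handling.

```python
def box_center(box):
    """Return the center pixel (x, y) of an (x, y, w, h) bounding box."""
    x, y, w, h = box
    return (x + w // 2, y + h // 2)


def detections_to_keypoints(detections, expected_counts):
    """Turn per-part detections into keypoints.

    `detections` maps a part name to a list of (x, y, w, h) boxes.
    `expected_counts` maps a part name to how many boxes a face should have.
    Returns (keypoints, unresolved): keypoints maps each part with the
    expected number of detections to its sorted box centers; unresolved
    lists the parts with missing or extra detections.
    """
    keypoints, unresolved = {}, []
    for part, expected in expected_counts.items():
        boxes = detections.get(part, [])
        if len(boxes) == expected:
            keypoints[part] = sorted(box_center(b) for b in boxes)
        else:
            # False positives or negatives: needs extra logic, e.g. keeping
            # the highest-scoring boxes or using the face geometry.
            unresolved.append(part)
    return keypoints, unresolved
```

For example, two eye boxes and two nose boxes with `expected_counts = {"eye": 2, "nose": 1}` would give two eye keypoints and leave `"nose"` in the unresolved list.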

Raff.Edward


I'm not confused about the fact that the system they describe is for general object recognition. I am confused about how to recognize an object (i.e., a group of pixels) vs. an individual pixel location. – mavix Oct 23 '13 at 06:23

How could you ever identify any object from a single pixel? You can't. You can identify a nose, which takes up multiple pixels. If you needed a center most position, you could take the center pixel of what was determined to be the nose. – Raff.Edward Oct 23 '13 at 22:26
Again, the problem is identification of a pixel location, and NOT object recognition. Which is precisely the issue. – mavix Oct 24 '13 at 18:55

No, pixel location is not the issue. It is not possible to say "this is THE nose pixel". You are asking for something that is not possible. The pixel shown in your link is meant to be the CENTER of a nose / eye. The problem is object recognition. The representation you are asking about just happens to mark only one pixel. – Raff.Edward Oct 24 '13 at 21:55
...so you're saying I should do the first thing I listed? – mavix Oct 25 '13 at 19:31
There are more steps than what you listed for performing accurate object classification. You should read up on that material. It is not a simple task that you can learn overnight. What you described, on its own, will probably miss most objects. – Raff.Edward Oct 25 '13 at 23:10

score 0 · Accepted Answer · answered Aug 27 '15 at 16:37

I ended up working on this problem extensively. I used "deep learning," i.e., neural networks with several layers; specifically, I used convolutional networks. You can learn more about them by checking out these demos:

http://cs.stanford.edu/people/karpathy/convnetjs/demo/mnist.html

http://deeplearning.net/tutorial/lenet.html#lenet

I made the following changes to a typical convolutional network:

I did not do any down-sampling, as any loss of precision directly translates to a decrease in the model's score
I did n-way binary classification, with each pixel being classified as a keypoint or non-keypoint (#2 in the things I listed in my original post). As I suspected, computational complexity was the primary barrier here. I tried to use my GPU to overcome these issues, but the number of parameters in the neural network was too large to fit in GPU memory, so I ended up using an xl Amazon instance for training.
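To make the per-pixel formulation concrete, here is a minimal sketch of one possible reading of it, not necessarily what the linked repo does. All function names are hypothetical. A keypoint is encoded as a flattened binary target with a single 1, the prediction is decoded by taking the highest-scoring pixel, and a small helper shows why a full-resolution dense output layer gets expensive.

```python
def keypoint_to_target(keypoint, height, width):
    """Encode a (row, col) keypoint as a flat 0/1 target, one entry per pixel."""
    row, col = keypoint
    target = [0] * (height * width)
    target[row * width + col] = 1
    return target


def scores_to_keypoint(scores, width):
    """Decode flat per-pixel scores back to the (row, col) of the best pixel."""
    best = max(range(len(scores)), key=scores.__getitem__)
    return divmod(best, width)


def dense_output_weights(n_inputs, height, width):
    """Weights in a fully connected layer from n_inputs units to one output
    per pixel (biases ignored); this is needed once per keypoint."""
    return n_inputs * height * width
```

With no down-sampling the output layer scales with the full image area: for an assumed 96x96 image and 500 hidden units, `dense_output_weights(500, 96, 96)` is 4,608,000 weights for a single keypoint, which gives some intuition for the memory problem described above.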

Here's a github repo with some of the work I did: https://github.com/cowpig/deep_keypoints

Anyway, given that deep learning has blown up in popularity, there are surely people who have done this much better than I did, and published papers about it. Here's a write-up that looks pretty good:

http://danielnouri.org/notes/2014/12/17/using-convolutional-neural-nets-to-detect-facial-keypoints-tutorial/
