I am working on a computer vision project and need some help. The objective is to extract the attributes of any object. For example, given a Nike running shoe, I should be able to figure out that it is a shoe in the first place, then that it is a Nike shoe and not an Adidas shoe (possibly because of the Nike tick), and then that it is a running shoe and not football studs.
I have started off by treating this as an image classification problem and I am using the following steps:
- I took training samples (around 60 each) of, say, shoes, heels and watches, and extracted their features using Dense SIFT.
- I created a vocabulary using k-means clustering (with an arbitrarily chosen vocabulary size of 600).
- I created a bag-of-words representation for the images.
- I trained an SVM classifier to obtain a bag-of-words (feature vector) for every class (shoe, heel, watch).
- For testing, I extracted the features of the test image and computed its bag-of-words representation from the already created vocabulary.
- I compared the bag-of-words of the test image with that of each class and returned the class that matched most closely.
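To make the steps above concrete, here is a rough, simplified sketch of the pipeline rather than my actual code. It assumes NumPy only. The `fake_descriptors` helper is a hypothetical stand-in for Dense SIFT that produces synthetic vectors, the vocabulary size is 12 instead of 600, and the SVM step is replaced by a plain nearest-mean-histogram comparison so that the example has no further dependencies.

```python
import numpy as np

CLASSES = ["shoe", "heel", "watch"]

def fake_descriptors(class_id, n, rng, dim=8):
    """Hypothetical stand-in for Dense SIFT: n noisy vectors for one image."""
    center = np.zeros(dim)
    center[class_id] = 5.0
    return rng.normal(center, 1.0, size=(n, dim))

def nearest_word(descriptors, vocabulary):
    """Index of the closest visual word for every descriptor."""
    diff = descriptors[:, None, :] - vocabulary[None, :, :]
    return (diff ** 2).sum(axis=2).argmin(axis=1)

def build_vocabulary(descriptors, k, n_iter=20, seed=0):
    """Plain k-means (Lloyd's algorithm) over all training descriptors."""
    rng = np.random.default_rng(seed)
    picks = rng.choice(len(descriptors), size=k, replace=False)
    centers = descriptors[picks].copy()
    for _ in range(n_iter):
        labels = nearest_word(descriptors, centers)
        for j in range(k):
            members = descriptors[labels == j]
            if len(members):  # keep the old center if a cluster is empty
                centers[j] = members.mean(axis=0)
    return centers

def bow_histogram(descriptors, vocabulary):
    """Normalized count of visual words in one image."""
    words = nearest_word(descriptors, vocabulary)
    hist = np.bincount(words, minlength=len(vocabulary)).astype(float)
    return hist / hist.sum()

def classify(descriptors, vocabulary, class_hists):
    """Return the class whose mean histogram is closest to the test image's."""
    hist = bow_histogram(descriptors, vocabulary)
    dists = np.linalg.norm(class_hists - hist, axis=1)
    return CLASSES[int(dists.argmin())]

rng = np.random.default_rng(42)
# Five synthetic "images" per class, 40 descriptors each.
train = {c: [fake_descriptors(c, 40, rng) for _ in range(5)]
         for c in range(len(CLASSES))}
all_desc = np.vstack([d for imgs in train.values() for d in imgs])
vocabulary = build_vocabulary(all_desc, k=12)
class_hists = np.array([
    np.mean([bow_histogram(d, vocabulary) for d in train[c]], axis=0)
    for c in range(len(CLASSES))
])

test_image = fake_descriptors(2, 40, rng)  # descriptors drawn from the "watch" class
print(classify(test_image, vocabulary, class_hists))
```

With real images, the synthetic descriptors would be replaced by the Dense SIFT output, and the nearest-histogram comparison by the trained SVM.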
How should I proceed from here? Will feature extraction using Dense SIFT help me identify the attributes, given that it only represents the gradients around certain points?
My classification also goes wrong sometimes. For example, if I have trained the classifier with images of a left shoe and a watch, a right shoe is classified as a watch. I understand that I have to include right shoes in my training set to solve this problem, but is there any other approach that I should follow?
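One way to get right-shoe samples without collecting new photos might be to mirror the left-shoe images horizontally. This is only a minimal sketch of that idea, assuming the images are already loaded as NumPy arrays (height x width, or height x width x channels); `add_mirrored` is a hypothetical helper name, not part of any library.

```python
import numpy as np

def add_mirrored(images, labels):
    """Return the originals plus a horizontally flipped copy of each image."""
    flipped = [np.fliplr(img) for img in images]  # reverses the width axis
    return list(images) + flipped, list(labels) + list(labels)

# Tiny 2x3 array standing in for a left-shoe image.
left_shoe = np.array([[1, 2, 3],
                      [4, 5, 6]])
aug_images, aug_labels = add_mirrored([left_shoe], ["shoe"])
```

Mirroring also reverses any logo or text in the image, so I am not sure it is appropriate for the brand attribute.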
Also, is there any way to understand the shape? For example, if I have trained the classifier for watches, and the training set contains watches with both circular and rectangular dials, can I identify the dial shape in a new test image? Or do I simply have to train it separately for watches with circular and rectangular dials? Thanks