I am working on an image classification problem where I should be able to classify an image as say a watch with a rectangular dial/ a watch with a circular dial/ a shoe etc..
I have looked into Content Based Image Retrieval (using Dense SIFT for feature detection and Bag of Words + SVM for classification) and am currently exploring Convolutional Neural Networks (Unsupervised Feature Learning).
My problem is that the image is a photo taken from a camera and hence contains other elements (not there in training data). For example, my training data for watches with rectangular dials contains only the watch whereas my test image has the watch and a portion of the hand as well or my test image of a shoe has the shoe oriented in a different direction (when compared with the training data for shoes).
How do I address this issue? Is CNN (Unsupervised Feature Learning) the correct approach or should I stick to D-SIFT + BOW + SVM? How do I collect appropriate training data?
Thank You