I have recently implemented recognition software following the technique described in this paper. However, my dataset also contains depth maps captured with OpenNI.
I'd like to increase the robustness of the recognizer using the depth information. I thought about training one-vs-all SVMs on BoW response histograms computed from VFH descriptors (I adapted OpenCV's DescriptorExtractor interface for this task). But the point is: how can I combine the two approaches to get more accurate results? Can someone suggest a strategy for this?
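For reference, the depth pipeline I describe above (VFH descriptors → BoW histogram → one-vs-all SVMs) looks roughly like this. This is a simplified NumPy sketch, not my actual OpenCV/PCL code; the codebook and the linear SVM weights `W` / biases `b` are placeholders for whatever training actually produces:

```python
import numpy as np

def bow_histogram(descriptors, codebook):
    """Quantize each descriptor (e.g. a 308-dim VFH signature) to its
    nearest codebook centre and return a normalized BoW histogram."""
    # Squared Euclidean distance from every descriptor to every centre
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    words = d2.argmin(axis=1)
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    return hist / hist.sum()

def ova_scores(hist, W, b):
    """Decision scores of one-vs-all *linear* SVMs: one row of W and one
    bias per class; the predicted class is the argmax of the scores."""
    return W @ hist + b
```

The open question is then what to do with the two score vectors (appearance-based and depth-based) that come out of the two recognizers.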
P.S. I would very much like to test the recognizer by showing objects directly to a Kinect, rather than feeding cropped images to it as I'm doing right now.