Can object detection models adapt to channels captured from different points of view?

Question

I have depth and thermal images of the same scene, but taken from slightly different points of view.

I usually compute the rotation/translation matrix between the two views so that I can stack the two images into a (300, 300, 2) array. But can an object detection model like SSD or Faster R-CNN implicitly learn this matrix?
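The stacking step described above can be sketched as follows. This is a minimal NumPy sketch, assuming the calibration reduces to a single 2D rotation about the image centre plus a pixel translation (`angle_rad`, `tx`, `ty` and both function names are hypothetical). A global 2D transform like this is only an approximation when the two cameras see the scene with noticeable parallax; in practice OpenCV's `cv2.warpAffine` would normally do the warp itself.

```python
import numpy as np


def warp_rigid(image, angle_rad, tx, ty):
    """Rotate `image` by angle_rad about its centre, then translate it by (tx, ty)
    pixels. Uses inverse mapping with nearest-neighbour sampling; destination
    pixels whose source falls outside the image are set to 0."""
    h, w = image.shape
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    ys, xs = np.indices((h, w), dtype=np.float64)
    # Undo the translation and move the origin to the image centre.
    x0, y0 = xs - tx - cx, ys - ty - cy
    # Undo the rotation (multiply by the transposed rotation matrix).
    cos_a, sin_a = np.cos(angle_rad), np.sin(angle_rad)
    src_x = np.rint(cos_a * x0 + sin_a * y0 + cx).astype(int)
    src_y = np.rint(-sin_a * x0 + cos_a * y0 + cy).astype(int)
    valid = (src_x >= 0) & (src_x < w) & (src_y >= 0) & (src_y < h)
    out = np.zeros_like(image)
    out[valid] = image[src_y[valid], src_x[valid]]
    return out


def stack_thermal_depth(thermal, depth, angle_rad, tx, ty):
    """Warp the depth image into the thermal view and stack both as channels."""
    depth_in_thermal_view = warp_rigid(depth, angle_rad, tx, ty)
    return np.stack([thermal, depth_in_thermal_view], axis=-1)


# Usage with dummy 300x300 images and made-up calibration values.
thermal = np.zeros((300, 300), dtype=np.float32)
depth = np.ones((300, 300), dtype=np.float32)
x = stack_thermal_depth(thermal, depth, np.deg2rad(2.0), 5, -3)
print(x.shape)  # (300, 300, 2)
```

Inverse mapping (looping over destination pixels and looking up their source) is used so that every output pixel gets exactly one value and no holes appear.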

My label boxes are drawn on the thermal images.

Will the pixels corresponding to the same object in the depth image be used even if they are not at the same position?

Here is an illustration with the SSD model:

I drew only the box coordinate predictions (deltas between the best prior and the real object position), without the corresponding class predictions (5 x 5 x 4 x nb_classes).

My first thought is that if the object in the depth image is not inside the label box (which is drawn on the thermal image), the network will detect two different objects and be penalized for predicting the one in the depth image (because there is no label box there), so it will learn to ignore the depth channel.
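To put a number on the "not inside the label box" intuition, here is a small sketch computing the intersection over union (IoU) between the thermal label box and a hypothetically shifted box for the same object in the depth view. In the SSD paper, a default box (prior) is matched to a ground-truth box when their Jaccard overlap (IoU) exceeds 0.5; the box coordinates and shifts below are made-up illustrative values, not taken from the question.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x_min, y_min, x_max, y_max)."""
    ix = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    iy = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = ix * iy
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)


label_box = (100, 100, 140, 140)  # hypothetical 40 x 40 box drawn on the thermal image
for shift in (0, 5, 10, 20):
    # Same object, shifted by `shift` pixels in both axes in the depth view.
    depth_box = tuple(c + shift for c in label_box)
    print(shift, round(iou(label_box, depth_box), 3))
```

For a 40-pixel box, a 10-pixel shift in both axes already drops the IoU to about 0.39 (900 / 2300), below that 0.5 threshold.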

Am I right? Or is there a way for the network to handle this problem and learn to use the pixels in the depth channel too? (Can another object detection model handle this problem?)

Intuitively, I think the core problem is that convolutions preserve the locations of objects throughout the network, so we can't link a pixel at (x, y) in channel 1 to a pixel at (x + delta, y + delta) in channel 2.
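One way to check this locality intuition is the receptive field. A convolution filter at (x, y) mixes all input channels over its whole kernel window, so after a few layers a single unit already sees both (x, y) in channel 1 and (x + delta, y + delta) in channel 2, as long as delta is small compared with the receptive field. Below is a minimal sketch of the standard receptive-field recurrence, assuming plain (non-dilated) convolutions and pooling; the layer list is a hypothetical example, not SSD's actual backbone. Having both pixels inside the receptive field is necessary but not sufficient: it says nothing about whether training will actually make the network use the depth evidence.

```python
def receptive_field(layers):
    """Side length, in input pixels, of the region seen by one output unit of a
    stack of conv/pool layers given as (kernel_size, stride) pairs.
    Recurrence: r_out = r_in + (kernel_size - 1) * jump_in, jump_out = jump_in * stride."""
    r, jump = 1, 1
    for kernel_size, stride in layers:
        r += (kernel_size - 1) * jump
        jump *= stride
    return r


# A small, hypothetical VGG-like stack: two 3x3 convs and a 2x2 pool, repeated twice.
backbone = [(3, 1), (3, 1), (2, 2), (3, 1), (3, 1), (2, 2)]
print(receptive_field(backbone))  # 16
```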

Thank you for your time.

score 0 · Answer 1 · answered Nov 15 '19 at 09:23

This may work when both inputs are stacked as channels, but to get better results it's good to correct (align) the views before you feed them to any model. The model will not do any correction itself; statistical methods are applied.

answered Nov 15 '19 at 09:23

Suman

What do you mean by "statistic methods are applied"? – antoine Mathu Nov 15 '19 at 09:36

I was saying that the models just use statistics at their core, so we need to align the views from both cameras before we feed them into the SSD. I remember this is done in Intel RealSense cameras, where the RGB and IR cameras are aligned; I have been using these cameras in my project. Here is the reference alignment code: http://docs.ros.org/kinetic/api/librealsense2/html/rs-align_8cpp_source.html Hope this helps :) – Suman Nov 15 '19 at 12:38
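When the two cameras are translated relative to each other, the pixel offset between the depth and thermal images varies with depth (parallax), so a single 2D shift is only exact for distant or planar scenes. The kind of alignment referred to above can be sketched in three steps: deproject each depth pixel to 3D, transform it into the other camera's frame, and project it back to pixels. This is a minimal NumPy sketch, assuming a pinhole model without lens distortion, calibrated intrinsics `K_depth`/`K_thermal`, and depth-to-thermal extrinsics `R`, `t` (e.g. from a stereo calibration); the function name is hypothetical, and this is not the librealsense implementation.

```python
import numpy as np


def register_depth_to_thermal(depth, K_depth, K_thermal, R, t, out_shape):
    """Reproject a depth map into the thermal camera's pixel grid.

    depth:      (h, w) array of depths along the optical axis, 0 = no reading.
    K_depth, K_thermal: 3x3 pinhole intrinsic matrices (no lens distortion).
    R, t:       rotation (3x3) and translation (3,) taking a point from the
                depth camera frame to the thermal frame: X_t = R @ X_d + t.
    out_shape:  (H, W) of the thermal image. Returns depth seen from the thermal
                camera, 0 where no depth pixel projects.
    """
    K_depth, K_thermal = np.asarray(K_depth, float), np.asarray(K_thermal, float)
    R, t = np.asarray(R, float), np.asarray(t, float).reshape(3, 1)
    h, w = depth.shape
    v, u = np.indices((h, w))
    z = depth.astype(float).ravel()
    valid = z > 0
    u, v, z = u.ravel()[valid], v.ravel()[valid], z[valid]

    # 1. Deproject each valid depth pixel to a 3D point in the depth camera frame.
    x = (u - K_depth[0, 2]) * z / K_depth[0, 0]
    y = (v - K_depth[1, 2]) * z / K_depth[1, 1]
    pts = R @ np.stack([x, y, z]) + t  # 2. move the points into the thermal frame
    pts = pts[:, pts[2] > 0]           # keep only points in front of the thermal camera

    # 3. Project into thermal pixel coordinates (nearest pixel).
    ut = np.rint(K_thermal[0, 0] * pts[0] / pts[2] + K_thermal[0, 2]).astype(int)
    vt = np.rint(K_thermal[1, 1] * pts[1] / pts[2] + K_thermal[1, 2]).astype(int)
    H, W = out_shape
    inside = (ut >= 0) & (ut < W) & (vt >= 0) & (vt < H)

    # 4. Z-buffer: if several points hit the same pixel, keep the nearest one.
    out = np.full((H, W), np.inf)
    np.minimum.at(out, (vt[inside], ut[inside]), pts[2][inside])
    out[np.isinf(out)] = 0.0
    return out
```

The result has the thermal image's shape, so it can be stacked with it, e.g. `np.stack([thermal, registered_depth], axis=-1)`, to get the (300, 300, 2) input with the label boxes from the thermal image still valid for both channels.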
