PoseCNN (PoseCNN: A Convolutional Neural Network for 6D Object Pose Estimation in Cluttered Scenes) uses a CNN as its backbone network. CNNs can be trained on 2D RGB data. How can we train PoseCNN for 3D object detection using 3D (RGB-D-like) datasets such as LINEMOD, Occlusion LINEMOD, and YCB-Video (these datasets contain 3D models)? I don't really understand it.

https://github.com/yuxng/PoseCNN

Similarly, PVNet (PVNet: Pixel-wise Voting Network for 6DoF Pose Estimation) uses the same datasets to train its network.

https://github.com/zju3dv/pvnet

They basically take a semantic segmentation model and modify it further for object pose estimation. But CNN-based classification and semantic segmentation models are normally trained on RGB datasets, whereas for object pose estimation they use 3D datasets such as LINEMOD, Occlusion LINEMOD, and YCB-Video, which contain 3D models of the objects.
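
To make this concrete, below is a minimal sketch (my own, not the authors' code; the backbone choice, channel sizes, class count, and keypoint count are all assumptions) of how an RGB-only segmentation backbone can simply be given extra output heads for pose estimation. The input is still a plain RGB image; the 3D models in these datasets are only used to produce the labels (per-pixel masks, keypoints, ground-truth poses).

```python
# Minimal sketch: a segmentation-style RGB backbone with extra heads,
# roughly in the spirit of PVNet-style per-pixel keypoint voting.
# Everything here (backbone, channel sizes, head shapes) is illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision

class SegPoseNet(nn.Module):
    def __init__(self, num_classes: int, num_keypoints: int = 8):
        super().__init__()
        # ImageNet-pretrained RGB backbone (transfer learning, as the papers do).
        resnet = torchvision.models.resnet18(pretrained=True)
        self.encoder = nn.Sequential(*list(resnet.children())[:-2])  # drop avgpool/fc
        # Head 1: per-pixel semantic segmentation (which object a pixel belongs to).
        self.seg_head = nn.Conv2d(512, num_classes, kernel_size=1)
        # Head 2: per-pixel 2D direction vectors towards each object keypoint.
        self.vote_head = nn.Conv2d(512, num_keypoints * 2, kernel_size=1)

    def forward(self, rgb: torch.Tensor):
        feat = self.encoder(rgb)          # (B, 512, H/32, W/32)
        seg = self.seg_head(feat)         # class logits per pixel
        votes = self.vote_head(feat)      # keypoint direction field per pixel
        # Upsample so the outputs align with the input pixels / labels.
        seg = F.interpolate(seg, size=rgb.shape[-2:], mode="bilinear", align_corners=False)
        votes = F.interpolate(votes, size=rgb.shape[-2:], mode="bilinear", align_corners=False)
        return seg, votes

# The training loss would compare `seg` against object masks and `votes` against
# keypoint directions, both derived from the 3D models and annotated poses in
# LINEMOD / YCB-Video; the network itself never takes 3D input.
```

Roughly speaking, a PoseCNN-style variant would instead add heads that regress a translation and a rotation (quaternion) per object, but the overall structure, an RGB backbone plus task-specific output heads, is the same.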

Note: These papers use a single, plain RGB camera.

What specifically do we need to do to create such 3D datasets?

ML Dev
  • You may need to reshape your data. Add some code if you want more specific help. – nosbor Dec 17 '20 at 09:09
  • @nosbor, I put the code links there. The code is too long and complicated for me; I can't understand it. That's why I asked this question. If I get the idea, then I may understand the code as well. Thanks – ML Dev Dec 17 '20 at 09:52
  • Your question is quite general. As I wrote, you can reshape your 3D data to 2D using the `reshape` method and train the network on the reshaped data. For example, if you have the following 3D data: `asd = np.random.rand( 10, 10, 10 )`, then you can reshape it like this: `asd2d = asd.reshape( 10, 100 )`. The reshape method is available in all DNN frameworks such as Keras, TensorFlow or PyTorch. I can't tell you whether the new shape makes sense in your use case, as I don't know the packages you are referring to and I don't know what you want to achieve. – nosbor Dec 17 '20 at 11:16
  • I find it hard to understand what the actual question is. Convolution is not limited to 2D inputs like images. There are also 1D or 3D CNNs, and in principle you can do convolution over any number of dimensions. – xdurch0 Dec 17 '20 at 12:49
  • @nosbor, if I need to convert 3D data to 2D data, then why do they use 3D models in their datasets? – ML Dev Dec 18 '20 at 16:28
  • @xdurch0, in this case they use Conv2D, i.e. the input is RGB data. – ML Dev Dec 18 '20 at 17:00
  • This is for robot manipulation using an RGB camera (no depth/stereo cameras). The above papers use a CNN trained on ImageNet, like VGG16 or ResNet (transfer learning), and then turn it into a semantic segmentation network. Put simply, they use semantic segmentation models like FCN trained on COCO, Pascal VOC, etc. Then they add 3D rotation & translation regression, or PnP / RANSAC+PnP, for 6D object pose estimation, which is basically detecting objects in 3D, i.e. finding 3D object bounding boxes. – ML Dev Dec 18 '20 at 17:08
  • Objects in the real world are 3D and camera frames are 2D, so they find 2D-3D correspondences, calculate the depth or distance of objects from the camera, etc. That's why they train the network on 3D datasets like LINEMOD. How do you train such a model on a 3D dataset? As we can see, they transform FCN-like networks trained on RGB data into a network that is trained on these 3D datasets to find 6D object poses in the real world (the PnP step is sketched below). I can't explain it better than this. Thanks – ML Dev Dec 18 '20 at 17:12
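
As a follow-up to the RANSAC+PnP step mentioned in the comments above, here is a minimal, self-contained sketch (my own; the camera intrinsics, the box-corner keypoints, and the simulated detections are all assumptions made for illustration) of recovering a 6D pose from 2D-3D correspondences with OpenCV:

```python
# Minimal sketch of the RANSAC + PnP step: given 2D keypoints in the RGB image
# and the matching 3D points on the object's model, recover the object's
# rotation and translation in the camera frame.
import numpy as np
import cv2

# Pinhole camera intrinsics (fx, fy, cx, cy are placeholder values).
K = np.array([[572.4, 0.0, 325.3],
              [0.0, 573.6, 242.0],
              [0.0, 0.0, 1.0]], dtype=np.float64)
dist = np.zeros(5)  # assume no lens distortion

# 3D keypoints in the object's model frame, e.g. the 8 corners of its bounding box.
object_points = np.array([[x, y, z]
                          for x in (-0.05, 0.05)
                          for y in (-0.05, 0.05)
                          for z in (-0.05, 0.05)], dtype=np.float64)

# Simulate a ground-truth pose and project the model points into the image,
# standing in for the 2D keypoints a network like PVNet would predict.
rvec_gt = np.array([0.1, -0.2, 0.3])
tvec_gt = np.array([0.0, 0.0, 0.6])
image_points, _ = cv2.projectPoints(object_points, rvec_gt, tvec_gt, K, dist)
image_points = image_points.reshape(-1, 2)

# RANSAC + PnP: estimate rotation (rvec) and translation (tvec) from 2D-3D matches.
ok, rvec, tvec, inliers = cv2.solvePnPRansac(object_points, image_points, K, dist)
R, _ = cv2.Rodrigues(rvec)  # 3x3 rotation matrix
print("recovered translation (m):", tvec.ravel())
```

In a real pipeline the `image_points` would come from the network's 2D predictions, while the `object_points` come from the 3D model in the dataset, which is where the "3D data" actually enters.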

0 Answers