I'm using a detection-based CNN for hand pose estimation (finding hand joints in a depth image of a single hand). My plan was to first use an FCN to find the 2D coordinates of all 16 keypoints. The backbone is ResNet-50-FPN, and the computation graph can be seen here. The structure of res2a~res5c is shown here.
When I train this model on the ICVL hand posture dataset, the output feature maps converge to totally black images where all pixel values are nearly zero. The ground truths are depth maps and heatmaps like this. If I add a sigmoid activation function after the last convolutional layer (as the image shows), the output heatmap instead resembles white noise. Either way, the detection FCN is totally useless, and the loss won't drop at all.
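For context, my ground-truth heatmaps place a small Gaussian peak at each keypoint, so the vast majority of pixels are near zero. A minimal NumPy sketch of this kind of target (the resolution, sigma, and keypoint coordinates here are illustrative, not my exact pipeline):

```python
import numpy as np

def gaussian_heatmap(height, width, cx, cy, sigma=2.0):
    """Render one keypoint at (cx, cy) as a 2D Gaussian peak in [0, 1]."""
    ys, xs = np.mgrid[0:height, 0:width]
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))

# One heatmap channel per keypoint (16 in my case), e.g. at 240x320:
hm = gaussian_heatmap(240, 320, cx=100, cy=120)
# hm[120, 100] == 1.0 at the peak; almost all other pixels are ~0.
```

Note that with targets like this, nearly all ground-truth pixels are zero, which is why I suspect the network may be minimizing the loss simply by outputting all-black maps.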
My CNN model is summarized by the code below:
heat_chain = TensorChain(image_tensor) \
.convolution_layer_2d(3, 16, 1, 'conv1') \
.batch_normalization() \
.relu('relu1') \
.max_pooling_layer_2d(2, 'pool1') \
.bottleneck_2d(64, 256, 'res2a') \
.bottleneck_2d(64, 256, 'res2b') \
.bottleneck_2d(64, 256, 'res2c') \
.branch_identity_mapping() \
.bottleneck_2d(128, 512, 'res3a', stride=2) \
.bottleneck_2d(128, 512, 'res3b') \
.bottleneck_2d(128, 512, 'res3c') \
.bottleneck_2d(128, 512, 'res3d') \
.branch_identity_mapping() \
.bottleneck_2d(256, 1024, 'res4a', stride=2) \
.bottleneck_2d(256, 1024, 'res4b') \
.bottleneck_2d(256, 1024, 'res4c') \
.bottleneck_2d(256, 1024, 'res4d') \
.bottleneck_2d(256, 1024, 'res4e') \
.bottleneck_2d(256, 1024, 'res4f') \
.branch_identity_mapping() \
.bottleneck_2d(512, 2048, 'res5a', stride=2) \
.bottleneck_2d(512, 2048, 'res5b') \
.bottleneck_2d(512, 2048, 'res5c') \
.upsampling_block_2d(2, [-1, 30, 40, 512], 'upsample1') \
.merge_identity_mapping_2d('merge1') \
.upsampling_block_2d(2, [-1, 60, 80, 256], 'upsample2') \
.merge_identity_mapping_2d('merge2') \
.upsampling_block_2d(2, [-1, 120, 160, 64], 'upsample3') \
.merge_identity_mapping_2d('merge3') \
.upsampling_block_2d(2, [-1, 240, 320, 16], 'upsample4') \
.convolution_layer_2d(3, 16, 1, 'conv2') \
.convolution_layer_2d(3, 16, 1, 'conv3')
heatmaps = tf.identity(heat_chain.output_tensor, name='heatmaps')
heat_loss = tf.reduce_mean(
tf.reduce_sum(tf.pow(heatmaps - heat_ground_truth, 2), axis=[1, 2, 3]), name='heat_loss')
Here, branch_identity_mapping() pushes the last tensor onto a stack, and merge_identity_mapping_2d() pops a stored tensor and adds it to the current tensor (matching dimensions with a 1x1 convolutional layer when necessary).
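To make the branch/merge mechanism concrete, here is a toy sketch of what those two calls do (illustrative only, not my actual TensorChain code; NumPy arrays stand in for feature maps):

```python
import numpy as np

class SkipStack:
    """Toy model of branch/merge: branch pushes the current feature map,
    merge pops the most recent one and adds it element-wise (FPN-style)."""
    def __init__(self):
        self.stack = []

    def branch(self, tensor):
        self.stack.append(tensor)   # remember this feature map for a later merge
        return tensor

    def merge(self, tensor):
        skip = self.stack.pop()     # most recently branched feature map
        # In the real model, skip would first go through a 1x1 convolution
        # if its channel count differs from the current tensor's.
        return tensor + skip        # element-wise addition

# Toy usage:
skips = SkipStack()
x = np.ones((1, 4, 4, 8))           # feature map at the branch point
skips.branch(x)
y = np.full((1, 4, 4, 8), 2.0)      # upsampled feature map of the same shape
merged = skips.merge(y)             # every element is 1.0 + 2.0 = 3.0
```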
I'm totally confused about what is wrong. Could my implementation of ResNet-50-FPN be incorrect, or is something important missing?