TensorFlow - keypoint detection yields a heatmap of zeros

Question

I'm using a detection-based CNN for hand pose estimation(finding hand joints in a depth image of a single hand). My plan was to first use an FCN to find the 2D coordinates of all the 16 key points. The backbone is ResNet-50-FPN and the computation graph can be seen here. The structure of res2a~res5c is shown here.

As I train this model with ICVL hand posture dataset, the output feature maps converged as total black images where all pixel values are nearly zero. Ground truths are depth maps and heatmaps like this. If I add a sigmoid activation function after the last convolutional layer(as the image shows), the output heatmap would resemble white noise. Anyway, the detection FCN was totally useless, while the loss won't drop at all.

My CNN model can be briefly demonstrated with the code below:

heat_chain = TensorChain(image_tensor) \
            .convolution_layer_2d(3, 16, 1, 'conv1') \
            .batch_normalization() \
            .relu('relu1') \
            .max_pooling_layer_2d(2, 'pool1') \
            .bottleneck_2d(64, 256, 'res2a') \
            .bottleneck_2d(64, 256, 'res2b') \
            .bottleneck_2d(64, 256, 'res2c') \
            .branch_identity_mapping() \
            .bottleneck_2d(128, 512, 'res3a', stride=2) \
            .bottleneck_2d(128, 512, 'res3b') \
            .bottleneck_2d(128, 512, 'res3c') \
            .bottleneck_2d(128, 512, 'res3d') \
            .branch_identity_mapping() \
            .bottleneck_2d(256, 1024, 'res4a', stride=2) \
            .bottleneck_2d(256, 1024, 'res4b') \
            .bottleneck_2d(256, 1024, 'res4c') \
            .bottleneck_2d(256, 1024, 'res4d') \
            .bottleneck_2d(256, 1024, 'res4e') \
            .bottleneck_2d(256, 1024, 'res4f') \
            .branch_identity_mapping() \
            .bottleneck_2d(512, 2048, 'res5a', stride=2) \
            .bottleneck_2d(512, 2048, 'res5b') \
            .bottleneck_2d(512, 2048, 'res5c') \
            .upsampling_block_2d(2, [-1, 30, 40, 512], 'upsample1') \
            .merge_identity_mapping_2d('merge1') \
            .upsampling_block_2d(2, [-1, 60, 80, 256], 'upsample2') \
            .merge_identity_mapping_2d('merge2') \
            .upsampling_block_2d(2, [-1, 120, 160, 64], 'upsample3') \
            .merge_identity_mapping_2d('merge3') \
            .upsampling_block_2d(2, [-1, 240, 320, 16], 'upsample4') \
            .convolution_layer_2d(3, 16, 1, 'conv2') \
            .convolution_layer_2d(3, 16, 1, 'conv3')
heatmaps = tf.identity(heat_chain.output_tensor, name='heatmaps')
heat_loss = tf.reduce_mean(
        tf.reduce_sum(tf.pow(heatmaps - heat_ground_truth, 2), axis=[1, 2, 3]), name='heat_loss')

In which branch_identity_mapping() pushes the last tensor into a stack and merge_identity_mapping_2d() pops a stored tensor and adds it to the current tensor(may also match dimensions with a 1x1 convolutional layer).

I'm totally confused about what is wrong. Could my implementation of ResNet-50-FPN be incorrect, or is something important missing?

score 1 · Answer 1 · answered Aug 28 '18 at 02:49

Can you also upload your training codes?

Also, does the ground truth heatmaps have shape of (batch, height, width, 16), and each channel is a gaussian peak around the coordinate of the keypoint? If so, then it is a pose estimation problem.

For now, try these 2 suggestions to start.

Try the model on 1 training image without any regularization and image augmentation. Then apply different scale of learning rate. See if the loss is decreasing and the prediction of training image is similiar to the ground truth heatmap.
Your loss function looks ok, although I would just use : tf.reduce_sum(tf.square(heatmaps - heat_ground_truth, 2) , name='heat_loss')

If these won't help, I would suggest try some pose estimation methods as the FPN is more a object dectection and semantic segmentaion method. Convolutional Pose Machines might be a good paper to start.

Convolutional Pose Machines uses a scaled heatmap as ground truth and network output. After the backbone CNN, VGG in the paper, but I think res-net works too. e.g. After 16 strides of pooling, the shape of heatmap is (batch, height/16, width/16, 16). Then use the same loss function.

xonxt · Answer 2 · 2019-03-21T14:56:12.827

Sorry for resurrecting an old thread, but I have pretty much the same problem.

Except in my case I'm taking almost exactly the same structure that the OpenPose project uses (i.e., VGG plus several branched stages), but with fewer stages, with the depth-image as input and only three keypoints (for my project I only need both hands and face).

The depth-image is the input, and the output is a heatmap which is a scaled down (1, h/8, w/8, 3)-stack with Gaussian peaks in place of keypoints (looks somewhat like this: ).

Loss is calculated same as in post above, with the difference between GT-heatmap and the predicted heatmap.

And yeah, during training, pretty much after the first 2-3 epochs I can see that the output becomes just an empty image full of zeros (and maybe a small peak at the [0;0] location) and stays that way forever (I've only had enough patience for about 1500 epochs).

I'm using Keras for training by the way.

I refered to the loss definition of YOLO and changed my own criterion. See https://i.stack.imgur.com/R3bUQ.png . A Gaussian peak is too hard to learn, so I only left one pixel with value 1. There are two terms regarding the heatmap (or confidence map): a) The confidence loss of the single pixel, as the first term in the circle denotes, and b) The confidence loss of areas where no target is (for example the pixels in your current gt image with values lower than 0.1). Term _a)_ encourages the model to learn finding targets, while _b)_ restraints it from finding anything irrelevant. — Harper Long, Mar 23 '19 at 02:55

TensorFlow - keypoint detection yields a heatmap of zeros

2 Answers2