
I'm trying to develop a fully convolutional neural net to estimate the 2D locations of keypoints in images that contain renders of known 3D models. I've read plenty of literature on this subject (human pose estimation, model-based estimation, graph networks for occluded objects with known structure), but no method I've seen so far allows for estimating an arbitrary number of keypoints of different classes in an image. Every method I've seen is trained to output k heatmaps for k keypoint classes, with one keypoint per heatmap. In my case, I'd like to regress k heatmaps for k keypoint classes, with an arbitrary number of (non-overlapping) points per heatmap.

In this toy example, the network would output heatmaps around each visible upper vertex of each shape. The cubes have 4 vertices on top, the extruded pentagons have 2, and the pyramids have just 1. Some points are offscreen or occluded, and I don't want to output heatmaps for occluded points.

[Example images: renders of the toy scenes containing cubes, extruded pentagons, and pyramids]
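For concreteness, here is a minimal sketch of how such multi-peak ground-truth heatmaps could be built, assuming NumPy; the function name, the `(class_id, x, y)` keypoint format, and the `sigma` value are illustrative, not from the original post:

```python
import numpy as np

def make_target_heatmaps(keypoints, num_classes, height, width, sigma=3.0):
    """Build one heatmap per keypoint class; each channel may contain any
    number of Gaussian peaks, one per visible keypoint of that class.

    keypoints: iterable of (class_id, x, y) for visible, on-screen points only.
    """
    ys, xs = np.mgrid[0:height, 0:width]
    heatmaps = np.zeros((num_classes, height, width), dtype=np.float32)
    for cls, x, y in keypoints:
        g = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2.0 * sigma ** 2))
        # Pixel-wise max (rather than sum) keeps values in [0, 1] even when
        # two peaks of the same class sit close together.
        heatmaps[cls] = np.maximum(heatmaps[cls], g)
    return heatmaps
```

Occluded and offscreen points are simply left out of `keypoints`, so no peak is drawn for them.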

The architecture is a 6-6 layer U-Net (as in this paper: https://arxiv.org/pdf/1804.09534.pdf). The ground-truth heatmaps are 2D Gaussians centered on each keypoint. When training the network with a batch size of 5 and an L2 loss, the network learns to never make any estimate at all, just outputting blank images. Datatypes are converted properly, and values are normalized to [0, 1] for the input and [0, 255] for the output. I'm not sure how to solve this. Are there any red flags in my general approach? I'll post code if there's no obvious problem with the setup...
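To see why an all-blank prediction is such a strong attractor under a plain L2 loss (the class-imbalance point hafiz031 makes in the comments below), it helps to count how few pixels the Gaussian peaks actually cover. A rough back-of-the-envelope check, reusing the hypothetical `make_target_heatmaps` sketch above:

```python
import numpy as np

# Ten sigma=3 peaks contribute roughly 10 * pi * sigma^2 ≈ 283 of squared
# target mass, spread over 256 * 256 = 65536 pixels.
rng = np.random.default_rng(0)
pts = [(0, rng.integers(0, 256), rng.integers(0, 256)) for _ in range(10)]
target = make_target_heatmaps(pts, num_classes=1, height=256, width=256)
blank = np.zeros_like(target)
print(((blank - target) ** 2).mean())  # ~0.004: a blank image already scores very well
```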

Will Snyder
  • How are you normalising the output between `0` and `255`? If you're doing something like `sigmoid(x)*255`, there's a chance you get stuck in the null-gradient zone (not sure if that's correct, but since most values in the output have to be `0`, I assume the gradient will be dominated by those, and by the time the optimisation reaches a point where the gradient from the positive outputs (the Gaussians around keypoints) becomes significant, it gets killed by the sigmoid). – Ash Dec 14 '19 at 10:55
  • After reading more fundamentals, I see what you're getting at. After switching to a tanh activation and down-weighting the zeros in the image, I'm getting the performance I wanted. The precise weighting is still an issue, but I think the sigmoid null-gradient problem was the definite culprit. Thanks so much, man! – Will Snyder Dec 20 '19 at 19:24
  • This is what you need: https://arxiv.org/abs/1611.08050; it allows multiple instances of the same keypoint class. – hafiz031 Dec 07 '20 at 21:25
  • @WillSnyder I am also trying to implement the exact same paper. Were you able to do that? – sreagm Dec 26 '20 at 06:40
  • @sreagm No, I think the [application I was investigating](https://github.com/WHSnyder/Brickthrough) wasn't properly suited for that paper, and I gave up trying to get the simpler example to work after abandoning the whole thing. Not to mention it was a bit over my head as a relative newcomer to deep learning, evidenced by the way I haphazardly cobbled together a couple different approaches... – Will Snyder Jan 03 '21 at 18:14
  • You are getting a blank image as the prediction because the area of the heatmaps (foreground) is insignificant compared to the area of the rest of the image (background). Hence, even if the model predicts the whole image as background (blank), it is technically doing a pretty good job, although it is not useful at all. So, to make the model understand where to focus, you should change to a loss function that treats and penalizes foreground and background differently. This might be helpful: https://datascience.stackexchange.com/a/57063/80430 – hafiz031 Jul 22 '21 at 20:40
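Putting the comments together (avoid a saturating `sigmoid(x)*255` head, and down-weight the overwhelming background as Will describes), here is a minimal sketch of a pixel-weighted L2 loss, assuming PyTorch; the function name, the `0.1` foreground threshold, and the `0.01` background weight are illustrative choices, not values from the thread:

```python
import torch

def weighted_mse(pred, target, bg_weight=0.01, fg_threshold=0.1):
    """L2 loss that down-weights background pixels so the few Gaussian
    'hot' pixels around keypoints dominate the gradient.

    pred, target: (batch, k, H, W) tensors with targets in [0, 1].
    """
    weights = torch.where(target > fg_threshold,
                          torch.ones_like(target),
                          torch.full_like(target, bg_weight))
    return (weights * (pred - target) ** 2).mean()
```

Keeping the targets in [0, 1] with a linear or tanh output layer, rather than scaling to 255 through a sigmoid, avoids the saturation issue Ash describes.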

0 Answers