I am working with a Faster R-CNN-type system in which automated focal loss was recently implemented, following https://arxiv.org/pdf/1904.09048.pdf.
Section 3.4 (Regression) of the above-linked paper states:
"We assume that the labels are distributed around the actual correct ground truth by a Gaussian distribution with a variance of σ^2."
and
"However, to correctly compute the cumulative distribution function the variance σ^2 of the task needs to be estimated. [...] training the variable σ^2 like a weight of the network."
I do not have any data for the task variance σ^2, and I do not fully understand how it can be learned without ground-truth values for it.
Should I simply make the variable trainable and assume that the optimizer knows what to do?
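To make the question concrete, here is the kind of thing I had in mind (PyTorch, since my detector code is in PyTorch): store log σ^2 as a trainable parameter and let the loss depend on it, so gradient descent pulls σ^2 toward the empirical spread of the regression errors without any explicit variance labels. The module name and the Gaussian negative-log-likelihood form below are my own simplification for illustration, not the paper's exact regression loss (which, as I understand it, would then feed the learned σ^2 into the CDF computation).

```python
import torch
import torch.nn as nn


class LearnedVarianceRegressionLoss(nn.Module):
    """Sketch: sigma^2 as a trainable weight, with no variance labels.

    This is NOT the paper's exact regression loss; it is a plain Gaussian
    negative log-likelihood, used only to illustrate how a variance can be
    learned implicitly from the regression errors themselves.
    """

    def __init__(self, init_log_var: float = 0.0):
        super().__init__()
        # log(sigma^2) is registered as a parameter, so the optimizer
        # updates it alongside the network weights and it stays positive
        # after exponentiation.
        self.log_var = nn.Parameter(torch.tensor(init_log_var))

    def forward(self, pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        var = torch.exp(self.log_var)
        sq_err = (pred - target) ** 2
        # Gaussian NLL (up to a constant): minimized when var matches the
        # empirical mean squared error, which is how sigma^2 gets "learned"
        # without any explicit data for it.
        return (0.5 * sq_err / var + 0.5 * self.log_var).mean()


# Usage sketch: the loss module's parameter must be given to the optimizer.
# `model` and the random tensors are placeholders for the actual box regressor.
model = nn.Linear(4, 4)
criterion = LearnedVarianceRegressionLoss()
optimizer = torch.optim.SGD(
    list(model.parameters()) + list(criterion.parameters()), lr=1e-3
)

pred = model(torch.randn(8, 4))
target = torch.randn(8, 4)
loss = criterion(pred, target)
loss.backward()   # gradients also reach log_var
optimizer.step()
```

Is this parameter-plus-optimizer approach what the authors mean by "training the variable σ^2 like a weight of the network", or is something more involved required?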