
I'm trying to define a custom loss function for Caffe using a Python layer, but I can't figure out what the required output is. Let's say the loss function for the layer is defined as L = sum(F(xi, yi)) / batch_size, where L is the loss to be minimized (i.e. top[0]), x is the network output (bottom[0]), y is the ground-truth label (i.e. bottom[1]), and xi, yi are the i-th samples in a batch.

The widely known EuclideanLossLayer example (https://github.com/BVLC/caffe/blob/master/examples/pycaffe/layers/pyloss.py) shows that the backward pass in this case must return bottom[0].diff[i] = dL(x,y)/dxi. Another reference I've found shows the same: Implement Bhattacharyya loss function using python layer Caffe
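For reference, the backward pass in that pyloss.py example looks roughly like the following (a minimal sketch of the linked code, not a verbatim copy); note that it never touches top[0].diff:

    def backward(self, top, propagate_down, bottom):
        # self.diff was filled in forward() with (bottom[0].data - bottom[1].data)
        for i in range(2):
            if not propagate_down[i]:
                continue
            sign = 1 if i == 0 else -1
            # gradient of the loss w.r.t. each bottom blob, averaged over the batch
            bottom[i].diff[...] = sign * self.diff / bottom[i].num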

But in other examples I have seen that it should also be multiplied by top[0].diff. Which is correct: bottom[0].diff[i] = dL/dxi, or bottom[0].diff[i] = dL/dxi * top[0].diff[i]?

Ilya Ovodov

2 Answers


Each loss layer may have a loss_weight indicating the "importance" of this specific loss (in case there are several loss layers in the net). Caffe exposes this weight as top[0].diff, which is to be multiplied into the gradients.
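A minimal sketch of how a Python loss layer could take that into account (assuming self.diff holds dL/dx computed in forward(), as in the pyloss.py example):

    def backward(self, top, propagate_down, bottom):
        if propagate_down[0]:
            # top[0].diff[0] carries the loss_weight assigned to this loss layer
            # (1.0 by default), so scale the local gradient by it.
            bottom[0].diff[...] = self.diff * top[0].diff[0] / bottom[0].num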

Shai
  • Yes. It was discussed in https://stackoverflow.com/a/31132209/5954475. Actually, for a loss layer top[0].diff is constant and equal to loss_weight. So in the simplest case it is 1 and it does not matter whether you multiply by it or not (as in the example above), but the more correct way is to multiply (so the example is incomplete). – Ilya Ovodov Jun 02 '17 at 12:01

Let's back off to basic principles: the purpose of back-propagation is to adjust the layer weights according to the ground-truth feedback. The most basic parts of this include "how far off is my current guess" and "how hard should I yank the change lever?" These are formalized as top.diff and learning_rate, respectively.

At a micro level, the ground truth for each layer is that top feedback, so top.diff is the local avatar of "how far off ...". Thus at some point, you need to include top[0].diff as a primary factor in your adjustment computation.

I know this isn't a complete, direct answer -- but I hope it continues to help even after you solve the immediate problem.

Prune
  • Really, I thought the same, but the examples show the opposite. – Ilya Ovodov Jun 01 '17 at 22:34
  • Okay. I did go to the pyloss source code you referenced, and I saw what you mean there. I don't know whether the difference in this template is critical: the implementations I've seen from this template (closely related) both added the `top[0].diff` factor. – Prune Jun 01 '17 at 22:37