I'm reading the Fast R-CNN paper.
In Section 2.3, under 'SGD hyper-parameters', it says that all layers use a per-layer learning rate of 1 for weights and 2 for biases, and a global learning rate of 0.001.
Is 'per-layer learning rate' the same thing as a 'layer-specific learning rate', i.e. a different learning rate for each layer? If so, I can't understand how a 'per-layer learning rate' and a 'global learning rate' can be applied at the same time.
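The only way I can imagine them working together is if the per-layer values are multipliers applied on top of the global rate, something like this (this is just my guess, the paper doesn't spell it out):

base_lr = 0.001          # global learning rate
weight_lr = 1 * base_lr  # per-layer rate of 1 for weights -> 0.001
bias_lr = 2 * base_lr    # per-layer rate of 2 for biases  -> 0.002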
I also found this example of layer-specific learning rates in PyTorch:
import torch.optim as optim

optimizer = optim.SGD([
    {'params': model.some_layers.parameters()},            # uses the global lr (1e-3)
    {'params': model.other_layers.parameters(), 'lr': 1},  # per-group override (placeholder module names)
], lr=1e-3, momentum=0.9)
Is this the correct approach according to the paper?
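Or, if my multiplier guess above is right, would it look more like the sketch below, where weights and biases go into separate parameter groups with scaled learning rates? (The model here is just a stand-in, and the weight/bias split is my own code, not something the paper describes.)

import torch.nn as nn
import torch.optim as optim

model = nn.Sequential(nn.Conv2d(3, 16, 3), nn.ReLU(), nn.Conv2d(16, 32, 3))  # stand-in model

base_lr = 0.001  # global learning rate from the paper
weights = [p for n, p in model.named_parameters() if not n.endswith('bias')]
biases = [p for n, p in model.named_parameters() if n.endswith('bias')]

optimizer = optim.SGD([
    {'params': weights, 'lr': 1 * base_lr},  # "per-layer learning rate of 1" for weights -> 0.001
    {'params': biases, 'lr': 2 * base_lr},   # "per-layer learning rate of 2" for biases  -> 0.002
], lr=base_lr, momentum=0.9)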
Sorry for my English.