
From the dropout paper:

"The idea is to use a single neural net at test time without dropout. The weights of this network are scaled-down versions of the trained weights. If a unit is retained with probability p during training, the outgoing weights of that unit are multiplied by p at test time as shown in Figure 2. This ensures that for any hidden unit the expected output (under the distribution used to drop units at training time) is the same as the actual output at test time."

Why do we want to preserve the expected output? If we use ReLU activations, linear scaling of weights or activations results in linear scaling of network outputs and does not have any effect on the classification accuracy.
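
For concreteness, here is a minimal numpy sketch of that claim (my own illustration, assuming a bias-free two-layer ReLU network): scaling the first layer's weights by a positive constant only rescales the logits and leaves the predicted class unchanged.

```python
# Sketch (hypothetical network, not from the paper): in a bias-free ReLU
# network, scaling weights by a positive constant rescales the logits and
# leaves the argmax (predicted class) unchanged.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 10))      # 5 inputs with 10 features
W1 = rng.normal(size=(10, 32))    # hidden-layer weights (no bias)
W2 = rng.normal(size=(32, 3))     # output-layer weights (no bias)

def forward(x, W1, W2, scale=1.0):
    h = np.maximum(0.0, x @ (scale * W1))   # ReLU hidden layer
    return h @ W2                           # logits

logits = forward(x, W1, W2)
scaled = forward(x, W1, W2, scale=0.5)      # e.g. a keep probability p = 0.5

print(np.allclose(scaled, 0.5 * logits))            # True: outputs scale linearly
print((logits.argmax(1) == scaled.argmax(1)).all()) # True: argmax is unchanged
```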

What am I missing?

MichaelSB
    The answer is here: https://zhang-yang.medium.com/scaling-in-neural-network-dropout-layers-with-pytorch-code-example-11436098d426#:~:text=Because%20dropout%20is%20active%20only,dropped%20(set%20to%200). – Amir Jalilifard Jul 06 '23 at 12:32

1 Answer


To be precise, we want to preserve not the "expected output" but the expected value of the output: we make up for the difference between the training phase (where some units are dropped and pass no value) and the test phase (where all units are active) by keeping the mean (expected) value of each unit's output the same.
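
As a quick numerical check (a sketch of my own, not from the dropout paper): with keep probability p, the expected contribution of a unit to the next layer during training is p times its activation, which is exactly what multiplying the outgoing weights by p at test time reproduces.

```python
# Monte Carlo check of the expectation argument (illustrative values).
import numpy as np

rng = np.random.default_rng(0)
p = 0.8                            # probability of retaining a unit
a = rng.normal(size=100)           # activations of a hidden layer
w = rng.normal(size=100)           # outgoing weights of those units

# Training time: average the contribution over many random dropout masks.
masks = rng.random(size=(100_000, 100)) < p
train_samples = (masks * a) @ w    # one dropped-out pre-activation per mask

# Test time as in the paper: keep every unit, scale the weights by p.
test_out = a @ (p * w)

print(train_samples.mean(), test_out)   # the two values are close
```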

In the case of ReLU activations this scaling indeed leads to a linear scaling of the outputs (when they are positive), but why do you think it doesn't affect the final accuracy of a classification model? At the end of the network we usually apply either a softmax or a sigmoid, which are non-linear and therefore depend on this scaling.
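
To illustrate this last point (my own sketch): sigmoid and softmax are not scale-invariant, so the probabilities they produce change when their input is rescaled, even when the ranking of the logits does not.

```python
# Scaling the logits changes the predicted probabilities.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

logit = 1.5
print(sigmoid(logit), sigmoid(0.5 * logit))   # ~0.82 vs ~0.68

logits = np.array([2.0, 1.0, 0.5])
print(softmax(logits))                        # peaked distribution
print(softmax(0.5 * logits))                  # flatter: probabilities move toward uniform
```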

Mikhail Berlinkov
  • Why would passing scaled output through softmax or sigmoid affect classification accuracy? The largest output would still remain the largest, right? – MichaelSB Dec 09 '18 at 23:27
  • Well, for instance, with a sigmoid there can be values that, without scaling, would be classified as 0 instead of 1. The same can happen with softmax. Only if it is the last layer, you take the class with the maximum softmax probability, and you only consider accuracy/precision/recall (and not, e.g., roc_auc_score, which is sensitive to the predicted probabilities) would the accuracy be the same. – Mikhail Berlinkov Dec 10 '18 at 00:34
  • Are you talking about applying a hard threshold after sigmoid, e.g. binary classifier: class A if sigmoid(y) > 0.5, class B otherwise? – MichaelSB Dec 10 '18 at 01:41
  • Yes, and also if you don't apply a threshold and instead measure a score based on the predicted probabilities (e.g. roc_auc_score), then this scaling also affects the score. – Mikhail Berlinkov Dec 10 '18 at 02:05
  • Ok, that makes sense. Do you see any other scenarios where the scaling might be necessary, for example, batch normalization? – MichaelSB Dec 10 '18 at 02:53
  • Scaling is used in batch normalization if this is what you mean. – Mikhail Berlinkov Dec 10 '18 at 12:57
  • What I meant is: would omitting the scaling of a dropout layer's outputs affect batch normalization? – MichaelSB Dec 10 '18 at 21:15
  • Batch normalization tries to ensure that nodes have mean 0 and variance 1, so scaling the incoming weights of a layer with batch normalization shouldn't affect it; that is, it would affect the batch-normalization parameters but not the activations of the nodes in that layer. – Mikhail Berlinkov Dec 10 '18 at 21:28
  • I just tested this with a small convnet with ReLU in all layers and softmax applied to the outputs. I scaled the first-layer pre-activations by 100 during test time only; the classification accuracy was not affected. Same for scaling the first-layer weights. I also tested this with batch normalization: if I rely on the train-set statistics during test, then yes, the accuracy is affected; however, if I recompute the statistics at test time, then scaling either the pre-activations or the weights in the first layer has no effect (a sketch of this batch-norm behavior is below). – MichaelSB May 05 '19 at 01:46
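
For reference, a small numpy sketch of the batch-norm behavior discussed in the last few comments (my own illustration, with gamma = 1 and beta = 0): if the normalization statistics are recomputed on the scaled inputs, batch normalization cancels a positive rescaling, whereas frozen training-time statistics do not.

```python
# Batch norm vs. rescaled inputs (illustrative, simplified batch norm).
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=3.0, size=(256, 8))  # pre-activations of a layer
scale = 100.0                                      # test-time rescaling factor

def batch_norm(x, mean, var, eps=1e-5):
    return (x - mean) / np.sqrt(var + eps)         # gamma = 1, beta = 0

baseline = batch_norm(x, x.mean(0), x.var(0))

# Statistics recomputed on the scaled data: the rescaling cancels out.
fresh = batch_norm(scale * x, (scale * x).mean(0), (scale * x).var(0))
print(np.allclose(fresh, baseline, atol=1e-4))     # True

# Frozen "training" statistics applied to scaled data: outputs change.
frozen = batch_norm(scale * x, x.mean(0), x.var(0))
print(np.allclose(frozen, baseline))               # False
```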