From the dropout paper (Srivastava et al., 2014):
"The idea is to use a single neural net at test time without dropout. The weights of this network are scaled-down versions of the trained weights. If a unit is retained with probability p during training, the outgoing weights of that unit are multiplied by p at test time as shown in Figure 2. This ensures that for any hidden unit the expected output (under the distribution used to drop units at training time) is the same as the actual output at test time."
Why do we want to preserve the expected output? If we use ReLU activations, linearly scaling the weights or activations just scales the network outputs linearly, which has no effect on classification accuracy (the argmax of the outputs is unchanged).
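This is the point I am making, as a toy numpy check (again with arbitrary weights of my own, relying on ReLU being positively homogeneous):

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))
W2 = rng.normal(size=(3, 2))
x  = rng.normal(size=(1, 4))
relu = lambda z: np.maximum(z, 0.0)

c = 0.5  # any positive scaling constant
logits        = relu(x @ W1) @ W2
logits_scaled = relu(x @ (c * W1)) @ (c * W2)

print(np.allclose(logits_scaled, (c ** 2) * logits))  # True: outputs are just scaled by c**2
print(np.argmax(logits), np.argmax(logits_scaled))    # same predicted class either way
```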
What am I missing?