
Currently I am reading the following paper: "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size".

In section 4.2.3 (Activation function layer), there is the following statement:

The ramifications of the activation function is almost entirely constrained to the training phase, and it has little impact on the computational requirements during inference.

My understanding of the activation function's influence is as follows. An activation function (ReLU, etc.) is applied to each unit of the feature map produced by the convolution. That computation is identical in training mode and in inference mode. Why, then, can we say that it has a big influence on training but little influence on inference?
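
For example, this is how I picture the operation, as a minimal NumPy sketch (the shapes are just illustrative):

    import numpy as np

    # ReLU applied element-wise to a convolutional feature map.
    # This exact computation runs in both training and inference mode.
    def relu(x):
        return np.maximum(x, 0.0)

    feature_map = np.random.randn(1, 64, 28, 28)  # (batch, channels, H, W)
    activated = relu(feature_map)                 # identical op in both modes
    print(activated.shape)                        # (1, 64, 28, 28)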

Can someone please explain this?

1 Answer

That computation is identical in training mode and in inference mode.

You are right: the per-call cost of the activation function is the same. But there is still a big difference between training time and test time:

  • Training involves applying the forward pass for a number of epochs, where each epoch usually covers the whole training dataset. Even for a small dataset such as MNIST (60,000 training images), this amounts to tens of thousands of activation invocations per epoch. The exact runtime impact depends on a number of factors (e.g. GPUs allow a lot of computation in parallel), but in any case it is several orders of magnitude more invocations than at test time, where each input is usually processed exactly once (see the back-of-envelope sketch after this list).

  • On top of that, you shouldn't forget the backward pass, in which the derivative of the activation is applied for the same number of epochs. For some activations the derivative is significantly more expensive, e.g. ELU vs. ReLU: ELU's derivative needs an exponential where ReLU's is a simple comparison (and parametric variants such as PReLU additionally have learnable parameters that must be updated). The second sketch below illustrates this.

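To make the first bullet concrete, here is a back-of-envelope sketch in Python; the epoch count and the test-set size are assumptions for illustration, not numbers from the paper:

    num_train_images = 60_000  # MNIST training set, as above
    num_epochs = 50            # assumed epoch count (illustrative)
    num_test_images = 10_000   # MNIST test set (illustrative)

    # Training: the activation runs in the forward pass and its derivative
    # in the backward pass, for every training image, once per epoch.
    train_calls = num_train_images * num_epochs * 2

    # Inference: each input is processed exactly once, forward only.
    test_calls = num_test_images

    print(train_calls / test_calls)  # -> 600.0, i.e. 600x more calls
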
In the end, you are likely to ignore a 5% slowdown at inference time, because inference with a neural network is blazingly fast anyway. But you might care about the extra minutes or hours of training for a single architecture, especially if you need to do cross-validation or hyper-parameter tuning across a number of models.
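
To illustrate the point about derivatives, here is a minimal NumPy sketch of ReLU vs. ELU (alpha = 1.0 is the common default; timing the two gradient functions, e.g. with timeit, shows ELU's backward pass costs noticeably more):

    import numpy as np

    def relu(x):
        return np.maximum(x, 0.0)

    def relu_grad(x):
        # Derivative of ReLU: a comparison, essentially free.
        return (x > 0).astype(x.dtype)

    def elu(x, alpha=1.0):
        return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

    def elu_grad(x, alpha=1.0):
        # Derivative of ELU: needs another exponential on the negative side.
        return np.where(x > 0, 1.0, alpha * np.exp(x))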
