This is one of the most interesting concepts that I came across while learning neural networks. Here is how I understood it:
The input $Z_l$ to a layer can be written as the product of that layer's weight matrix and the vector of outputs of the nodes in the previous layer. Thus $Z_l = W_l A_{l-1}$, where $Z_l$ is the input to layer $l$. Now $A_l = F(Z_l)$, where $F$ is the activation function of layer $l$. If the activation function is linear, then $A_l$ is simply a constant factor $k$ times $Z_l$. Substituting this back layer by layer (the constant factors can be absorbed into the weight matrices), we can write $Z_l$ as:

$$Z_l = W_l W_{l-1} W_{l-2} \cdots W_1 X,$$
where $X$ is the input vector. So the output $Y$ is ultimately just a product of a few matrices applied to the input vector of a particular data instance. That chain of weight matrices can always be collapsed into a single resultant matrix, so the output $Y$ can be written as $W^{\top} X$. This is nothing but the linear equation we come across in linear regression.
Therefore, if all the layers have linear activations, the output is only a linear combination of the inputs and can be written using a simple linear equation.
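
As a quick sanity check, here is a minimal NumPy sketch of this idea (the layer sizes, random weights, and omission of bias terms are arbitrary choices for illustration): a forward pass through several linear layers gives exactly the same output as a single collapsed weight matrix.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 3-layer network with linear activations, i.e. F(z) = z.
# Layer sizes are made up purely for illustration.
W1 = rng.standard_normal((5, 4))  # layer 1: 4 inputs -> 5 units
W2 = rng.standard_normal((3, 5))  # layer 2: 5 -> 3
W3 = rng.standard_normal((1, 3))  # layer 3: 3 -> 1 output

x = rng.standard_normal(4)        # one input instance X

# Forward pass with linear activations: A_l = Z_l = W_l A_{l-1}
a1 = W1 @ x
a2 = W2 @ a1
y_network = W3 @ a2

# Collapse the stack into one resultant matrix W = W3 W2 W1
W = W3 @ W2 @ W1
y_collapsed = W @ x

print(np.allclose(y_network, y_collapsed))  # True: the whole network is one linear map
```

Adding bias terms would not change the conclusion; the network would then collapse to a single affine map $Y = WX + b$, which is still just linear regression.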