I am working through http://neuralnetworksanddeeplearning.com/chap3.html.
It says the cross-entropy cost function can speed up learning, because the $\sigma'(z)$ term cancels in the last layer.
The partial derivative for a weight in the last layer $L$ is

$$\frac{\partial C}{\partial w^L} = a^{L-1}(a^L - y),$$

so there is no $\sigma'(z^L)$ factor.
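To convince myself of the last-layer formula, I wrote a small numeric check. It is only a sketch with made-up values (`a_prev`, `w`, `b`, `y` are hypothetical), comparing the analytic gradient $a^{L-1}(a^L - y)$ against a central finite difference of the cross-entropy cost:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(a, y):
    return -(y * np.log(a) + (1 - y) * np.log(1 - a))

# Hypothetical values for a single sigmoid output neuron in layer L.
a_prev = 0.8              # a^{L-1}, the activation feeding the neuron
w, b, y = 2.0, -1.0, 1.0

def cost(w):
    a = sigmoid(w * a_prev + b)   # a^L
    return cross_entropy(a, y)

a = sigmoid(w * a_prev + b)
analytic = a_prev * (a - y)       # a^{L-1}(a^L - y): sigma'(z^L) has cancelled

eps = 1e-6
numeric = (cost(w + eps) - cost(w - eps)) / (2 * eps)

print(analytic, numeric)          # the two values should agree closely
```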
But I want to know whether cross-entropy also speeds up learning in the hidden layers, so I calculated the partial derivative for a weight in layer $L-1$:
$$\frac{\partial C}{\partial w^{L-1}} = (a^L - y)\, w^L\, a^{L-1}(1 - a^{L-1})\, a^{L-2} = (a^L - y)\, w^L\, \sigma'(z^{L-1})\, a^{L-2}$$
It seems cross-entropy does not speed up learning in layer $L-1$, because $\sigma'(z^{L-1})$ still appears: it can be close to zero, which makes the partial derivative close to zero and learning slow.
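To see this concretely, here is the same kind of sketch for a weight in layer $L-1$ (again with hypothetical values; `w1` stands for $w^{L-1}$ in a one-neuron-per-layer chain). It checks my chain-rule expression against a finite difference and shows the gradient shrinking as the hidden neuron saturates:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# Hypothetical chain: a^{L-2} -> (w1) -> hidden neuron -> (w2) -> output
a_in, w2, b1, b2, y = 1.0, 1.5, 0.0, 0.0, 1.0

def cost(w1):
    a1 = sigmoid(w1 * a_in + b1)          # a^{L-1}
    a2 = sigmoid(w2 * a1 + b2)            # a^L
    return -(y * np.log(a2) + (1 - y) * np.log(1 - a2))

def grad_w1(w1):
    z1 = w1 * a_in + b1
    a1 = sigmoid(z1)
    a2 = sigmoid(w2 * a1 + b2)
    # (a^L - y) * w^L * sigma'(z^{L-1}) * a^{L-2}
    return (a2 - y) * w2 * sigmoid_prime(z1) * a_in

eps = 1e-6
for w1 in [0.5, 5.0, 10.0]:               # larger w1 -> more saturated hidden neuron
    numeric = (cost(w1 + eps) - cost(w1 - eps)) / (2 * eps)
    print(f"w1={w1:5.1f}  sigma'(z1)={sigmoid_prime(w1 * a_in + b1):.2e}  "
          f"analytic={grad_w1(w1):.2e}  numeric={numeric:.2e}")
```

The analytic and numeric columns should match, and the gradient collapses as $\sigma'(z^{L-1})$ goes to zero, which is exactly the slowdown I am asking about.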
Could someone tell me where I went wrong? Thanks.