
I am working through http://neuralnetworksanddeeplearning.com/chap3.html.

It says the cross-entropy cost function can speed up learning, because the $\sigma'(z)$ term cancels out in the last layer.

The partial derivative with respect to a weight in the last layer $L$ is

$$\frac{\partial C}{\partial w^L} = a^{L-1}(a^L - y),$$

so there is no $\sigma'(z)$ factor.
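
If I follow the chapter correctly, the cancellation happens because the cost's derivative with respect to the output activation exactly divides out $\sigma'(z^L)$. A sketch, assuming a single sigmoid output neuron $a^L = \sigma(z^L)$ and the cross-entropy cost from the chapter:

$$C = -\bigl[\,y \ln a^L + (1-y)\ln(1-a^L)\,\bigr],
\qquad
\frac{\partial C}{\partial a^L} = \frac{a^L - y}{a^L(1-a^L)},$$

$$\frac{\partial C}{\partial z^L}
= \frac{\partial C}{\partial a^L}\,\sigma'(z^L)
= \frac{a^L - y}{a^L(1-a^L)}\,a^L(1-a^L)
= a^L - y,$$

$$\frac{\partial C}{\partial w^L}
= a^{L-1}\,\frac{\partial C}{\partial z^L}
= a^{L-1}(a^L - y).$$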

But I want to know whether cross-entropy also speeds up learning in the hidden layers, so I calculated the partial derivative with respect to a weight in layer $L-1$:

$$\frac{\partial C}{\partial w^{L-1}}
= (a^L - y)\, w^L\, a^{L-1}(1-a^{L-1})\, a^{L-2}
= (a^L - y)\, w^L\, \sigma'(z^{L-1})\, a^{L-2}.$$

It seems cross-entropy does not speed up learning in layer $L-1$, because $\sigma'(z^{L-1})$ still appears. It can be close to zero, which makes the partial derivative close to zero and makes learning slow.
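
To check my algebra I wrote a tiny numerical sketch (my own toy example, not code from the book): a 1-1-1 network of scalar weights with sigmoid activations and the cross-entropy cost, comparing the analytic gradients above against finite differences.

```python
import numpy as np

# Toy 1-1-1 network of scalars: x -> sigmoid(w1*x + b1) -> sigmoid(w2*a1 + b2),
# with cross-entropy cost C = -[y*ln(a2) + (1-y)*ln(1-a2)].
# My own toy example for checking the algebra above, not code from the book.

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, y, w1, b1, w2, b2):
    z1 = w1 * x + b1
    a1 = sigmoid(z1)
    z2 = w2 * a1 + b2
    a2 = sigmoid(z2)
    cost = -(y * np.log(a2) + (1 - y) * np.log(1 - a2))
    return a1, a2, cost

x, y = 1.0, 1.0
w1, b1, w2, b2 = 0.5, 0.1, -0.3, 0.2
a1, a2, _ = forward(x, y, w1, b1, w2, b2)

# Analytic gradients, matching the formulas above:
# dC/dw2 = a1 * (a2 - y)                          -- no sigma'(z2) factor
# dC/dw1 = (a2 - y) * w2 * sigma'(z1) * x,  with sigma'(z1) = a1*(1-a1)
dC_dw2 = a1 * (a2 - y)
dC_dw1 = (a2 - y) * w2 * a1 * (1 - a1) * x

# Finite-difference check of both gradients (index 2 of forward() is the cost).
eps = 1e-6
num_dw2 = (forward(x, y, w1, b1, w2 + eps, b2)[2] -
           forward(x, y, w1, b1, w2 - eps, b2)[2]) / (2 * eps)
num_dw1 = (forward(x, y, w1 + eps, b1, w2, b2)[2] -
           forward(x, y, w1 - eps, b1, w2, b2)[2]) / (2 * eps)

print("dC/dw2: analytic %.6f, numeric %.6f" % (dC_dw2, num_dw2))
print("dC/dw1: analytic %.6f, numeric %.6f" % (dC_dw1, num_dw1))
print("sigma'(z1) = %.6f" % (a1 * (1 - a1)))
```

When I run it, the analytic and numeric values agree, and the $\sigma'(z^{L-1}) = a^{L-1}(1-a^{L-1})$ factor (at most 0.25, and tiny when the hidden neuron saturates) shows up explicitly in the dC/dw1 line, which is what makes me think the speed-up only applies to the output layer.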

Could someone tell me where I went wrong? Thanks.

Joey
  • No, the book is correct: CE doesn't "speed up" any layer except the last (usually Softmax). The other non-linearities (like ReLU) won't be "sped up". But with ReLU non-linearities, all you need is a proper initialization of the linear layers (Kaiming He Normal Init) in order for learning to remain fast. You've just rediscovered why optimization of deep neural nets sucks with sigmoids and rocks with properly-initialized ReLUs. – Iwillnotexist Idonotexist Mar 05 '17 at 05:04
  • Thanks for your help, I understand now. – Joey Mar 06 '17 at 04:51

0 Answers