
Background:

If I am not mistaken, when training a network we feed forward, computing sigmoid(sum(W*x)) at every layer; then in back-propagation we calculate the error and the deltas (changes), then we calculate the gradients and update the weights.

Let's say we do not have an activation on one of the layers. How can Keras calculate the gradient? Does it just take the value of sum(W*x) * next_layer_delta * weights to get the delta for the current layer and use this to calculate the gradients?
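
To make that concrete, here is roughly the delta computation I have in mind for one hidden layer (a toy NumPy sketch; names like z, W_next and delta_next are just placeholders):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([0.2, -0.7])              # pre-activation sum(W*x) of this layer
W_next = np.array([[0.5, -0.3],
                   [0.1,  0.8]])       # weights of the layer above
delta_next = np.array([0.05, -0.02])   # deltas coming back from the layer above

# with a sigmoid activation, the delta of this layer uses the sigmoid derivative:
delta = (delta_next @ W_next.T) * sigmoid(z) * (1 - sigmoid(z))

# without an activation, is the delta just (delta_next @ W_next.T) * z,
# or does the derivative term simply go away?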

Code:

I have this code which I wrote to create a word2vec model (skip-gram):

from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
# No activation on this layer -- what does that mean? It makes the layer linear,
# since no non-linear function such as tanh is applied to its output.
model.add(Dense(2, input_dim=len(tokens_enc)))
model.add(Dense(len(tokens_enc), activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='sgd', metrics=['accuracy'])
# Fit the model
model.fit(X, y, epochs=20000)

The input and output are one-hot vectors.

The question: How does Keras optimize the weights in this scenario, and what are the implications of not having an activation function in a hidden layer?

Kevin

2 Answers


Normally, a linear activation function is only applied to the last layer, for some regression problems. Of course, you can still use it in hidden layers of a multi-layer network. However, if you stack multiple linear layers next to each other, they act as a single linear layer, so you can't build a deep model that way. A linear activation function has a local gradient of 1, so the local gradient of a complete node is just its weight.
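
A quick NumPy sketch of why stacked linear layers collapse into a single one (the shapes and numbers are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(1, 4))        # one input sample with 4 features
W1 = rng.normal(size=(4, 2))       # first linear layer (no activation, no bias)
W2 = rng.normal(size=(2, 3))       # second linear layer (no activation, no bias)

two_layers = (x @ W1) @ W2         # output of the two stacked linear layers
one_layer = x @ (W1 @ W2)          # a single linear layer with weight W1 @ W2

print(np.allclose(two_layers, one_layer))  # True: the stack is equivalent to one layer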


Keras uses the automatic differentiation capabilities of Theano or TensorFlow (depending on which backend you are using), so Keras does not do anything special when a layer has no activation function.

Gradients are computed by Theano/TensorFlow, and they compute the correct ones.
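
For example, with the TensorFlow backend the gradient through a layer without an activation is simply the gradient of a matrix product. A rough sketch using tf.GradientTape (TensorFlow 2 API, shown only to illustrate the idea):

import tensorflow as tf

x = tf.constant([[1.0, 2.0, 3.0]])
W = tf.Variable([[0.1], [0.2], [0.3]])

with tf.GradientTape() as tape:
    y = tf.matmul(x, W)          # a Dense layer with no activation is just x*W
    loss = tf.reduce_sum(y ** 2)

grad = tape.gradient(loss, W)    # the backend differentiates automatically
print(grad)                      # d(loss)/dW = 2*y * x^T; no activation derivative appears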

Dr. Snoopy