Background:
If I am not mistaken, when training a network we feed forward, computing sigmoid(sum(W*x)) for every layer. Then in back-propagation we calculate the error and the deltas (the changes), then we calculate the gradients and update the weights.
Let's say we do not have an activation on one of the layers. How can Keras calculate the gradient? Does it just take the value of sum(W*x) * next_layer_delta * weights
to get the delta for the current layer and use this to calculate the gradients?
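To spell out what I mean by that formula, here is a minimal NumPy sketch of my understanding of one layer's backward step, first with a sigmoid and then with no activation at all (toy shapes and names are mine, this is not Keras internals):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# toy shapes: 4 inputs -> 3 units in this layer -> 2 units in the next layer
rng = np.random.default_rng(0)
x = rng.random(4)                  # input to this layer
W = rng.random((3, 4))             # this layer's weights
W_next = rng.random((2, 3))        # next layer's weights
delta_next = rng.random(2)         # pretend this delta came back from the next layer

# with a sigmoid: a = sigmoid(sum(W*x)), and sigmoid'(z) = a * (1 - a)
z = W @ x
a = sigmoid(z)
delta_sigmoid = (W_next.T @ delta_next) * a * (1 - a)

# with no activation the output is just z = sum(W*x), its derivative is 1,
# so the activation term simply drops out of the delta
delta_linear = (W_next.T @ delta_next) * 1.0

# either way the gradient for this layer's weights is the outer product with the input
grad_W = np.outer(delta_linear, x)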
Code:
I have this code which I wrote to create a word2vec model (skip-gram):
from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
# What does it mean for this layer not to have an activation?
# It is linear, because there is no non-linear function such as tanh!
model.add(Dense(2, input_dim=len(tokens_enc)))
model.add(Dense(len(tokens_enc), activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='sgd', metrics=['accuracy'])
# Fit the model
model.fit(X, y, epochs=20000)
The input and output are one-hot vectors.
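A toy example of what I mean by that (a hypothetical 4-token vocabulary standing in for len(tokens_enc), not my real preprocessing):

import numpy as np

vocab_size = 4  # stands in for len(tokens_enc)

def one_hot(index, size):
    v = np.zeros(size)
    v[index] = 1.0
    return v

# (center word, context word) index pairs from a skip-gram window
pairs = [(0, 1), (1, 0), (1, 2), (2, 3)]
X = np.array([one_hot(c, vocab_size) for c, _ in pairs])   # one-hot center words
y = np.array([one_hot(t, vocab_size) for _, t in pairs])   # one-hot context words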
The question: How does Keras optimize the weights in this scenario, and what are the implications of not having an activation function in a hidden layer?