The Keras LSTM implementation outputs kernel weights, recurrent weights, and a single bias vector. I would have expected a separate bias for the kernel weights and for the recurrent weights, so I am trying to make sure that I understand where this one bias is being applied. Consider this randomly initialized example:
import numpy as np
from keras.models import Sequential
from keras.layers import LSTM

test_model = Sequential()
test_model.add(LSTM(4, input_dim=5, input_length=10, return_sequences=True))

for e in zip(test_model.layers[0].trainable_weights, test_model.layers[0].get_weights()):
    print('Param %s:\n%s' % (e[0], e[1]))
    print(e[1].shape)
This will print something like the following:
Param <tf.Variable 'lstm_3/kernel:0' shape=(5, 16) dtype=float32_ref>:
[[-0.46578053 -0.31746995 -0.33488223 0.4640277 -0.46431816 -0.0852727
0.43396038 0.12882692 -0.0822868 -0.23696694 0.4661569 0.4719978
0.12041456 -0.20120585 0.45095628 -0.1172519 ]
[ 0.04213512 -0.24420211 -0.33768272 0.11827284 -0.01744157 -0.09241
0.18402642 0.07530934 -0.28586367 -0.05161515 -0.18925312 -0.19212383
0.07093149 -0.14886391 -0.08835816 0.15116036]
[-0.09760407 -0.27473268 -0.29974532 -0.14995047 0.35970795 0.03962368
0.35579181 -0.21503082 -0.46921644 -0.47543833 -0.51497519 -0.08157375
0.4575423 0.35909468 -0.20627108 0.20574462]
[-0.19834137 0.05490702 0.13013887 -0.52255917 0.20565301 0.12259561
-0.33298236 0.2399289 -0.23061508 0.2385658 -0.08770937 -0.35886696
0.28242612 -0.49390298 -0.23676801 0.09713227]
[-0.21802655 -0.32708862 -0.2184104 -0.28524712 0.37784815 0.50567037
0.47393328 -0.05177036 0.41434419 -0.36551589 0.01406455 0.30521619
0.39916915 0.22952956 0.40699703 0.4528749 ]]
(5, 16)
Param <tf.Variable 'lstm_3/recurrent_kernel:0' shape=(4, 16) dtype=float32_ref>:
[[ 0.28626361 -0.21708137 -0.18340513 -0.02943563 -0.16822724 0.38830781
-0.50277489 -0.07898639 -0.30247116 -0.01375726 -0.34504923 -0.01373435
-0.32458451 -0.03497506 -0.01305341 0.28398186]
[-0.35822678 0.13861786 0.42913082 0.11312254 -0.1593778 0.58666271
0.09238213 -0.24134786 0.2196856 -0.01660753 -0.01929135 -0.02324873
-0.2000526 -0.07921806 -0.33966202 -0.08963238]
[-0.06521184 -0.28180376 0.00445012 -0.32302913 -0.02236169 -0.00901215
0.03330055 0.10727262 0.03839845 -0.58494729 0.36934188 -0.31894827
-0.43042961 0.01130622 0.11946538 -0.13160609]
[-0.31211731 -0.24986106 0.16157174 -0.27083701 0.14389414 -0.23260537
-0.28311059 -0.17966864 -0.28650531 -0.06572254 -0.03313115 0.23230191
0.13236329 0.44721091 -0.42978323 -0.09875761]]
(4, 16)
Param <tf.Variable 'lstm_3/bias:0' shape=(16,) dtype=float32_ref>:
[ 0. 0. 0. 0. 1. 1. 1. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
(16,)
I understand that the kernel weights are used for the linear transformation of the inputs, so they have shape [input_dim, 4 * hidden_units], or [5, 16] in this case, and that the recurrent weights are used for the linear transformation of the previous hidden state, so they have shape [hidden_units, 4 * hidden_units]. The bias, on the other hand, has shape [4 * hidden_units], so it is conceivable that it could be added to the recurrent transformation, but not to the input transformation. The following example shows that the bias, as it is output here, can only be added to the recurrent state:
embedding_dim = 5
hidden_units = 4

test_embedding = np.array([0.1, 0.2, 0.3, 0.4, 0.5])
kernel_weights = test_model.layers[0].get_weights()[0]      # shape (5, 16)
recurrent_weights = test_model.layers[0].get_weights()[1]   # shape (4, 16)
bias = test_model.layers[0].get_weights()[2]                # shape (16,)
initial_state = np.zeros((hidden_units, 1))

input_transformation = np.dot(np.transpose(kernel_weights), test_embedding[0])  # + bias or + np.transpose(bias) won't work
recurrent_transformation = np.dot(np.transpose(recurrent_weights), initial_state) + bias

print(input_transformation.shape)
print(recurrent_transformation.shape)
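For what it's worth, if I instead keep everything as row vectors and only add the bias after summing the two transformations, the shapes do line up, which only adds to my confusion about whether this single vector is meant to serve both. Here is a minimal shape-only sketch, assuming the four gates are computed from one concatenated pre-activation (that assumption is exactly what I'm unsure about):

import numpy as np

input_dim, hidden_units = 5, 4
x_t = np.array([0.1, 0.2, 0.3, 0.4, 0.5])   # one input timestep, shape (5,)
h_prev = np.zeros(hidden_units)             # previous hidden state, shape (4,)

kernel = np.random.rand(input_dim, 4 * hidden_units)               # (5, 16)
recurrent_kernel = np.random.rand(hidden_units, 4 * hidden_units)  # (4, 16)
bias = np.zeros(4 * hidden_units)                                  # (16,)

# Both matrix products come out as shape (16,), so the bias broadcasts
# against either one of them, or against their sum.
z = x_t @ kernel + h_prev @ recurrent_kernel + bias                # (16,)
gates = np.split(z, 4)                                             # four (4,) blocks
print(z.shape, [g.shape for g in gates])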
Looking at this blog post, there are biases added at pretty much every step, so I'm still feeling pretty lost as to where this single bias is being applied.
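For reference, the formulation I have in mind from that post looks roughly like the sketch below, with a separate bias per gate (the variable names are mine, not Keras's), which is why getting back a single (16,) vector surprised me:

import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

hidden_units, input_dim = 4, 5
x_t = np.random.rand(input_dim)            # input at one timestep
h_prev = np.zeros(hidden_units)            # previous hidden state
concat = np.concatenate([h_prev, x_t])     # [h_{t-1}, x_t], shape (9,)

# One weight matrix and one bias per gate
W_f, W_i, W_c, W_o = (np.random.rand(hidden_units, hidden_units + input_dim) for _ in range(4))
b_f, b_i, b_c, b_o = (np.zeros(hidden_units) for _ in range(4))

f_t = sigmoid(W_f @ concat + b_f)      # forget gate
i_t = sigmoid(W_i @ concat + b_i)      # input gate
c_hat = np.tanh(W_c @ concat + b_c)    # candidate cell state
o_t = sigmoid(W_o @ concat + b_o)      # output gate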
Can anybody help me clarify where the LSTM bias is being added?