
The Keras LSTM implementation outputs kernel weights, recurrent weights, and a single bias vector. I would have expected there to be a bias for both the kernel weights and the recurrent weights, so I am trying to make sure that I understand where this bias is being applied. Consider this randomly initialized example:

from keras.models import Sequential
from keras.layers import LSTM

test_model = Sequential()
test_model.add(LSTM(4, input_dim=5, input_length=10, return_sequences=True))
for e in zip(test_model.layers[0].trainable_weights, test_model.layers[0].get_weights()):
    print('Param %s:\n%s' % (e[0], e[1]))
    print(e[1].shape)

This will output something like the following:

Param <tf.Variable 'lstm_3/kernel:0' shape=(5, 16) dtype=float32_ref>:
[[-0.46578053 -0.31746995 -0.33488223  0.4640277  -0.46431816 -0.0852727
   0.43396038  0.12882692 -0.0822868  -0.23696694  0.4661569   0.4719978
   0.12041456 -0.20120585  0.45095628 -0.1172519 ]
 [ 0.04213512 -0.24420211 -0.33768272  0.11827284 -0.01744157 -0.09241
   0.18402642  0.07530934 -0.28586367 -0.05161515 -0.18925312 -0.19212383
   0.07093149 -0.14886391 -0.08835816  0.15116036]
 [-0.09760407 -0.27473268 -0.29974532 -0.14995047  0.35970795  0.03962368
   0.35579181 -0.21503082 -0.46921644 -0.47543833 -0.51497519 -0.08157375
   0.4575423   0.35909468 -0.20627108  0.20574462]
 [-0.19834137  0.05490702  0.13013887 -0.52255917  0.20565301  0.12259561
  -0.33298236  0.2399289  -0.23061508  0.2385658  -0.08770937 -0.35886696
   0.28242612 -0.49390298 -0.23676801  0.09713227]
 [-0.21802655 -0.32708862 -0.2184104  -0.28524712  0.37784815  0.50567037
   0.47393328 -0.05177036  0.41434419 -0.36551589  0.01406455  0.30521619
   0.39916915  0.22952956  0.40699703  0.4528749 ]]
(5, 16)
Param <tf.Variable 'lstm_3/recurrent_kernel:0' shape=(4, 16) dtype=float32_ref>:
[[ 0.28626361 -0.21708137 -0.18340513 -0.02943563 -0.16822724  0.38830781
  -0.50277489 -0.07898639 -0.30247116 -0.01375726 -0.34504923 -0.01373435
  -0.32458451 -0.03497506 -0.01305341  0.28398186]
 [-0.35822678  0.13861786  0.42913082  0.11312254 -0.1593778   0.58666271
   0.09238213 -0.24134786  0.2196856  -0.01660753 -0.01929135 -0.02324873
  -0.2000526  -0.07921806 -0.33966202 -0.08963238]
 [-0.06521184 -0.28180376  0.00445012 -0.32302913 -0.02236169 -0.00901215
   0.03330055  0.10727262  0.03839845 -0.58494729  0.36934188 -0.31894827
  -0.43042961  0.01130622  0.11946538 -0.13160609]
 [-0.31211731 -0.24986106  0.16157174 -0.27083701  0.14389414 -0.23260537
  -0.28311059 -0.17966864 -0.28650531 -0.06572254 -0.03313115  0.23230191
   0.13236329  0.44721091 -0.42978323 -0.09875761]]
(4, 16)
Param <tf.Variable 'lstm_3/bias:0' shape=(16,) dtype=float32_ref>:
[ 0.  0.  0.  0.  1.  1.  1.  1.  0.  0.  0.  0.  0.  0.  0.  0.]
(16,)

I grasp that the kernel weights are used for the linear transformation of the inputs, so they are of shape [input_dim, 4 * hidden_units], or in this case [5, 16], and the recurrent weights are used for the linear transformation of the recurrent state, so they are of shape [hidden_units, 4 * hidden_units]. The bias, on the other hand, is of shape [4 * hidden_units], so it is conceivable that it could be added to the recurrent transformation, but not to the input transformation. This example shows that the bias, as it is output here, can only be added to the recurrent state:

import numpy as np

embedding_dim = 5
hidden_units = 4

test_embedding = np.array([0.1, 0.2, 0.3, 0.4, 0.5])
kernel_weights = test_model.layers[0].get_weights()[0]     # shape (5, 16)
recurrent_weights = test_model.layers[0].get_weights()[1]  # shape (4, 16)
bias = test_model.layers[0].get_weights()[2]                # shape (16,)

initial_state = np.zeros((hidden_units, 1))

input_transformation = np.dot(np.transpose(kernel_weights), test_embedding[0])  # + bias or + np.transpose(bias) won't work
recurrent_transformation = np.dot(np.transpose(recurrent_weights), initial_state) + bias

print(input_transformation.shape)
print(recurrent_transformation.shape)

Looking at this blog post, biases are added at pretty much every step, so I'm still feeling pretty lost as to where this bias is being applied.

Can anybody help me clarify where the LSTM bias is being added?

reese0106
  • Yes, you're correct. The bias is added after the matrix multiply – c2huc2hu Sep 19 '17 at 18:08
  • But only after the recurrent transformation? – reese0106 Sep 19 '17 at 18:25
  • 1
    It's added at every step if that's what you're asking – c2huc2hu Sep 19 '17 at 18:32
  • I am asking at which matrix multiply it is added. There is the matrix multiply for the inputs (i.e. the kernel weights) as well as the matrix multiply for the recurrent transformation (i.e. the recurrent weights). Your first answer said it is added after the matrix multiply, but there are two matrix multiplications, so it's not clear which one you are referring to. – reese0106 Sep 19 '17 at 19:05

1 Answer


The bias is added in the recurrent cell after the matrix multiplies. It doesn't matter whether you think of it as added to the input transformation after its matmul or to the recurrent transformation after its matmul, because addition is commutative: $(W x_t + b) + U h_{t-1} = W x_t + (U h_{t-1} + b)$. See the LSTM equations below:

[image: LSTM equations]
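
For reference (a sketch of the standard LSTM formulation, which is the form Keras follows, with gates $i$, $f$, $o$ and candidate $\tilde{c}$), each gate has one bias, applied once after the input and recurrent products are summed:

$$i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)$$
$$f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)$$
$$\tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c)$$
$$o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$$
$$h_t = o_t \odot \tanh(c_t)$$

Each bias appears exactly once per gate inside the sum $W x_t + U h_{t-1} + b$, so grouping it with the input product or with the recurrent product gives the same result.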

c2huc2hu
  • Thanks for the response. I think I understand why I was getting confused. I had looked at these equations, and as you can see there are 4 biases (bz, bi, bf, bo), so I was expecting to see 4 biases. However, now that I see the bias shape is 4*hidden_dims, I realize that this single vector contains all 4 biases concatenated together, and this is why I was confused! – reese0106 Sep 19 '17 at 22:55
  • Yup, concatenating everything is a standard trick to improve efficiency – c2huc2hu Sep 20 '17 at 02:39
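
As a quick illustration of that concatenation (a sketch, assuming the Keras gate ordering of input, forget, cell, output), the single shape-(4 * hidden_units,) bias can be sliced back into the four per-gate biases. The block of ones visible in the printed bias above is the forget-gate slice, which Keras initializes to 1 by default (unit_forget_bias=True):

import numpy as np

hidden_units = 4
bias = test_model.layers[0].get_weights()[2]  # shape (4 * hidden_units,) == (16,)

# Keras stores the per-gate parameters concatenated in the order: input, forget, cell, output
b_i, b_f, b_c, b_o = np.split(bias, 4)

print(b_i)  # input gate bias       -> [0. 0. 0. 0.]
print(b_f)  # forget gate bias      -> [1. 1. 1. 1.] (unit_forget_bias=True by default)
print(b_c)  # candidate/cell bias   -> [0. 0. 0. 0.]
print(b_o)  # output gate bias      -> [0. 0. 0. 0.]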