
I have a question about the implementation of LSTMs in TensorFlow, especially in the context of seq2seq modelling (where you have an Encoder and a Decoder).

In short: when we learn a word embedding while using the seq2seq model, don't we end up with redundant weights?

A usual approach is to embed the input words (for the encoder) as word2vec-style vectors, but the model is also able to learn these embeddings itself. This means that when we set up the variables for the encoder, we have an additional embedding matrix (outside the LSTM) that encodes our vocabulary. My understanding of the LSTM node, or let's take the (simpler) RNN node, is that the following equation is applied:

$$\sigma(W \cdot x + U \cdot h + b)$$

where we have the dimensions

$$(n_{\text{hidden}} \times n_{\text{feature}}) \, (n_{\text{feature}} \times 1) + (n_{\text{hidden}} \times n_{\text{hidden}}) \, (n_{\text{hidden}} \times 1) + (n_{\text{hidden}} \times 1).$$

The value one (1) here can be replaced by the batch size, I believe (correct me if I'm wrong, please).
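To make my setup concrete, here is a minimal sketch of what I mean using tf.keras layers; vocab_size, embedding_dim, n_hidden and so on are just placeholder values I picked for illustration:

```python
import numpy as np
import tensorflow as tf

# Placeholder sizes, just for illustration
vocab_size = 10000      # number of words in the vocabulary
embedding_dim = 128     # size of the learned embedding (my n_feature)
n_hidden = 256          # LSTM state size (my n_hidden)
batch_size = 32
seq_len = 20

# The "extra" embedding matrix outside the LSTM: shape (vocab_size, embedding_dim)
embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)

# The encoder LSTM, which internally holds W (input kernel) and U (recurrent kernel)
encoder = tf.keras.layers.LSTM(n_hidden, return_state=True)

token_ids = np.random.randint(0, vocab_size, size=(batch_size, seq_len))
x = embedding(token_ids)                  # (batch_size, seq_len, embedding_dim)
outputs, state_h, state_c = encoder(x)    # states: (batch_size, n_hidden)

# The LSTM's input kernel acts on the already-embedded x; this is the W.x I mean
for w in encoder.weights:
    print(w.name, w.shape)   # kernel (embedding_dim, 4*n_hidden), recurrent_kernel, bias
```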

My worry is around the $W \cdot x$ part: in this case the vector $x$ is already an embedded vector, so calculating $W \cdot x$ feels redundant. If $x$ were a one-hot encoded vector, then it would make sense IMO.
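To show what I mean by redundant: with a one-hot input, the embedding lookup and the $W \cdot x$ multiplication are just two linear maps composed, so in principle they could be collapsed into a single matrix. A small numpy check with made-up sizes:

```python
import numpy as np

vocab_size, embedding_dim, n_hidden = 5, 3, 4
rng = np.random.default_rng(0)

E = rng.normal(size=(vocab_size, embedding_dim))   # embedding matrix (outside the RNN)
W = rng.normal(size=(n_hidden, embedding_dim))     # RNN input weights, (n_hidden x n_feature)

word_id = 2
one_hot = np.zeros(vocab_size)
one_hot[word_id] = 1.0

# W applied to the embedded word ...
via_embedding = W @ E[word_id]
# ... equals a single combined matrix applied to the one-hot vector
via_one_hot = (W @ E.T) @ one_hot

print(np.allclose(via_embedding, via_one_hot))  # True
```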

Can anyone tell me if my reasoning is sound, and whether I am understanding this ‘extra embedding’ for seq2seq models correctly?

EDIT: the only thing I can think of now is that you don't want to have the full embedding inside your LSTM for some reason (why exactly is unclear to me). But I can imagine that you lose some flexibility if you need to set the n_hidden dimension of the LSTM to the vocabulary size, for example when your vocabulary size changes, or that the training effort goes up. If anyone can confirm this, that would be great :)
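To make the training-effort part concrete, here is a back-of-the-envelope parameter count I did (the sizes are made up), comparing a separate embedding plus an LSTM on the embedded input with an LSTM that is fed one-hot vectors directly:

```python
# Made-up sizes, just to compare orders of magnitude
vocab_size, embedding_dim, n_hidden = 50_000, 300, 512

# Separate embedding matrix + LSTM acting on the embedded input
params_with_embedding = (
    vocab_size * embedding_dim                    # embedding matrix
    + 4 * (embedding_dim * n_hidden               # LSTM input kernel W (4 gates)
           + n_hidden * n_hidden                  # recurrent kernel U
           + n_hidden)                            # bias
)

# LSTM fed one-hot vectors directly (input dimension = vocab_size)
params_one_hot = 4 * (vocab_size * n_hidden + n_hidden * n_hidden + n_hidden)

print(params_with_embedding)   # ~16.7 million
print(params_one_hot)          # ~103.5 million
```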

zwep